[infinispan-dev] Design change in Infinispan Query

Wed Feb 26 09:33:00 EST 2014

On 02/26/2014 04:20 PM, Mircea Markus wrote:
> On Feb 26, 2014, at 2:13 PM, Dan Berindei <dan.berindei at gmail.com> wrote:
>
>>
>>
>> On Wed, Feb 26, 2014 at 3:12 PM, Mircea Markus <mmarkus at redhat.com> wrote:
>>
>> On Feb 25, 2014, at 5:08 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
>>
>>> There also is the opposite problem to be considered, as Emmanuel
>>> suggested on 11/04/2012:
>>> you can't forbid the user to store the same object (same type and same
>>> id) in two different caches, where each Cache might be using different
>>> indexing options.
>>>
>>> If the "search service" is a global concept, and you run a query which
>>> matches object X, we'll return it to the user but he won't be able to
>>> figure out from which cache it's being sourced: is that ok?
>> Can't the user figure that out based on the way the query is built?
>> I mean the problem is similar with databases: if address is both a table and a column in the USER table, then it's the query (select) that determines where the address is returned from.
>>
>> You mean the user should specify the cache name(s) when building the query?
> yes
Let's say multiple caches are specified when building the query. How can
I tell (with the current result API) where the matching entity comes
from? I still think we should extend the result API to provide:
1. the key of the entity, 2. the name of the originating cache. The old
result API, which just gives you an Iterator<Object> over the matches,
should continue to exist because it's more efficient for the cases when
the user does not need #1 and #2.
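
A minimal sketch of what such an extended result could look like, in Java.
Every name here (QueryHitSketch, QueryHit, searchWithSource, searchValues,
demoCaches) is hypothetical and not part of the Infinispan query API; plain
maps stand in for caches and a Predicate stands in for the query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are part of the Infinispan API.
// Plain maps stand in for caches, a Predicate stands in for the query.
final class QueryHitSketch {

    /** One match, carrying (1) the entry key and (2) the originating cache name. */
    static final class QueryHit {
        final String cacheName;
        final Object key;
        final Object value;

        QueryHit(String cacheName, Object key, Object value) {
            this.cacheName = cacheName;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return cacheName + "[" + key + "] = " + value;
        }
    }

    /** Extended result: every match records its key and source cache. */
    static List<QueryHit> searchWithSource(Map<String, Map<Object, Object>> caches,
                                           Predicate<Object> query) {
        List<QueryHit> hits = new ArrayList<>();
        for (Map.Entry<String, Map<Object, Object>> cache : caches.entrySet()) {
            for (Map.Entry<Object, Object> entry : cache.getValue().entrySet()) {
                if (query.test(entry.getValue())) {
                    hits.add(new QueryHit(cache.getKey(), entry.getKey(), entry.getValue()));
                }
            }
        }
        return hits;
    }

    /** Lightweight result: matching values only, no per-hit wrapper allocated. */
    static List<Object> searchValues(Map<String, Map<Object, Object>> caches,
                                     Predicate<Object> query) {
        List<Object> values = new ArrayList<>();
        for (Map<Object, Object> cache : caches.values()) {
            for (Object value : cache.values()) {
                if (query.test(value)) {
                    values.add(value);
                }
            }
        }
        return values;
    }

    /** Two toy caches that both hold the value "X" under different keys. */
    static Map<String, Map<Object, Object>> demoCaches() {
        Map<Object, Object> cacheA = new LinkedHashMap<>();
        cacheA.put("k1", "X");
        Map<Object, Object> cacheB = new LinkedHashMap<>();
        cacheB.put("k2", "X");
        cacheB.put("k3", "Y");
        Map<String, Map<Object, Object>> caches = new LinkedHashMap<>();
        caches.put("cacheA", cacheA);
        caches.put("cacheB", cacheB);
        return caches;
    }

    public static void main(String[] args) {
        Predicate<Object> query = "X"::equals;
        for (QueryHit hit : searchWithSource(demoCaches(), query)) {
            System.out.println(hit);
        }
        System.out.println(searchValues(demoCaches(), query));
    }
}
```

With the extended form, a caller can tell the two matches for "X" apart by
cache name and key; the values-only form returns the same matches without
that bookkeeping.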

>
>> With a database you have to go a bit out of your way to select from more than one table at a time; normally you have just one primary table that you select from, and the others are just there to help you filter and transform that table. You also have to add some information about the source table yourself if you need it, otherwise the DB won't tell you which table the results are coming from:
>>
>> SELECT 'table1' AS source, id FROM table1
>> UNION ALL
>> SELECT 'table2' AS source, id FROM table2
>>
>> Adrian tells me our current query API doesn't allow us to do projections with synthetic columns. On the other hand, we need to extend the current API to give us the entry key anyway, so it would be easy to extend it to give us the name of the cache as well.
>>
>>
>>> Ultimately this implies a query might return the same object X in
>>> multiple positions in the result list of the query; for example it
>>> might be the top result according to some criteria but also be the 5th
>>> result because of how it was indexed in a different cache: maybe
>>> someone will find good use for this "capability" but I see it
>>> primarily as a source of confusion.
>> Curious whether the source of the data can be specified within the query.
>>
>> Right, the user should be able to scope a search to a single cache, or maybe to multiple caches, even if there is only one global index.
>>
>> But I think the same object can already be inserted twice in the same cache, only with a different key, so returning duplicates from a query is something the user already has to cope with.
>>
>>
>>> Finally, if we move the search service as a global component, there
>>> might be an impact in how we explain security: an ACL filter applied
>>> on one cache - or the index metadata produced by that cache - might
>>> not be applied in the same way by an entity being matched through a
>>> second cache.
>>> Not least a user's permission to access one cache (or not) will affect
>>> his results in a rather complex way.
>> I'll let Tristan comment more on this, but is this really different from an SQL database where you grant access on individual tables and run a query involving multiple of them?
>>
>> The difference would be that in a DB each table will have its own index(es), so they only have to check the permissions once and not for every row.
>>
>> OTOH, if we plan to support key-level permissions, that would require checking the permissions on each search result anyway, so this wouldn't cost us anything.
>>   
>>
>>> I'm wondering if we need to prevent such situations.
>>>
>>> Sanne
>>>
>>> On 25 February 2014 16:24, Mircea Markus <mmarkus at redhat.com> wrote:
>>>> On Feb 25, 2014, at 3:46 PM, Adrian Nistor <anistor at gmail.com> wrote:
>>>>
>>>>> They can do what they please. Either put multiple types in one basket or put them in separate caches (one type per cache). But allowing / recommending is one thing, mandating it is a different story.
>>>>>
>>>>> There's no reason to forbid _any_ of these scenarios / mandate one over the other! There was previously in this thread some suggestion of mandating the one type per cache usage. -1 for it
>>>> Agreed. I actually don't see how we can prevent people who declare Cache<Object,Object> from putting whatever they want in it. It also makes total sense for smaller caches, as it is easy to set up, etc.
>>>> The debate in this email, the way I understood it, was: are people using (or should they use) multiple caches for storing data? If yes, we should consider query functionality that spans multiple caches.
>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 5:08 PM, Mircea Markus <mmarkus at redhat.com> wrote:
>>>>>
>>>>> On Feb 25, 2014, at 9:28 AM, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>>>>>
>>>>>>> On 24 Feb 2014, at 17:39, Mircea Markus <mmarkus at redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Feb 17, 2014, at 10:13 PM, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>>>>>>>>
>>>>>>>> By the way, Mircea, Sanne and I had quite a long discussion about this one and the idea of one cache per entity. It turns out that the right (as in easy) solution does involve a higher level programming model like OGM provides. You can simulate it yourself using the Infinispan APIs but it is just cumbersome.
>>>>>>> Curious to hear the whole story :-)
>>>>>>> We cannot mandate that all users use OGM though, one of the reasons being that OGM is not platform independent (hotrod).
>>>>>> Then solve all the issues I have raised with a magic wand and come back to me when you have done it, I'm interested.
>>>>> People are going to use infinispan with one cache per entity, because it makes sense:
>>>>> - different config (repl/dist | persistent/non-persistent) for different data types
>>>>> - have map/reduce tasks running only on the Person entries, not on Dog as well, when you want to select (Person) where age > 18
>>>>> I don't see a reason to forbid this, on the contrary. The way I see it, the relation is (OGM, ISPN) <=> (Hibernate, JDBC). Indeed OGM would be a better abstraction and should be recommended as such for the Java clients, but ultimately we're a general purpose storage engine that is available to different platforms as well.
>>>>>
>>>>>
> Cheers,