[infinispan-dev] About ISPN-200 Distributed Queries

Wed May 5 05:27:29 EDT 2010

Hi there

On 4 May 2010, at 20:42, Israel Lacerra wrote:

> I'm studying ISPN-200 because I'm thinking about tackling this issue for my M.Sc. topic. I'd like to ask a couple of questions about it (and maybe they don't make sense):
> 
> - Currently, if we have "-Dinfinispan.query.indexLocalOnly=true" the indexes are just local, right? And with "-Dinfinispan.query.indexLocalOnly=false", the indexes are globally shared. Am I right?

Not exactly.  Lucene handles and stores the indexes, and whether they are private or shared depends on how you configure Lucene, not on this switch.  There are 2 scenarios.  Scenario 1: each node has its own private, non-shared set of indexes.  Scenario 2: there is a single shared, global index (perhaps stored on NFS, etc.) that every node writes to and updates.

Now the switch in Infinispan controls which node(s) in the cluster actually do the indexing whenever data changes in the cluster.  If you have configured Lucene to maintain non-shared indexes, then *every* node in the cluster needs to update its own private index whenever any entry changes, anywhere in the cluster.  -Dinfinispan.query.indexLocalOnly=false will force Infinispan nodes to index changes that happen anywhere in the cluster.

If the indexes are global and shared, then there is no need for each node to update the indexes.  Only the node that initiated the change should update the indexes, and -Dinfinispan.query.indexLocalOnly=true will force this behaviour.  
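
To make the two settings concrete, here is a minimal sketch of the decision each node makes when it sees a change.  This is only a model of the behaviour described above, not Infinispan's actual code: the class IndexingDecision, the method shouldIndex and the originLocal flag are names made up for illustration.

```java
// Hypothetical model of the indexLocalOnly switch; not Infinispan's implementation.
final class IndexingDecision {

    /**
     * @param indexLocalOnly value of -Dinfinispan.query.indexLocalOnly
     * @param originLocal    true if the change was initiated on this node (assumed flag)
     * @return true if this node should update its index for the change
     */
    static boolean shouldIndex(boolean indexLocalOnly, boolean originLocal) {
        // indexLocalOnly=false: index every change, wherever it originated.
        // indexLocalOnly=true:  index only changes this node initiated.
        return !indexLocalOnly || originLocal;
    }

    public static void main(String[] args) {
        // Boolean.getBoolean reads the system property; it is false when unset.
        boolean indexLocalOnly = Boolean.getBoolean("infinispan.query.indexLocalOnly");
        System.out.println("local change indexed:  " + shouldIndex(indexLocalOnly, true));
        System.out.println("remote change indexed: " + shouldIndex(indexLocalOnly, false));
    }
}
```

With non-shared indexes you want the "false" row (every node indexes everything); with a shared index you want the "true" row, so that a change is written to the shared index once.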

> - So, how will ISPN-200 work with these two possibilities? 

As for ISPN-200, this is part of what we need to think about.  Ideally, each node would maintain neither a full private index nor a shared global one, but a fragment of the global index that pertains to just the data it owns.  That is the only approach that will truly scale.  So, assume we have this setup with 4 nodes:

Caches: {A, B, C, D}

Keys and the nodes that own them:

K1 -> {A, B}
K2 -> {B, C}
K3 -> {C, D}

A's index would have {K1}
B's index would have {K1, K2}
C's index would have {K2, K3}
D's index would have {K3}
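
The mapping from key ownership to per-node index fragments can be sketched as below.  This is just an illustration of the example above: the class IndexFragments and the method fragmentsByNode are hypothetical names, and a real implementation would hold Lucene documents rather than bare key names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: derive each node's index fragment from key ownership.
final class IndexFragments {

    /** Inverts "key -> owning nodes" into "node -> keys it must index". */
    static Map<String, Set<String>> fragmentsByNode(Map<String, List<String>> ownersByKey) {
        Map<String, Set<String>> fragments = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : ownersByKey.entrySet()) {
            for (String node : entry.getValue()) {
                // A node indexes a key only if it owns a copy of that key.
                fragments.computeIfAbsent(node, n -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        Map<String, List<String>> ownersByKey = new LinkedHashMap<>();
        ownersByKey.put("K1", Arrays.asList("A", "B"));
        ownersByKey.put("K2", Arrays.asList("B", "C"));
        ownersByKey.put("K3", Arrays.asList("C", "D"));
        // prints {A=[K1], B=[K1, K2], C=[K2, K3], D=[K3]}
        System.out.println(fragmentsByNode(ownersByKey));
    }
}
```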

So if we were to write a query that matches K1, that query would be sent to every node in the cluster and the results returned would look like:

A: {K1}
B: {K1}
C: {}
D: {}

Similarly, if we were to write a query that matches K1 and K2, that query would be sent to every node in the cluster and the results returned would look like:

A: {K1}
B: {K1, K2}
C: {K2}
D: {}

Now the tricky part will be to efficiently collate these partial results into a proper result set to pass back to the user, including removing duplicates, proper ranking and ordering, etc.
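
A naive version of that collation step might look like the sketch below.  Everything in it is an assumption for illustration: the ResultCollator and Hit names are made up, each node is assumed to return a relevance score per key, and those scores are assumed to be comparable across nodes (which separate index fragments do not automatically guarantee).  Duplicates are removed by keeping the best score seen for each key, and the merged list is ordered by score, highest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of collating per-node partial results; not part of Infinispan.
final class ResultCollator {

    /** One match reported by one node: the key and an assumed relevance score. */
    static final class Hit {
        final String key;
        final double score;

        Hit(String key, double score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Merges partial results, drops duplicate keys and orders by score (descending). */
    static List<String> collate(Collection<List<Hit>> partialResults) {
        Map<String, Double> bestScoreByKey = new HashMap<>();
        for (List<Hit> partial : partialResults) {
            for (Hit hit : partial) {
                // The same key can arrive from several owners; keep its highest score.
                bestScoreByKey.merge(hit.key, hit.score, Double::max);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(bestScoreByKey.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            // Break ties by key so the ordering is deterministic.
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            keys.add(entry.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        // Partial results for the "K1 and K2" query above, with made-up scores.
        List<List<Hit>> partials = Arrays.asList(
                Arrays.asList(new Hit("K1", 0.9)),                     // A
                Arrays.asList(new Hit("K1", 0.9), new Hit("K2", 0.5)), // B
                Arrays.asList(new Hit("K2", 0.5)),                     // C
                Collections.<Hit>emptyList());                         // D
        System.out.println(collate(partials)); // prints [K1, K2]
    }
}
```

This gathers everything in memory on the querying node, which is fine for an illustration but is exactly the part that would need to be made efficient (paging, streaming, limiting results per node, etc.).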

Hope this helps!

Cheers
Manik

--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org