[infinispan-dev] ISPN-200

Manik Surtani manik at jboss.org
Fri Sep 17 04:53:18 EDT 2010


I think the whole point of ISPN-200 and being able to distribute the query itself is that we don't need to distribute the indexes.  Each node maintains indexes only for the data stored on that node locally.  

Agreed that reindexing during a rehash could be a challenge though.
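
Roughly the flow I have in mind - a sketch only, where ClusterTransport is a
made-up placeholder for whatever RPC mechanism we end up using; only the
Lucene IndexSearcher calls are real API:

import java.util.*;
import org.apache.lucene.search.*;

public class BroadcastQuery {

   // Hypothetical abstraction over the cluster transport: sends the query to
   // every node and collects each node's local TopDocs.
   interface ClusterTransport {
      List<TopDocs> broadcast(Query query, int maxHitsPerNode);
   }

   // Runs on the node that received the user query: merge the per-node hits.
   List<ScoreDoc> search(ClusterTransport transport, Query query, int maxHits) {
      List<ScoreDoc> all = new ArrayList<ScoreDoc>();
      for (TopDocs perNode : transport.broadcast(query, maxHits)) {
         all.addAll(Arrays.asList(perNode.scoreDocs));
      }
      // Scores from different nodes are only roughly comparable (each local
      // index has its own idf), but as a first cut sort by score and trim.
      Collections.sort(all, new Comparator<ScoreDoc>() {
         public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(b.score, a.score);
         }
      });
      return all.subList(0, Math.min(maxHits, all.size()));
   }

   // Runs on every node: query only the node-local index.
   TopDocs searchLocal(IndexSearcher localSearcher, Query query, int maxHits)
         throws Exception {
      return localSearcher.search(query, maxHits);
   }
}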

On 16 Sep 2010, at 18:21, Sanne Grinovero wrote:

> 2010/9/15 Navin Surtani <nsurtani at redhat.com>:
>> 
>>> 
>>> Also, sending the query as a broadcast is nice for a first
>>> implementation, but it means we can scale the number of searched
>>> items without being able to scale the number of queries. This should
>>> be designed in such a way that a future improvement could send the
>>> query to only a subset of the nodes.
>>> 
>> 
>> Well, I don't know of any other way to run the query, to be honest. If
>> you imagine that you have several nodes running ISPN in DIST mode, and
>> each node has its own local index, we don't necessarily know where all
>> of the objects are. Each node has got a share of all the objects and
>> possibly a share of the indexes, depending on config. So I don't see
>> how we can optimise where the query is run.
>> 
>> Naturally, things get easier when your index is a central one that all
>> nodes have access to. That works more simply, because you can just run
>> the query on one node and then, once you have the QueryHits, call a
>> Cache.get() on all the nodes. I think :S.
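>>
>> For that central-index case I mean something like the following rough
>> sketch - assuming each indexed Document stores the cache key in a
>> stored field named "key" (the field name and the generics are made up):
>>
>> import java.util.*;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.search.*;
>> import org.infinispan.Cache;
>>
>> public class CentralIndexQuery {
>>    // Query the single shared index, then resolve the actual values from
>>    // the distributed cache, one Cache.get() per hit.
>>    List<Object> runQuery(IndexReader sharedIndexReader,
>>                          Cache<String, Object> cache,
>>                          Query query, int maxHits) throws Exception {
>>       IndexSearcher searcher = new IndexSearcher(sharedIndexReader);
>>       TopDocs hits = searcher.search(query, maxHits);
>>       List<Object> results = new ArrayList<Object>(hits.scoreDocs.length);
>>       for (ScoreDoc hit : hits.scoreDocs) {
>>          String key = searcher.doc(hit.doc).get("key");
>>          results.add(cache.get(key));  // DIST mode fetches from the owner
>>       }
>>       return results;
>>    }
>> }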
> 
> I didn't understand how the indexes are assumed to be spread across
> the nodes; I assume we all implicitly agree that the index should be
> split into several pieces, which are then replicated several times
> across the cluster, in a similar way to DIST.
> So if we use sharding, many nodes hold "shard A" and many others hold
> "shard B"; when you perform a query you don't have to broadcast it, but
> only make sure you get an answer from any single node having a copy of
> A and from any single node having a copy of B - that way you don't get
> duplicates and you don't necessarily involve all nodes.
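>
> To make the "one answer per shard" part concrete, a rough sketch only -
> the shard-to-owners map is hypothetical bookkeeping, however we end up
> maintaining it:
>
> import java.util.*;
> import org.infinispan.remoting.transport.Address;
>
> public class ShardRouter {
>    private final Random random = new Random();
>
>    // Pick exactly one replica per shard, so no shard is answered twice
>    // and not every node has to be involved in the query.
>    Set<Address> pickTargets(Map<String, List<Address>> ownersPerShard) {
>       Set<Address> targets = new HashSet<Address>();
>       for (List<Address> owners : ownersPerShard.values()) {
>          targets.add(owners.get(random.nextInt(owners.size())));
>       }
>       return targets;
>    }
> }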
> 
> Conversely, if each node owns a local-only index of whatever
> information it happens to be storing, then the content of each index is
> not deterministic, and all nodes would have to re-index everything when
> rehashing: that doesn't seem the way to go.
> 
> But I think we can do much better; this is just a set of ideas right
> now, which hopefully could work.
> The design of a segment in Lucene is quite similar to a shard. This
> means we could make good use of Lucene's optimisation and
> segment-merging features, and if each node holds a full segment you
> might even be able to predict where to send the queries, as the
> structure is similar to a balanced tree.
> 
> Keep in mind that the current Infinispan Lucene Directory chunks each
> segment into pieces when it grows larger than a user-defined threshold.
> This is suboptimal, as Lucene isn't aware of it, but you can also
> reconfigure Lucene's LogByteSizeMergePolicy with a threshold that
> avoids segments larger than that same value, so as to really avoid
> chunks and distribute whole segments. The chunking system is just meant
> to ease the setup and stay compatible with existing software using
> Lucene, as a drop-in replacement, but it's not necessarily the way to
> go for an efficient Query implementation.
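>
> For instance, a rough sketch of that configuration (assuming the Lucene
> 3.1-style IndexWriterConfig API; on older versions the same knob is
> LogByteSizeMergePolicy.setMaxMergeMB() set via
> IndexWriter.setMergePolicy(), and the 64MB figure is arbitrary - it only
> has to line up with whatever chunk threshold the Directory uses):
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.*;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.util.Version;
>
> public class NonChunkedIndexWriter {
>    // 'directory' would be the Infinispan Lucene Directory, configured
>    // with a chunk threshold large enough that no segment gets chunked.
>    static IndexWriter open(Directory directory, Analyzer analyzer)
>          throws Exception {
>       LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
>       // Segments already larger than this are not merged further; to stay
>       // strictly below the chunk threshold, pick a value around
>       // (chunk threshold / mergeFactor).
>       mergePolicy.setMaxMergeMB(64.0);
>       IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_31, analyzer);
>       conf.setMergePolicy(mergePolicy);
>       return new IndexWriter(directory, conf);
>    }
> }
>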
> I think we should think about implementing a distributed Query on top
> of a distributed index of non-chunked segments, as you can predict
> which segments you're going to hit, and even in which order, so you can
> send the Query to the appropriate nodes only, and also collect results
> lazily, avoiding out-of-memory problems when sorting.
> I assume it's possible in Infinispan to know which nodes own a specific
> key, and to send queries in round-robin to those nodes; each node can
> then open its segment using a segment reader and fulfil the request, as
> this segment is likely to be local (always, except during a rehash, and
> even then it would still work).
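>
> As a rough sketch of that routing - assuming the DistributionManager of
> the cache backing the index is reachable, and that keying the cache by
> segment name is how the index ends up being laid out (both assumptions):
>
> import java.util.*;
> import org.infinispan.AdvancedCache;
> import org.infinispan.distribution.DistributionManager;
> import org.infinispan.remoting.transport.Address;
>
> public class SegmentRouter {
>    private int counter = 0;
>
>    // For each segment, find which nodes own its data in the index cache
>    // and pick one of them in round-robin to run the query for that segment.
>    Map<String, Address> route(AdvancedCache<?, ?> indexCache,
>                               List<String> segmentNames) {
>       DistributionManager dm = indexCache.getDistributionManager();
>       Map<String, Address> targetPerSegment = new HashMap<String, Address>();
>       for (String segmentName : segmentNames) {
>          List<Address> owners = dm.locate(segmentName);
>          targetPerSegment.put(segmentName, owners.get(counter++ % owners.size()));
>       }
>       return targetPerSegment;
>    }
> }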
> 
> Cheers,
> Sanne
> 
>> 
>> 
>> --
>> Navin Surtani
>> Intern Infinispan

--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
