[infinispan-dev] ISPN 200
Sanne Grinovero
sanne.grinovero at gmail.com
Fri Sep 17 08:48:41 EDT 2010
2010/9/17 Manik Surtani <manik at jboss.org>:
> I think the whole point of ISPN-200 and being able to distribute the query itself is that we don't need to distribute the indexes. Each node maintains indexes only for the data stored on that node locally.
that depends on the point of view :)
As I personally see it you are still distributing the indexes, as each
node has a piece of all information stored in the grid, and to obtain
a result it's mandatory to merge them all - so this is clearly "a part
of the index".
The problem is that it's undefined which part it is, so you can't take
any advantage from it and have to browse all indexes to find matches:
the term "index" doesn't seem appropriate any more.
With this structure, as I said, you need de-duplication and a broadcast
to all nodes, which is suboptimal since it seems possible to avoid both.
I don't know what an optimal solution would look like, but we would gain
at least by avoiding these two issues, which seems possible by taking
advantage of the segment structure - then you would really be able to
scale.
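To make the contrast concrete, here is a minimal Java sketch of the two strategies: a broadcast whose merged hits must be de-duplicated, and a routed query that asks exactly one owner per shard. All class, method and node names are hypothetical; nothing here is Infinispan or Lucene API, and per-node hits are modelled as plain lists of cache keys.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; none of these names are Infinispan or Lucene API. */
final class QueryRoutingSketch {

    /**
     * Broadcast approach: every node answers. An entry may be indexed on
     * several nodes (one per replica), so the merged hits need de-duplication.
     */
    static Set<String> broadcast(Map<String, List<String>> hitsPerNode) {
        Set<String> merged = new LinkedHashSet<>(); // a Set drops duplicate keys
        for (List<String> hits : hitsPerNode.values()) {
            merged.addAll(hits);
        }
        return merged;
    }

    /**
     * Shard-routed approach: choose one owner per shard, so each shard is
     * searched exactly once and not every node has to be contacted.
     * 'attempt' rotates the choice among the owners (round-robin).
     * Assumes every shard has at least one owner.
     */
    static Map<String, String> pickOneOwnerPerShard(
            Map<String, List<String>> ownersPerShard, int attempt) {
        Map<String, String> targets = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : ownersPerShard.entrySet()) {
            List<String> owners = e.getValue();
            targets.put(e.getKey(), owners.get(Math.floorMod(attempt, owners.size())));
        }
        return targets;
    }
}
```

With shard A owned by n1 and n2, and shard B owned by n2 and n3, the routed variant contacts two nodes per query instead of three, and successive attempts spread the load across the owners.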
I'm not against solving ISPN-200 by using the proposed approach; at
least we will have some code able to send a query command returning a
streaming result, so we can later improve on it by trying out better
approaches; but I hope it can be built in such a way that we can later
reshape it to avoid broadcasts.
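As an illustration of what consuming a streaming result could look like on the querying node, the sketch below lazily merges per-node result streams. The names (`LazyMergeSketch`, `Hit`, `merge`) are hypothetical and not part of Infinispan or Lucene. It assumes each node returns its hits already sorted by descending score, and it does not de-duplicate.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical sketch; not Infinispan or Lucene API. */
final class LazyMergeSketch {

    /** A hit as a node might report it: cache key plus relevance score. */
    static final class Hit {
        final String key;
        final float score;
        Hit(String key, float score) { this.key = key; this.score = score; }
    }

    /** One per-node stream together with its current head element. */
    private static final class Cursor {
        Hit head;
        final Iterator<Hit> rest;
        Cursor(Hit head, Iterator<Hit> rest) { this.head = head; this.rest = rest; }
    }

    /**
     * Merges streams that are each already sorted by descending score.
     * Only one pending hit per node is held in memory at any time.
     */
    static Iterator<Hit> merge(List<Iterator<Hit>> perNode) {
        // Highest score first.
        Comparator<Cursor> byScoreDesc =
                (a, b) -> Float.compare(b.head.score, a.head.score);
        PriorityQueue<Cursor> queue = new PriorityQueue<>(byScoreDesc);
        for (Iterator<Hit> it : perNode) {
            if (it.hasNext()) {
                queue.add(new Cursor(it.next(), it));
            }
        }
        return new Iterator<Hit>() {
            @Override public boolean hasNext() { return !queue.isEmpty(); }

            @Override public Hit next() {
                Cursor c = queue.poll();
                if (c == null) throw new NoSuchElementException();
                Hit result = c.head;
                if (c.rest.hasNext()) {
                    // Pull the next hit from that node only when it is needed.
                    c.head = c.rest.next();
                    queue.add(c);
                }
                return result;
            }
        };
    }
}
```

Because the caller pulls hits one at a time, it can stop after the first page of results without the remaining hits ever being collected or sorted in one place.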
Cheers,
Sanne
>
> Agreed that reindexing during a rehash could be a challenge though.
The goal is to have an efficient query, right?
>
> On 16 Sep 2010, at 18:21, Sanne Grinovero wrote:
>
>> 2010/9/15 Navin Surtani <nsurtani at redhat.com>:
>>>
>>>>
>>>> Also, sending the query in broadcast is nice for a first
>>>> implementation, but it means we can scale the number of searched
>>>> items while we can't scale the number of queries. This should be
>>>> designed in such a way that a future improvement can send the
>>>> query only to a subset of the nodes.
>>>>
>>>
>>> Well I don't know of any other way to run the query to be honest. If you
>>> imagine that you have several nodes running ISPN in DIST mode, and each
>>> node has its own local index - we don't necessarily know where all of
>>> the objects are. Each of them has got a share of all the objects and
>>> possibly a share of the indexes, depending on config. So I don't see how
>>> we can optimise where the query is run.
>>>
>>> Naturally, things get easier when your index is a central one and all
>>> nodes have access to it. That is simpler because you can just
>>> run the query on one node and then once you have the QueryHits you can
>>> then call a Cache.get() on all the nodes. I think :S.
>>
>> I didn't understand how it's assumed the indexes are spread across the
>> nodes; I assume we all agree implicitly that indexes should be split
>> in several pieces, which are then replicated many times across the
>> cluster, so in a similar way to DIST.
>> So in case we use sharding many nodes have the "shard A" and many
>> others have the "shard B", so when you perform a query you don't have
>> to broadcast it but ensure you get an answer from any single node
>> having a copy of A and also from any single node having a copy of B -
>> so you don't get duplicates and don't necessarily involve all nodes.
>>
>> Conversely, if each node owns a local-only index of all the
>> information it happens to be storing, then the content is not
>> deterministic and all nodes have to re-index everything when
>> rehashing: that doesn't seem the way to go.
>>
>> But I think we can do much better; just a set of ideas right now,
>> hopefully they could work;
>> The design of a segment in Lucene is quite similar to a shard. This
>> means we could make good use of the optimisation and segment merging
>> features of Lucene, and finally if each node has a full segment you
>> might even be able to predict where to send the queries as it's
>> similar to a balanced tree.
>>
>> Keep in mind that the current Infinispan Lucene Directory will chunk
>> each segment into pieces when it grows larger than a user-defined
>> threshold. This is suboptimal as Lucene isn't aware of it, but you can
>> also reconfigure Lucene's LogByteSizeMergePolicy to set a threshold
>> that avoids segments larger than this same value, so as to really avoid
>> chunks and have full-segment distribution. So this chunking system is
>> just meant to ease the setup and be compatible with existing software
>> using Lucene, as a drop-in replacement, but it is not necessarily the
>> way to go for an efficient Query implementation.
>> I think we should consider implementing a distributed Query on top
>> of a non-chunked distributed segments index, as you can predict which
>> segments you're going to hit, and even in which order, so you can send
>> the Query to the appropriate nodes only, and also collect results
>> lazily, avoiding out-of-memory issues when sorting.
>> I assume it's possible in Infinispan to know which nodes own a
>> specific key, and send queries in round-robin to these; then each node
>> can open its segment using the segment reader and fulfill the
>> request, as this segment is likely local (always, unless a rehash is
>> in progress, and even then it still works).
>>
>> Cheers,
>> Sanne
>>
>>>
>>>
>>> --
>>> Navin Surtani
>>> Intern Infinispan
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>
> --
> Manik Surtani
> manik at jboss.org
> Lead, Infinispan
> Lead, JBoss Cache
> http://www.infinispan.org
> http://www.jbosscache.org
>