[infinispan-dev] ISPN 200

Sanne Grinovero sanne.grinovero at gmail.com
Thu Sep 16 13:21:09 EDT 2010


2010/9/15 Navin Surtani <nsurtani at redhat.com>:
>
>>
>> Also sending the query in broadcast is nice for a first
>> implementation, but this means you can scale the number of searched
>> items but we can't scale the number of queries, this should be
>> designed in such a way to make it possible in a future improvement to
>> send the query only to a subset of the nodes.
>>
>
> Well I don't know of any other way to run the query to be honest. If you
> imagine that you have several nodes running ISPN in DIST mode, and each
> node has its own local index - we don't necessarily know where all of
> the objects are. Each of them has got a share of all the objects and
> possibly a share of the indexes, depending on config. So I don't see how
> we can optimise where the query is run.
>
> Naturally, things get easier when your index is a central one and all
> nodes have access to it. That's simpler because you can run the query
> on one node and then, once you have the QueryHits, call a Cache.get()
> on all the nodes. I think :S.

I don't follow how it's being assumed that the indexes are spread
across the nodes; I assume we all agree implicitly that the index
should be split into several pieces, which are then replicated several
times across the cluster, in a similar way to DIST.
So if we use sharding, many nodes hold "shard A" and many others hold
"shard B"; when you perform a query you don't have to broadcast it,
you only need an answer from any single node holding a copy of A and
from any single node holding a copy of B - so you don't get duplicates
and don't necessarily involve all nodes.
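
Something along these lines, just a sketch with made-up names (the
owners map and the node identifiers are illustrative, not existing
API), to show picking one node per shard instead of broadcasting:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: given which nodes own a copy of each shard,
// pick exactly one owner per shard so every shard is queried once
// and no node is contacted needlessly.
class ShardQueryRouter {

   private final Map<String, List<String>> shardOwners; // e.g. "A" -> [node1, node3]
   private final Random random = new Random();

   ShardQueryRouter(Map<String, List<String>> shardOwners) {
      this.shardOwners = shardOwners;
   }

   Map<String, String> pickOneOwnerPerShard() {
      Map<String, String> targets = new HashMap<String, String>();
      for (Map.Entry<String, List<String>> shard : shardOwners.entrySet()) {
         List<String> copies = shard.getValue();
         targets.put(shard.getKey(), copies.get(random.nextInt(copies.size())));
      }
      return targets;
   }
}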

Going the opposite way, if each node owns a local-only index of
whatever information it happens to be storing, then the index content
is not deterministic and all nodes would have to re-index everything
when rehashing: that doesn't seem the way to go.

But I think we can do much better; these are just rough ideas for
now, which hopefully could work.
A Lucene segment is designed much like a shard. This means we could
make good use of Lucene's optimisation and segment-merging features,
and if each node holds a full segment you might even be able to
predict where to send the queries, much as with a balanced tree.
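
For example, Lucene already exposes the index as independent
per-segment readers (a Lucene 2.9/3.x-style sketch just to illustrate
the point; exact method names vary across versions):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class SegmentLister {

   // Each sub-reader corresponds to one segment, so a segment could be
   // treated (and distributed) much like a shard.
   public static void listSegments(Directory dir) throws Exception {
      IndexReader topReader = IndexReader.open(dir, true); // read-only
      try {
         IndexReader[] segmentReaders = topReader.getSequentialSubReaders();
         for (IndexReader segment : segmentReaders) {
            System.out.println("segment holding " + segment.maxDoc() + " documents");
         }
      } finally {
         topReader.close();
      }
   }
}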

Keep in mind that the current Infinispan Lucene Directory chunks each
segment into pieces when it grows beyond a user-defined threshold.
This is suboptimal, as Lucene isn't aware of it; but you can also
configure Lucene's LogByteSizeMergePolicy with a threshold at this
same value, so segments never grow past it, chunking never kicks in,
and whole segments get distributed. The chunking system is really just
meant to ease setup and keep compatibility with existing software
using Lucene, as a drop-in replacement, but it's not necessarily the
way to go for an efficient Query implementation.
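
As a rough sketch (assuming a 16 MB chunk threshold on the Directory;
this uses the Lucene 3.1+ style merge policy constructor, and how the
policy is installed on the IndexWriter depends on the Lucene version):

import org.apache.lucene.index.LogByteSizeMergePolicy;

public class NoChunkingMergePolicy {

   // Cap merged segment size at the same value as the Infinispan Directory
   // chunk threshold, so segments are never split into chunks and each one
   // can be distributed as a whole.
   public static LogByteSizeMergePolicy capSegmentsAtChunkSize(double chunkSizeMB) {
      LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
      mergePolicy.setMaxMergeMB(chunkSizeMB); // e.g. 16.0 to match a 16 MB chunk size
      return mergePolicy;
   }
}
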
I think we should look at implementing a distributed Query on top of
a non-chunked, distributed-segments index: you can predict which
segments you're going to hit, and even in which order, so you can send
the Query to the appropriate nodes only, collect results lazily, and
avoid running out of memory on sorting. I assume Infinispan can tell
us which nodes own a specific key, so queries can be sent to those
nodes in round-robin; each node can then open its segment with a
segment reader and fulfill the request, since that segment is most
likely local (always, except during a rehash, but even then it still
works).
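
Roughly like this (assuming an Infinispan 4.x-style DistributionManager
to locate the owners of a key; the actual remote query dispatch is left
out and would be new code):

import java.util.List;

import org.infinispan.Cache;
import org.infinispan.distribution.DistributionManager;
import org.infinispan.remoting.transport.Address;

public class SegmentQueryRouter {

   private final Cache<?, ?> cache;

   public SegmentQueryRouter(Cache<?, ?> cache) {
      this.cache = cache;
   }

   // Ask Infinispan which nodes own the key identifying a segment, then
   // rotate over the owners (round-robin) so repeated queries spread the load.
   public Address pickOwnerForSegment(Object segmentKey, int attempt) {
      DistributionManager dm = cache.getAdvancedCache().getDistributionManager();
      List<Address> owners = dm.locate(segmentKey);
      return owners.get(attempt % owners.size());
   }
}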

Cheers,
Sanne

>
>
> --
> Navin Surtani
> Intern Infinispan
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

