I agree with Emmanuel's approach, you have to be lazy but in chunks
for it, so you're able to advance selected streams you're being sent
and keep back the others according to selected sorting: also remember
a sorted result stream might not fit in the client's memory.
Lucene users are hit often by the memory requirements of a sort
operation, this would be very useful.
Also sending the query in broadcast is nice for a first
implementation, but this means you can scale the number of searched
items but we can't scale the number of queries, this should be
designed in such a way to make it possible in a future improvement to
send the query only to a subset o the nodes.
Cheers,
Sanne
2010/9/15 Emmanuel Bernard <emmanuel(a)hibernate.org>:
Ah yes sorry guys scratch what I said before. I forgot you were
distributing
the query load and were fully resolving the results before sending them back
(ie not the two step process done by default - load sorted ids then load the
necessary objects)
Well it's possible to have an algorithm that does something like that:
- load the first few results of each node in sorted order
- do a comparison between them on the client and return the highest results
as long as there are other results of lower value queued up coming from all
the other nodes (or that the end of result for a given node is reached)
- if you are not in this situation, wait for results
One of the problem with this algo is that the server has to keep the next
results ready (has to remember) unless you push everything on the client wo
waiting the next() calls which makes it not so lazy anymore :).
Note that in practice, it does not work for relevance unless the dataset is
known to be evenly distributed (term and stats wise) and we can compare the
score value (which is a good first approximation until Sanne think more
about the alternative he has explored with some other guys).
You also need to deduplicate data.
But I would do early perf comparisons between a solution where we load keys
to the client and then load objects vs this super-lazy approach. I'm not
convince the later is better.
On 15 sept. 2010, at 14:04, Israel Lacerra wrote:
>I think if we allways block iterator.next() until we have at least a value
> of each node , we can apply sort in this values (since every node already
> made this).
I mean, the idea is to make a "merge" of the results... and the next()
blocks until we have some results already sorted.
Emmanuel, do you think this can work?
On Wed, Sep 15, 2010 at 8:54 AM, Israel Lacerra <israeldl(a)gmail.com> wrote:
>
> >You give up any hope of implementing sorting (explicit or using
> > relevance).
>
> I think if we allways block iterator.next() until we have at least a value
> of each node , we can apply sort in this values (since every node already
> made this). That's what you said, Navin?
>
> On Wed, Sep 15, 2010 at 8:16 AM, Navin Surtani <nsurtani(a)redhat.com>
> wrote:
>>
>> Isn't that an issue that comes with lots of hits in a lazyIterator
>> anyway?
>>
>> I mean, if you're only going to load up a 'few hits at a time' then
you
>> can in theory only sort the few hits that you have loaded up anyway
>> right? :S
>>
>> On 15/09/10 12:01, Emmanuel Bernard wrote:
>> > You give up any hope of implementing sorting (explicit or using
>> > relevance).
>> >
>> > On 15 sept. 2010, at 12:16, Manik Surtani wrote:
>> >
>> >> How about something like:
>> >>
>> >> - Broadcast the query
>> >> - Every node creates the QueryHits inst, runs the query and collects
>> >> results. Starts streaming the results back immediately.
>> >> - The lazy iterator returns to the user immediately, and maintains an
>> >> internal cache of results coming in from N remote QueryHits instances
>> >> - iterator.next() blocks until this cache has available entries to
>> >> return.
>> >> - on an implementation level, the GetHItsCommand (or something like
>> >> that) could return with a single hit, or N hits, with a flag of
>> >> whether more hits are available or not.
>> >>
>> >> Does that help?
>> >>
>> >> Manik
>> >>
>> >>
>> >> On 14 Sep 2010, at 20:46, Israel Lacerra wrote:
>> >>
>> >>> Hi guys,
>> >>>
>> >>> I'm still thinking about it, but I don't have a really good
idea
>> >>> about the lazy iterator yet. The only way (that I see) I could make
>> >>> it more lazily is:
>> >>>
>> >>> - Broadcast the query.
>> >>> - Every node creates a QueryHits instance with the query and keep
it
>> >>> in a simple little cache (array, hash, etc)
>> >>> - A "state" of the query is created and every
lazyIterator.next()
>> >>> must send a command to a node and get the next hit (the next key).
>> >>> - After a certain time, the instances of queryHits
"dies".
>> >>>
>> >>> It seems to me that this is not too efficient. But I don't have
any
>> >>> other ideas.
>> >>>
>> >>> Do you have any suggestions about it?
>> >>>
>> >>> thanks!
>> >>> Israel
>> >>>
>>
>>
>> --
>> Navin Surtani
>> Intern Infinispan
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev(a)lists.jboss.org
>>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev