[infinispan-dev] [hibernate-dev] Distributed queries

Mon Sep 21 04:06:05 EDT 2009

Hello
See inline

On 20 sept. 09, at 06:01, Ray Hilton wrote:

> Hi guys,
>
> I've been following the distributed query stuff with interest, but
> this is the first time I'm posting, so please excuse the lack of
> intimate knowledge of Infinispan.  Basically, I have been working on a
> project that could really do with the Holy Grail of a distributed
> query-able cache and I really liked the look of using JBossCache + the
> Lucene Directory implementation that Manik wrote a while back.  I then
> noticed Infinispan and talk of building querying directly into the
> project and figured that it would be worthwhile waiting to see how
> that panned out.
>
> I've thought a bit about how something like this might work. I'm not
> sure if this will be in any way helpful, but here goes: I guess there
> are two approaches: 1) store the index (or partitioned indices) in
> the grid and sync it to a node to do a particular query, or 2) each
> node has an index for the data it currently caches.  We preferred the
> second idea as it offers a natural way to partition the indices (i.e.
> however Infinispan is configured to do it).  The first option would
> mean you end up with either a monolithic index in the grid, or
> partitions based on, say, date, that have to be sync'd en masse to
> whichever node(s) are doing a query.  I realise that the second
> technique would produce duplicates, but I'm sure there would be a way
> to eliminate dupes based on the object's uuid (something I'm pretty
> sure Infinispan already has a notion of).

Well, 2 looks nicer, but I don't know an obvious way to solve the
duplication issues:
  - returning the same content several times alters the scoring of
other documents
  - it prevents efficient pagination, as you somehow need to skip over
the duplicated results.
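
To make the pagination point concrete, here is a minimal Python sketch.
It is not Lucene or Infinispan code: the two node hit lists, the keys and
the scores are invented for illustration, and the function names are
hypothetical. Each node returns hits sorted by descending score, and key
"b" happens to be indexed on both nodes.

```python
import heapq

# Hypothetical per-node results as (key, score), best hit first.
# Key "b" is cached, and therefore indexed, on both nodes.
node_a = [("a", 0.9), ("b", 0.8), ("c", 0.4)]
node_b = [("b", 0.8), ("d", 0.7), ("e", 0.2)]


def merged_hits(per_node_hits):
    """Merge already-sorted per-node hit lists by descending score."""
    return list(heapq.merge(*per_node_hits, key=lambda hit: -hit[1]))


def raw_page(per_node_hits, offset, size):
    """Slice the merged list directly: duplicates leak into the pages."""
    return merged_hits(per_node_hits)[offset:offset + size]


def deduplicated_page(per_node_hits, offset, size):
    """Drop repeated keys first, which forces a scan from the first hit."""
    seen, unique = set(), []
    for key, score in merged_hits(per_node_hits):
        if key not in seen:
            seen.add(key)
            unique.append((key, score))
    return unique[offset:offset + size]


# Second page of two hits each.
print(raw_page([node_a, node_b], 2, 2))           # [('b', 0.8), ('d', 0.7)]
print(deduplicated_page([node_a, node_b], 2, 2))  # [('d', 0.7), ('c', 0.4)]
```

The raw slice shows "b" again on page two, while the de-duplicated
version has to walk every earlier hit to know how many to skip, which is
the inefficiency I mean. The sketch says nothing about the scoring side
of the problem.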

>
> We would also need to come up with a way of normalising the scoring
> across all partitions (regardless of which method is used).  I have
> seen this done before, and it would basically involve, per-query,
> finding out the term frequency of the various keywords across the
> entire index, or at least enough of it to produce a representative
> value.  This would be used to calculate the score for each hit when
> doing the actual search, and thus the ranking.

I believe Lucene does normalize the score properly when using the  
remote IndexSearcher as the normalization is done on the "client" side.
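
As a rough illustration of what "normalization on the client side" can
mean, here is a small Python sketch. It is an assumption-laden model, not
Lucene's code: the partition statistics are invented, the function names
are hypothetical, and the idf formula is the classic
1 + ln(numDocs / (docFreq + 1)) form as I remember it from Lucene's
default similarity.

```python
import math


def idf(doc_freq, num_docs):
    """Classic idf form: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics reported by two index partitions.
partitions = [
    {"num_docs": 1000, "doc_freq": {"grid": 10}},
    {"num_docs": 1000, "doc_freq": {"grid": 400}},
]


def local_idf(partition, term):
    """idf computed from one partition's own statistics only."""
    return idf(partition["doc_freq"].get(term, 0), partition["num_docs"])


def global_idf(all_partitions, term):
    """idf computed by the client from statistics summed over partitions."""
    total_docs = sum(p["num_docs"] for p in all_partitions)
    total_freq = sum(p["doc_freq"].get(term, 0) for p in all_partitions)
    return idf(total_freq, total_docs)


for p in partitions:
    print("local idf:", round(local_idf(p, "grid"), 3))
print("global idf:", round(global_idf(partitions, "grid"), 3))
```

With purely local statistics the same term gets a different weight on
each partition, so scores are not comparable across nodes. Summing the
document frequencies first and weighting every hit with the single global
value is the idea behind doing it on the client.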

>
> We have had issues with index corruption in the past as well (probably
> due to programming bugs rather than Lucene).  Making each node
> responsible for its own index will make it very easy to throw corrupt
> indices away and re-generate new ones.
>
> I did take a look at the visitor stuff in Infinispan before, but I
> wasn't really sure where the best place to hook into would be to find
> out which objects are being stored locally or evicted.  If someone has
> a good idea of where to start, I'd be happy to lend a hand to this
> effort!
>
> Ray
>
>
> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
> <michael.neale at gmail.com> wrote:
>> I think you just stuck a pin in the bubble that normally says "magic
>> happens here" ;)
>>
>> How much of this did you tackle regarding hibernate search that could
>> be applied here?
>>
>> (your final point re duplication may have some "flexibility", I
>> think?)
>>
>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>> <emmanuel at hibernate.org> wrote:
>>> Neither 1 nor 2 imply *distributed* queries.
>>>
>>> The hard parts with distributed queries (i.e. executed on a grid and
>>> recomposed) are:
>>>  - making sure you ask all the nodes where the index is distributed
>>> (you can't miss a node)
>>>  - finding a way to index only a subset of the data in a given index
>>> (on a given node). Applying the Infinispan distribution routine to the
>>> InfinispanDirectory does not do that; it chunks data arbitrarily.
>>>  - being able to rebuild a given index on a given node (i.e.
>>> remembering which elements were indexed)
>>>  - finding a way to distribute your data without duplication. If a key
>>> is indexed multiple times, then you end up with duplicated results
>>> that can't trivially be de-duplicated.
>>>
>>> Happy thinking.
>>>
>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>
>>>> 2009/9/17 Michael Neale <michael.neale at gmail.com>:
>>>>> I am still not entirely sure what I am asking, but I look forward to
>>>>> your merged-in changes (they are in another branch right now, yes?).
>>>>>
>>>>> Yes, I mean querying objects - I was under the impression that Lucene
>>>>> was used for the indexing of the data to service these queries?
>>>>
>>>> Sure, to clarify: there's work going on on two different aspects,
>>>> which complement each other in the ideal setup:
>>>>
>>>> 1) Be able to query a Lucene index (wherever you store that) to find
>>>> objects which are located inside Infinispan; this is about how to
>>>> search them and how to keep the index in sync with Infinispan's
>>>> content.
>>>>
>>>> 2) Store a Lucene index inside Infinispan instead of, for example,
>>>> the filesystem. In this case we're not concerned about what you
>>>> index; the Lucene interface is the usual one and you should be able
>>>> to replace the Directory implementation in existing applications.
>>>>
>>>> So 1) is the branch you've found, and Navin is working on that; 2)
>>>> is not yet in Subversion. The latest patch is attached to the other
>>>> thread by Łukasz, and is to be applied on Hibernate Search's trunk
>>>> (and depends on Infinispan).
>>>>
>>>>>
>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>> <nsurtani at redhat.com> wrote:
>>>>>>
>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>
>>>>>>> oh ok nice - could you point me at which branch to try to find
>>>>>>> some tests to play with?
>>>>>>
>>>>>> If you're talking about querying objects in Infinispan:
>>>>>>
>>>>>> The eventual goal is to be able to have different configurations for
>>>>>> how you want to index your data. Manik has given me the 'OK' to
>>>>>> push a simple query interface for CR1 for Monday/Tuesday.
>>>>>>
>>>>>> I'm kind of pressed with getting the code working for this, and
>>>>>> between moving house and the lack of internet there I'll be a bit
>>>>>> quiet. However, I'll get a wiki up by the end of the week about how
>>>>>> this all works.
>>>>>>
>>>>>> However, if you're not, then I assume you're talking about using
>>>>>> Lucene to index into Infinispan?
>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>> <sanne.grinovero at gmail.com> wrote:
>>>>>>>> 2009/9/16 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>> regarding indexing and queries - is the current aim to not
>>>>>>>>> require that the index for the entire data grid exist on a
>>>>>>>>> single node?
>>>>>>>>>
>>>>>>>>> (asking as a curious potential user who is wrestling with Lucene
>>>>>>>>> indexes at the moment).
>>>>>>>>
>>>>>>>> Yes, the concept is to store the Lucene index itself in the grid,
>>>>>>>> so it will be distributed, and the segments you use most get
>>>>>>>> cached locally.
>>>>>>>> At the moment you have to select only one node to write to the
>>>>>>>> index, but all other nodes should be able to read.
>>>>>>>> Feel free to test it, as we need feedback.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Michael D Neale
>>>>>>>>> home: www.michaelneale.net
>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>> _______________________________________________
>>>>>>>>> infinispan-dev mailing list
>>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>
>>>>>>
>>>>>> Navin Surtani
>>>>>>
>>>>>> Intern Infinispan
>>>>>> Intern JBoss Cache Searchable
>>>>>>
>>>>
>>>> _______________________________________________
>>>> hibernate-dev mailing list
>>>> hibernate-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>
>
>
>
> -- 
> Ray Hilton
> -
>         email: ray at wirestorm.net
> melbourne: +61 (0) 3 9077 0513
>       mobile: +61 (0) 430 484 708