[infinispan-dev] [hibernate-dev] Distributed queries

Mon Sep 21 08:59:24 EDT 2009

Yes, point taken.

Is there perhaps a way to index an object on only one node?  For
example, if each node knew there were currently 3 copies, and it was
the node with the lowest id, it would index the document.
When a new node joins or a node fails, the strategy is re-applied and
the node-local indices are updated accordingly.
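
To make the idea concrete, here is a rough sketch in plain Java.  It
assumes each node can see the ids of the nodes currently holding a copy
of a key; the class and method names are made up for illustration and
are not Infinispan APIs.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not an Infinispan API): picks the one owner that indexes a key. */
final class IndexOwnership {

    private IndexOwnership() {}

    /** The indexing node is the lowest id among the nodes currently holding a copy. */
    static int indexingNode(List<Integer> ownerIds) {
        if (ownerIds.isEmpty()) {
            throw new IllegalArgumentException("key has no owners");
        }
        return Collections.min(ownerIds);
    }

    /** True only if the local node holds a copy and has the lowest id of all holders. */
    static boolean shouldIndex(int localNodeId, List<Integer> ownerIds) {
        return ownerIds.contains(localNodeId)
                && indexingNode(ownerIds) == localNodeId;
    }
}
```

With copies on nodes 3, 5 and 8, only node 3 would index the object.
If node 3 fails and the copies end up on 5, 8 and 9, re-running the
same check makes node 5 responsible, so it would add the document to
its local index.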

On Mon, Sep 21, 2009 at 6:06 PM, Emmanuel Bernard
<emmanuel at hibernate.org> wrote:
> Hello
> See inline
>
> On 20 sept. 09, at 06:01, Ray Hilton wrote:
>
>> Hi guys,
>>
>> I've been following the distributed query stuff with interest, but
>> this is the first time I'm posting, so please excuse the lack of
>> intimate knowledge of Infinispan.  Basically, I have been working on a
>> project that could really do with the Holy Grail of a distributed
>> query-able cache and I really liked the look of using JBossCache + the
>> Lucene Directory implementation that Manik wrote a while back.  I then
>> noticed Infinispan and talk of building querying directly into the
>> project and figured that it would be worthwhile waiting to see how
>> that panned out.
>>
>> I've thought a bit about how something like this might work, I'm not
>> sure if this will be in any way helpful, but here goes:  I guess there
>> are two approaches:  1) store the index (or partitioned indices) in
>> the grid and sync it to a node to do a particular query or 2) each
>> node has an index for the data it currently caches.  We preferred the
>> second idea as it offers a natural way to partition the indices (i.e.
>> however Infinispan is configured to do it).  The first option would
>> mean you end up with either a monolithic index in the grid, or
>> partitions based on, say, date, that have to be sync'd en masse to
>> whichever node(s) are doing a query.  I realise that the second
>> technique would produce duplicates, but I'm sure there would be a way
>> to eliminate dupes based on the object's uuid (something I'm pretty
>> sure Infinispan already has a notion of).
>
> Well 2 looks nicer but I don't know an obvious way to solve the
> duplication issues:
>  - returning the same content several times does alter the scoring of
> other documents
>  - it prevents efficient pagination, as you somehow need to skip over
> several results.
>
>>
>> We would also need to come up with a way of normalising the scoring
>> across all partitions (regardless of which method is used).  I have
>> seen this done before, and it would basically involve, per-query,
>> finding out the term frequency of the various keywords across the
>> entire index, or at least enough of it to produce a representative
>> value.  This would be used to calculate the score for each hit when
>> doing the actual search, and thus the ranking.
>
> I believe Lucene does normalize the score properly when using the
> remote IndexSearcher as the normalization is done on the "client" side.
>
>>
>> We have had issues with index corruption in the past as well (probably
>> due to programming bugs rather than Lucene).  Making each node
>> responsible for its own index will make it very easy to throw corrupt
>> indices away and re-generate new ones.
>>
>> I did take a look at the visitor stuff in Infinispan before, but I
>> wasn't really sure where the best place to hook into would be to find
>> out which objects are being stored locally or evicted.  If someone has
>> a good idea of where to start, I'd be happy to lend a hand to this
>> effort!
>>
>> Ray
>>
>>
>> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
>> <michael.neale at gmail.com> wrote:
>>> I think you just stuck a pin in the bubble that normally says "magic
>>> happens here" ;)
>>>
>>> How much of this did you tackle regarding hibernate search that could
>>> be applied here?
>>>
>>> (your final point re duplication may have some "flexibility" I
>>> think ?)
>>>
>>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>>> <emmanuel at hibernate.org> wrote:
>>>> Neither 1 nor 2 imply *distributed* queries.
>>>>
>>>> The hard parts with distributed queries (ie executed on a grid and
>>>> recomposed) are:
>>>>  - making sure you ask all the nodes where the index is distributed
>>>> (you can't miss a node)
>>>>  - find a way to index only a subset of the data in a given index
>>>> (on a given node). Applying the Infinispan distribution routine to
>>>> the InfinispanDirectory does not do that, it chunks data arbitrarily.
>>>>  - be able to rebuild a given index on a given node (ie remember
>>>> which elements were indexed)
>>>>  - you need to find a way to distribute your data without
>>>> duplication. If a key is indexed multiple times, then you end up
>>>> with duplicated results that can't trivially be de-duplicated.
>>>>
>>>> Happy thinking.
>>>>
>>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>>
>>>>> 2009/9/17 Michael Neale <michael.neale at gmail.com>:
>>>>>> I am still not entirely sure what I am asking, but look forward
>>>>>> to your merged-in changes (they are in another branch right now,
>>>>>> yes?).
>>>>>>
>>>>>> Yes I mean querying objects - I was under the impression that
>>>>>> Lucene was used for the indexing of the data to service these
>>>>>> queries?
>>>>>
>>>>> Sure, to clarify: there's work going on on two different aspects,
>>>>> which complement each other in the ideal setup:
>>>>>
>>>>> 1) Be able to query a Lucene index (wherever you store that) to
>>>>> find objects which are located inside Infinispan; this is about
>>>>> how to search them and how to keep the index in sync with
>>>>> Infinispan's content.
>>>>>
>>>>> 2) Store a Lucene index inside Infinispan, instead of, for example,
>>>>> the filesystem. In this case we're not concerned about what you
>>>>> index, the Lucene interface is the usual one and you should be
>>>>> able to replace the Directory implementation in existing
>>>>> applications.
>>>>>
>>>>> So 1) is the branch you've found, and Navin is working on that; 2)
>>>>> is not yet in subversion, the latest patch is attached to the
>>>>> other thread by Łukasz, and is to be applied on Hibernate Search's
>>>>> trunk (and depends on Infinispan).
>>>>>
>>>>>>
>>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>>> <nsurtani at redhat.com> wrote:
>>>>>>>
>>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>>
>>>>>>>> oh ok nice - could you point me at which branch to try to find
>>>>>>>> some
>>>>>>>> tests to play with?
>>>>>>>
>>>>>>> If you're talking about Querying objects in Infinispan: -
>>>>>>>
>>>>>>> The eventual goal is to be able to have different
>>>>>>> configurations on how you want to index your data. Manik has
>>>>>>> given me the 'OK' to push a simple query interface for CR1 for
>>>>>>> Monday/Tuesday.
>>>>>>>
>>>>>>> I'm kind-of pressed with getting the code working for this and
>>>>>>> also, between moving house and lack of internet there, I'll be a
>>>>>>> bit quiet. However, I'll get a wiki up by the end of the week
>>>>>>> about how this all works.
>>>>>>>
>>>>>>> However if you're not then I assume you're talking about using
>>>>>>> Lucene to index into Infinispan?
>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>>> <sanne.grinovero at gmail.com> wrote:
>>>>>>>>> 2009/9/16 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>> regarding indexing and queries - is the current aim to not
>>>>>>>>>> require that the index for the entire data grid exist on a
>>>>>>>>>> single node?
>>>>>>>>>>
>>>>>>>>>> (asking as a potential user who is wrestling with Lucene
>>>>>>>>>> indexes at the moment and is curious).
>>>>>>>>>
>>>>>>>>> Yes the concept is to store the Lucene index itself in the
>>>>>>>>> grid, so it will be distributed, and the segments you use most
>>>>>>>>> get cached locally.
>>>>>>>>> At the moment you have to select only one node to write to the
>>>>>>>>> index, but all other nodes should be able to read.
>>>>>>>>> Feel free to test it as we need feedback.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Michael D Neale
>>>>>>>>>> home: www.michaelneale.net
>>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>>> _______________________________________________
>>>>>>>>>> infinispan-dev mailing list
>>>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Michael D Neale
>>>>>>>> home: www.michaelneale.net
>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>
>>>>>>> Navin Surtani
>>>>>>>
>>>>>>> Intern Infinispan
>>>>>>> Intern JBoss Cache Searchable
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Michael D Neale
>>>>>> home: www.michaelneale.net
>>>>>> blog: michaelneale.blogspot.com
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> hibernate-dev mailing list
>>>>> hibernate-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Michael D Neale
>>> home: www.michaelneale.net
>>> blog: michaelneale.blogspot.com
>>>
>>
>>
>>
>> --
>> Ray Hilton
>> -
>>         email: ray at wirestorm.net
>> melbourne: +61 (0) 3 9077 0513
>>       mobile: +61 (0) 430 484 708
>>
>
>

-- 
Ray Hilton
-
         email: ray at wirestorm.net
 melbourne: +61 (0) 3 9077 0513
       mobile: +61 (0) 430 484 708