[infinispan-dev] [hibernate-dev] Distributed queries

Mon Sep 21 09:09:54 EDT 2009

I'm guessing that something similar already happens so that Infinispan
can re-jig the data around the grid.  Forgive my lack of intimate
knowledge of how Infinispan works here, but at some point the data
that was hosted by a bad node needs to be redistributed?

On Mon, Sep 21, 2009 at 11:02 PM, Emmanuel Bernard
<emmanuel at hibernate.org> wrote:
> Could be possible. That would likely be chatty, though, each time a node
> comes or goes.
> Typically, when a node goes down, potentially due to a network error, you
> don't wanna be chatty, I imagine ;)
>
> On 21 sept. 09, at 14:59, Ray Hilton wrote:
>
>> Yes, point taken.
>>
>> Is there perhaps a way to index an object on only one node?  For
>> example, if each node knew there were currently 3 copies, and it was
>> the node with the lowest id, it would index the document.
>> When a new node joins or a node fails, the strategy is re-applied and
>> the node-local indices are updated accordingly.
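A minimal sketch of the "lowest id indexes" rule described above, assuming every node can see the current list of owners for each key it holds. The names `keys_to_index` and `on_view_change` and the shape of the owner map are hypothetical, not Infinispan API; plain integers stand in for node ids.

```python
def keys_to_index(owners_by_key, local_id):
    """Keys for which the local node has the lowest id among current owners."""
    return {
        key
        for key, owners in owners_by_key.items()
        if local_id in owners and local_id == min(owners)
    }


def on_view_change(old_owners, new_owners, local_id):
    """Re-apply the rule after a node joins or fails.

    Returns (to_add, to_remove): keys to start indexing locally and keys
    to drop from the node-local index.
    """
    before = keys_to_index(old_owners, local_id)
    after = keys_to_index(new_owners, local_id)
    return after - before, before - after


# Hypothetical view: key -> ids of the nodes currently holding a copy.
old_view = {"a": [1, 2, 3], "b": [2, 3, 4]}
# Node 1 fails and its copy of "a" is re-homed on node 4.
new_view = {"a": [2, 3, 4], "b": [2, 3, 4]}

to_add, to_remove = on_view_change(old_view, new_view, local_id=2)
print(to_add, to_remove)
```

Each node can evaluate the rule locally without extra messages, but every key in `to_add` still has to be re-indexed after a membership change, so the cost of churn does not disappear.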
>>
>> On Mon, Sep 21, 2009 at 6:06 PM, Emmanuel Bernard
>> <emmanuel at hibernate.org> wrote:
>>> Hello
>>> See inline
>>>
>>> On 20 sept. 09, at 06:01, Ray Hilton wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I've been following the distributed query stuff with interest, but
>>>> this is the first time I'm posting, so please excuse the lack of
>>>> intimate knowledge of Infinispan.  Basically, I have been working on
>>>> a project that could really do with the Holy Grail of a distributed
>>>> query-able cache, and I really liked the look of using JBossCache +
>>>> the Lucene Directory implementation that Manik wrote a while back.
>>>> I then noticed Infinispan and talk of building querying directly
>>>> into the project and figured that it would be worthwhile waiting to
>>>> see how that panned out.
>>>>
>>>> I've thought a bit about how something like this might work. I'm not
>>>> sure if this will be in any way helpful, but here goes.  I guess
>>>> there are two approaches:  1) store the index (or partitioned
>>>> indices) in the grid and sync it to a node to do a particular query,
>>>> or 2) each node has an index for the data it currently caches.  We
>>>> preferred the second idea as it offers a natural way to partition
>>>> the indices (i.e. however Infinispan is configured to do it).  The
>>>> first option would mean you end up with either a monolithic index in
>>>> the grid, or partitions based on, say, date, that have to be sync'd
>>>> en masse to whichever node(s) are doing a query.  I realise that the
>>>> second technique would produce duplicates, but I'm sure there would
>>>> be a way to eliminate dupes based on the object's uuid (something
>>>> I'm pretty sure Infinispan already has a notion of).
>>>
>>> Well, 2 looks nicer, but I don't know an obvious way to solve the
>>> duplication issues:
>>>  - returning the same content several times alters the scoring of
>>> other documents
>>>  - it prevents efficient pagination, as you somehow need to skip
>>> several results.
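To make the pagination point concrete, here is a rough sketch of client-side de-duplication by uuid, assuming each node returns `(uuid, score)` pairs. The function name `merge_hits` and the data shapes are hypothetical. It only removes duplicate hits after the fact; it does nothing about duplicates skewing the per-node scores themselves.

```python
def merge_hits(per_node_hits, offset, limit):
    """Merge (uuid, score) hits from several nodes into one ranked page.

    Duplicates are collapsed by uuid, keeping the best score seen.
    """
    best = {}
    for hits in per_node_hits:
        for uuid, score in hits:
            if uuid not in best or score > best[uuid]:
                best[uuid] = score
    # Highest score first; uuid breaks ties so the order is stable.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[offset:offset + limit]


node_a = [("u1", 0.9), ("u2", 0.5)]
node_b = [("u1", 0.8), ("u3", 0.7)]

first_page = merge_hits([node_a, node_b], offset=0, limit=2)
second_page = merge_hits([node_a, node_b], offset=2, limit=2)
print(first_page, second_page)
```

Note that the page can only be cut after all hits up to `offset + limit` have been collected from every node, because the number of duplicates is not known in advance. That is the inefficiency described in the second bullet.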
>>>
>>>>
>>>> We would also need to come up with a way of normalising the scoring
>>>> across all partitions (regardless of which method is used).  I have
>>>> seen this done before, and it would basically involve, per query,
>>>> finding out the term frequency of the various keywords across the
>>>> entire index, or at least enough of it to produce a representative
>>>> value.  This would be used to calculate the score for each hit when
>>>> doing the actual search, and thus the ranking.
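The per-query normalisation described above could look roughly like the following, assuming each partition can report its document count and per-term document frequencies. The function names and the smoothed IDF formula are illustrative assumptions, not the scoring of any particular Lucene version.

```python
import math
from collections import Counter


def global_stats(partition_stats):
    """Sum (num_docs, {term: doc_freq}) pairs reported by each partition."""
    total_docs = 0
    doc_freq = Counter()
    for num_docs, freqs in partition_stats:
        total_docs += num_docs
        doc_freq.update(freqs)  # Counter.update adds counts per term
    return total_docs, doc_freq


def idf(term, total_docs, doc_freq):
    """Smoothed inverse document frequency over the whole grid (assumed formula)."""
    return 1.0 + math.log(total_docs / (doc_freq[term] + 1))


def score(term_freqs, query_terms, total_docs, doc_freq):
    """Score one hit using grid-wide statistics instead of partition-local ones."""
    return sum(
        term_freqs.get(term, 0) * idf(term, total_docs, doc_freq)
        for term in query_terms
    )


# Hypothetical statistics from two partitions for the query terms.
stats = [(100, {"grid": 10}), (300, {"grid": 9, "cache": 50})]
total_docs, doc_freq = global_stats(stats)
print(score({"grid": 2}, ["grid", "cache"], total_docs, doc_freq))
```

Because every partition scores against the same `total_docs` and `doc_freq`, scores from different nodes become comparable, at the price of one extra round of statistics gathering per query.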
>>>
>>> I believe Lucene does normalize the score properly when using the
>>> remote IndexSearcher as the normalization is done on the "client"
>>> side.
>>>
>>>>
>>>> We have had issues with index corruption in the past as well
>>>> (probably due to programming bugs rather than Lucene).  Making each
>>>> node responsible for its own index will make it very easy to throw
>>>> corrupt indices away and re-generate new ones.
>>>>
>>>> I did take a look at the visitor stuff in Infinispan before, but I
>>>> wasn't really sure where the best place to hook into would be to
>>>> find out which objects are being stored locally or evicted.  If
>>>> someone has a good idea of where to start, I'd be happy to lend a
>>>> hand to this effort!
>>>>
>>>> Ray
>>>>
>>>>
>>>> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
>>>> <michael.neale at gmail.com> wrote:
>>>>> I think you just stuck a pin in the bubble that normally says
>>>>> "magic happens here" ;)
>>>>>
>>>>> How much of this did you tackle regarding Hibernate Search that
>>>>> could be applied here?
>>>>>
>>>>> (your final point re duplication may have some "flexibility" I
>>>>> think?)
>>>>>
>>>>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>>>>> <emmanuel at hibernate.org> wrote:
>>>>>> Neither 1 nor 2 implies *distributed* queries.
>>>>>>
>>>>>> The hard parts with distributed queries (i.e. executed on a grid and
>>>>>> recomposed) are:
>>>>>>  - making sure you ask all the nodes where the index is distributed
>>>>>> (you can't miss a node)
>>>>>>  - finding a way to index only a subset of the data in a given index
>>>>>> (on a given node). Applying the Infinispan distribution routine to
>>>>>> the InfinispanDirectory does not do that; it chunks data
>>>>>> arbitrarily.
>>>>>>  - being able to rebuild a given index on a given node (i.e.
>>>>>> remembering which elements were indexed)
>>>>>>  - finding a way to distribute your data without duplication. If a
>>>>>> key is indexed multiple times, then you end up with duplicated
>>>>>> results that can't trivially be de-duplicated.
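As an illustration of the first point in the list above (you can't miss a node), here is a small scatter-gather sketch that refuses to return a partial answer. `scatter_gather`, `IncompleteResult` and the `run_query` callback are hypothetical names; a real grid would go through its own RPC layer rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


class IncompleteResult(Exception):
    """Raised when at least one node did not answer the query."""


def scatter_gather(members, run_query):
    """Run run_query(member) on every member; fail if any member fails."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(members))) as pool:
        futures = {member: pool.submit(run_query, member) for member in members}
        for member, future in futures.items():
            try:
                results[member] = future.result()
            except Exception:
                failed.append(member)
    if failed:
        # A missing node means missing hits, so do not return partial data.
        raise IncompleteResult(failed)
    return results


def fake_query(member):
    """Stand-in for a remote index lookup; returns one hit per node."""
    return [f"hit-from-{member}"]


print(scatter_gather(["n1", "n2"], fake_query))
```

Whether to fail, retry, or return partial results with a warning is a policy choice; the sketch takes the strict option because a silently skipped node silently drops results.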
>>>>>>
>>>>>> Happy thinking.
>>>>>>
>>>>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>>>>
>>>>>>> 2009/9/17 Michael Neale <michael.neale at gmail.com>:
>>>>>>>> I am still not entirely sure what I am asking, but look forward
>>>>>>>> to your merged-in changes (they are in another branch right now,
>>>>>>>> yes?).
>>>>>>>>
>>>>>>>> Yes, I mean querying objects - I was under the impression that
>>>>>>>> Lucene was used for the indexing of the data to service these
>>>>>>>> queries?
>>>>>>>
>>>>>>> Sure, to clarify: there's work going on on two different aspects,
>>>>>>> which complement each other in the ideal setup:
>>>>>>>
>>>>>>> 1) Be able to query a Lucene index (wherever you store that) to
>>>>>>> find objects which are located inside Infinispan; this is about
>>>>>>> how to search them and how to keep the index in sync with
>>>>>>> Infinispan's content.
>>>>>>>
>>>>>>> 2) Store a Lucene index inside Infinispan instead of, for example,
>>>>>>> the filesystem. In this case we're not concerned about what you
>>>>>>> index; the Lucene interface is the usual one and you should be
>>>>>>> able to replace the Directory implementation in existing
>>>>>>> applications.
>>>>>>>
>>>>>>> So 1) is the branch you've found, and Navin is working on that;
>>>>>>> 2) is not yet in subversion. The latest patch is attached to the
>>>>>>> other thread by Łukasz, and is to be applied on Hibernate Search's
>>>>>>> trunk (and depends on Infinispan).
>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>>>>> <nsurtani at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>>>>
>>>>>>>>>> oh ok nice - could you point me at which branch to try to find
>>>>>>>>>> some tests to play with?
>>>>>>>>>
>>>>>>>>> If you're talking about querying objects in Infinispan:
>>>>>>>>>
>>>>>>>>> The eventual goal is to be able to have different configurations
>>>>>>>>> for how you want to index your data. Manik has given me the 'OK'
>>>>>>>>> to push a simple query interface for CR1 for Monday/Tuesday.
>>>>>>>>>
>>>>>>>>> I'm kind of pressed with getting the code working for this, and
>>>>>>>>> between moving house and the lack of internet there I'll be a
>>>>>>>>> bit quiet. However, I'll get a wiki up by the end of the week
>>>>>>>>> about how this all works.
>>>>>>>>>
>>>>>>>>> However, if you're not, then I assume you're talking about using
>>>>>>>>> Lucene to index into Infinispan?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>>>>> <sanne.grinovero at gmail.com> wrote:
>>>>>>>>>>> 2009/9/16 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>>>> regarding indexing and queries - is the current aim to not
>>>>>>>>>>>> require that the index for the entire data grid exist on a
>>>>>>>>>>>> single node?
>>>>>>>>>>>>
>>>>>>>>>>>> (asking as a potential user who is wrestling with Lucene
>>>>>>>>>>>> indexes at the moment and is curious).
>>>>>>>>>>>
>>>>>>>>>>> Yes, the concept is to store the Lucene index itself in the
>>>>>>>>>>> grid, so it will be distributed, and the segments you use most
>>>>>>>>>>> get cached locally.
>>>>>>>>>>> At the moment you have to select only one node to write to the
>>>>>>>>>>> index, but all other nodes should be able to read.
>>>>>>>>>>> Feel free to test it, as we need feedback.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Michael D Neale
>>>>>>>>>>>> home: www.michaelneale.net
>>>>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> infinispan-dev mailing list
>>>>>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Navin Surtani
>>>>>>>>>
>>>>>>>>> Intern Infinispan
>>>>>>>>> Intern JBoss Cache Searchable
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> hibernate-dev mailing list
>>>>>>> hibernate-dev at lists.jboss.org
>>>>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ray Hilton
>>>> -
>>>>         email: ray at wirestorm.net
>>>> melbourne: +61 (0) 3 9077 0513
>>>>       mobile: +61 (0) 430 484 708
>>>>

-- 
Ray Hilton
-
         email: ray at wirestorm.net
 melbourne: +61 (0) 3 9077 0513
       mobile: +61 (0) 430 484 708