[infinispan-dev] [hibernate-dev] Distributed queries

Tue Sep 22 05:23:23 EDT 2009

On 22 Sep 2009, at 03:00, Michael Neale wrote:

> I guess that could make sense for some cases - if the data changes are
> small-ish, and the index calculation cost isn't huge... I guess if the
> objects you are parking in infinispan are large, then it could end up
> more efficient to only do the index once and then spread it around
> (sticky to the data that it represents).

I believe the plan is to build in a few different configs that would
work for different use-cases. For example, one config could suit a lot
of "small" objects on not necessarily many nodes, so that all nodes
share the same index; another could suit a lot of "big" objects sitting
on disk on each individual node (where replication could be expensive).

Or at least this is what I understood when speaking with Manik a couple
of weeks ago :-).

>
> On Mon, Sep 21, 2009 at 11:09 PM, Ray Hilton <ray at wirestorm.net>  
> wrote:
>> I'm guessing that something similar already happens so that Infinispan
>> can re-jig the data around the grid.  Forgive my lack of intimate
>> knowledge of how Infinispan works here, but at some point the data
>> that was hosted by a bad node needs to be re-distributed?
>>
>> On Mon, Sep 21, 2009 at 11:02 PM, Emmanuel Bernard
>> <emmanuel at hibernate.org> wrote:
>>> Could be possible. That would likely be chatty, though, each time a
>>> node comes or goes.
>>> Typically when a node goes down, potentially due to a network error,
>>> you don't wanna be chatty I imagine ;)
>>>
>>> On 21 sept. 09, at 14:59, Ray Hilton wrote:
>>>
>>>> Yes, point taken.
>>>>
>>>> Is there perhaps a way to only index an object on one node?  For
>>>> example, if each node knew there were currently 3 copies, and it was
>>>> the node with the lowest id, it would index the document.
>>>> When a new node joins or a node fails, the strategy is re-applied
>>>> and the node-local indices are updated accordingly.
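A minimal Java sketch of the lowest-id rule described above, assuming each node can see the ids of the nodes currently holding a copy of a key. The names IndexOwnerElection and shouldIndex are hypothetical and not part of Infinispan; how the owner list is obtained from the grid is deliberately left out.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not an Infinispan API: decides whether the local
// node is the one that should index a given entry.
final class IndexOwnerElection {
    private IndexOwnerElection() {}

    /**
     * Returns true only when localNodeId is the lowest id among the nodes
     * that currently hold a copy of the entry.
     */
    static boolean shouldIndex(int localNodeId, Collection<Integer> ownerIds) {
        if (ownerIds.isEmpty()) {
            return false; // nobody owns the entry, so nobody indexes it
        }
        return localNodeId == Collections.min(ownerIds);
    }

    public static void main(String[] args) {
        List<Integer> owners = Arrays.asList(7, 3, 12);
        for (int node : owners) {
            System.out.println("node " + node + " indexes: " + shouldIndex(node, owners));
        }
    }
}
```

Re-applying the strategy after a join or a failure amounts to calling shouldIndex again on every remaining owner with the new owner list. The sketch says nothing about how much traffic that re-evaluation causes.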
>>>>
>>>> On Mon, Sep 21, 2009 at 6:06 PM, Emmanuel Bernard
>>>> <emmanuel at hibernate.org> wrote:
>>>>> Hello
>>>>> See inline
>>>>>
>>>>> On 20 sept. 09, at 06:01, Ray Hilton wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I've been following the distributed query stuff with interest, but
>>>>>> this is the first time I'm posting, so please excuse the lack of
>>>>>> intimate knowledge of Infinispan.  Basically, I have been working on
>>>>>> a project that could really do with the Holy Grail of a distributed
>>>>>> query-able cache and I really liked the look of using JBossCache +
>>>>>> the Lucene Directory implementation that Manik wrote a while back.
>>>>>> I then noticed Infinispan and talk of building querying directly
>>>>>> into the project and figured that it would be worthwhile waiting to
>>>>>> see how that panned out.
>>>>>>
>>>>>> I've thought a bit about how something like this might work. I'm not
>>>>>> sure if this will be in any way helpful, but here goes.  I guess
>>>>>> there are two approaches:  1) store the index (or partitioned
>>>>>> indices) in the grid and sync it to a node to do a particular query,
>>>>>> or 2) each node has an index for the data it currently caches.  We
>>>>>> preferred the second idea as it offers a natural way to partition
>>>>>> the indices (i.e. however Infinispan is configured to do it).  The
>>>>>> first option would mean you end up with either a monolithic index in
>>>>>> the grid, or partitions based on, say, date, that have to be sync'd
>>>>>> en masse to whichever node(s) are doing a query.  I realise that the
>>>>>> second technique would produce duplicates, but I'm sure there would
>>>>>> be a way to eliminate dupes based on the object's uuid (something
>>>>>> I'm pretty sure Infinispan already has a notion of).
>>>>>
>>>>> Well, 2 looks nicer but I don't know an obvious way to solve the
>>>>> duplication issues:
>>>>>  - returning the same content several times does alter the scoring
>>>>> of other documents
>>>>>  - it prevents efficient pagination, as you somehow need to skip
>>>>> over several results.
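To make the pagination problem concrete, here is a hedged Java sketch of de-duplicating per-node hit lists by key before cutting a page. DedupMerge, Hit and mergePage are hypothetical names, and the sketch assumes every node has already returned at least offset + pageSize hits. It only removes duplicate rows from the merged result; it does not undo the scoring skew that duplicated documents cause inside each node's index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side merge, not part of Lucene or Infinispan.
final class DedupMerge {
    private DedupMerge() {}

    /** One search hit: the cache key of the object and its score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Merges per-node hits, keeps the best-scoring copy of each key, returns one page. */
    static List<Hit> mergePage(List<List<Hit>> perNodeHits, int offset, int pageSize) {
        Map<String, Hit> bestByKey = new LinkedHashMap<>();
        for (List<Hit> nodeHits : perNodeHits) {
            for (Hit hit : nodeHits) {
                Hit previous = bestByKey.get(hit.key);
                if (previous == null || hit.score > previous.score) {
                    bestByKey.put(hit.key, hit);
                }
            }
        }
        List<Hit> merged = new ArrayList<>(bestByKey.values());
        // Highest score first; ties are broken by key so the order is stable.
        Comparator<Hit> byScoreDesc = Comparator.comparingDouble((Hit h) -> h.score).reversed();
        merged.sort(byScoreDesc.thenComparing((Hit h) -> h.key));

        int from = Math.max(0, Math.min(offset, merged.size()));
        int to = Math.min(from + Math.max(0, pageSize), merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }
}
```

The cost is visible in the signature: to serve page N, the caller has to pull the first N pages from every node, because it cannot know in advance how many hits will collapse into duplicates.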
>>>>>
>>>>>>
>>>>>> We would also need to come up with a way of normalising the scoring
>>>>>> across all partitions (regardless of which method is used).  I have
>>>>>> seen this done before, and it would basically involve, per-query,
>>>>>> finding out the term frequency of the various keywords across the
>>>>>> entire index, or at least enough of it to produce a representative
>>>>>> value.  This would be used to calculate the score for each hit when
>>>>>> doing the actual search, and thus the ranking.
>>>>>
>>>>> I believe Lucene does normalize the score properly when using the
>>>>> remote IndexSearcher as the normalization is done on the "client"
>>>>> side.
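As an illustration of the per-query normalisation discussed above, the following Java sketch sums per-partition document counts and document frequencies for one term and derives a single inverse document frequency from the totals. GlobalIdf and PartitionStats are hypothetical names; the formula 1 + ln(numDocs / (docFreq + 1)) is one common TF-IDF formulation and is used here as an assumption, not as a statement of what any particular Lucene version computes.

```java
import java.util.List;

// Hypothetical helper: combines per-partition statistics for ONE query term
// so that every partition can score with the same idf value.
final class GlobalIdf {
    private GlobalIdf() {}

    /** Statistics one partition reports for a single term. */
    static final class PartitionStats {
        final long numDocs; // documents in this partition's index
        final long docFreq; // documents in this partition containing the term

        PartitionStats(long numDocs, long docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    /** idf computed from the totals across all partitions. */
    static double globalIdf(List<PartitionStats> stats) {
        long totalDocs = 0;
        long totalDocFreq = 0;
        for (PartitionStats s : stats) {
            totalDocs += s.numDocs;
            totalDocFreq += s.docFreq;
        }
        return 1.0 + Math.log((double) totalDocs / (totalDocFreq + 1));
    }
}
```

If the same object is indexed on several nodes, both totals are inflated by the duplicates, so this kind of normalisation only gives clean numbers once each key is indexed exactly once.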
>>>>>
>>>>>>
>>>>>> We have had issues with index corruption in the past as well
>>>>>> (probably due to programming bugs rather than Lucene).  Making each
>>>>>> node responsible for its own index will make it very easy to throw
>>>>>> corrupt indices away and re-generate new ones.
>>>>>>
>>>>>> I did take a look at the visitor stuff in Infinispan before, but I
>>>>>> wasn't really sure where the best place to hook into would be to
>>>>>> find out which objects are being stored locally or evicted.  If
>>>>>> someone has a good idea of where to start, I'd be happy to lend a
>>>>>> hand to this effort!
>>>>>>
>>>>>> Ray
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
>>>>>> <michael.neale at gmail.com> wrote:
>>>>>>> I think you just stuck a pin in the bubble that normally says
>>>>>>> "magic happens here" ;)
>>>>>>>
>>>>>>> How much of this did you tackle regarding Hibernate Search that
>>>>>>> could be applied here?
>>>>>>>
>>>>>>> (your final point re duplication may have some "flexibility" I
>>>>>>> think ?)
>>>>>>>
>>>>>>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>>>>>>> <emmanuel at hibernate.org> wrote:
>>>>>>>> Neither 1 nor 2 imply *distributed* queries.
>>>>>>>>
>>>>>>>> The hard parts with distributed queries (i.e. executed on a grid
>>>>>>>> and recomposed) are:
>>>>>>>>  - making sure you ask all the nodes where the index is
>>>>>>>> distributed (you can't miss a node)
>>>>>>>>  - finding a way to index only a subset of the data in a given
>>>>>>>> index (on a given node). Applying the Infinispan distribution
>>>>>>>> routine to the InfinispanDirectory does not do that, it chunks
>>>>>>>> data arbitrarily.
>>>>>>>>  - being able to rebuild a given index on a given node (i.e.
>>>>>>>> remembering which elements were indexed)
>>>>>>>>  - finding a way to distribute your data without duplication. If
>>>>>>>> a key is indexed multiple times, then you end up with duplicated
>>>>>>>> results that can't trivially be de-duplicated.
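The first point in the list above ("you can't miss a node") can be sketched as a completeness check on the gather side of a scatter-gather query. This is only an illustration: ScatterGather and gatherAll are hypothetical names, node identity is reduced to a String, and the way partial results are actually sent and collected is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical gather step: refuses to recompose a result unless every
// node that holds part of the index has answered.
final class ScatterGather {
    private ScatterGather() {}

    /**
     * Returns the partial results in member iteration order, or throws
     * if at least one expected member did not respond.
     */
    static <R> List<R> gatherAll(Set<String> members, Map<String, R> responses) {
        Set<String> missing = new TreeSet<>(members);
        missing.removeAll(responses.keySet());
        if (!missing.isEmpty()) {
            throw new IllegalStateException("No response from nodes: " + missing);
        }
        List<R> partials = new ArrayList<>();
        for (String member : members) {
            partials.add(responses.get(member));
        }
        return partials;
    }
}
```

Failing loudly is a design choice made for the sketch; a real system might instead retry, or return a result flagged as partial.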
>>>>>>>>
>>>>>>>> Happy thinking.
>>>>>>>>
>>>>>>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>>>>>>
>>>>>>>>> 2009/9/17 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>> I am still not entirely sure what I am asking, but I look
>>>>>>>>>> forward to your merged-in changes (they are in another branch
>>>>>>>>>> right now, yes?).
>>>>>>>>>>
>>>>>>>>>> Yes, I mean querying objects - I was under the impression that
>>>>>>>>>> Lucene was used for the indexing of the data to service these
>>>>>>>>>> queries?
>>>>>>>>>
>>>>>>>>> Sure, to clarify: there's work going on on two different aspects,
>>>>>>>>> which complement each other in the ideal setup:
>>>>>>>>>
>>>>>>>>> 1) Be able to query a Lucene index (wherever you store that) to
>>>>>>>>> find objects which are located inside Infinispan; this is about
>>>>>>>>> how to search them and how to maintain the index in sync with
>>>>>>>>> Infinispan's content.
>>>>>>>>>
>>>>>>>>> 2) Store a Lucene index inside Infinispan, instead of, for
>>>>>>>>> example, the filesystem. In this case we're not concerned about
>>>>>>>>> what you index, the Lucene interface is the usual one and you
>>>>>>>>> should be able to replace the Directory implementation in
>>>>>>>>> existing applications.
>>>>>>>>>
>>>>>>>>> So 1) is the branch you've found, and Navin is working on that;
>>>>>>>>> 2) is not yet in subversion, the latest patch is attached to
>>>>>>>>> another thread by Łukasz, and is to be applied on Hibernate
>>>>>>>>> Search's trunk (and depends on Infinispan).
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>>>>>>> <nsurtani at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>>>>>>
>>>>>>>>>>>> oh ok nice - could you point me at which branch to try to
>>>>>>>>>>>> find some tests to play with?
>>>>>>>>>>>
>>>>>>>>>>> If you're talking about querying objects in Infinispan:
>>>>>>>>>>>
>>>>>>>>>>> The eventual goal is to be able to have different
>>>>>>>>>>> configurations on how you want to index your data. Manik has
>>>>>>>>>>> given me the 'OK' to push a simple query interface for CR1 for
>>>>>>>>>>> Monday/Tuesday.
>>>>>>>>>>>
>>>>>>>>>>> I'm kind-of pressed with getting the code working for this and,
>>>>>>>>>>> between moving house and the lack of internet there, I'll be a
>>>>>>>>>>> bit quiet. However, I'll get a wiki up by the end of the week
>>>>>>>>>>> about how this all works.
>>>>>>>>>>>
>>>>>>>>>>> However, if you're not, then I assume you're talking about
>>>>>>>>>>> using Lucene to index into Infinispan?
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>>>>>>> <sanne.grinovero at gmail.com> wrote:
>>>>>>>>>>>>> 2009/9/16 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>>>>>> regarding indexing and queries - is the current aim to not
>>>>>>>>>>>>>> require that the index for the entire data grid exist on a
>>>>>>>>>>>>>> single node?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (asking as a curious potential user who is wrestling with
>>>>>>>>>>>>>> Lucene indexes at the moment).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the concept is to store the Lucene index itself in the
>>>>>>>>>>>>> grid, so it will be distributed, and the segments you use
>>>>>>>>>>>>> most get cached locally.
>>>>>>>>>>>>> At the moment you have to select only one node to write to
>>>>>>>>>>>>> the index, but all other nodes should be able to read.
>>>>>>>>>>>>> Feel free to test it as we need feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Michael D Neale
>>>>>>>>>>>>>> home: www.michaelneale.net
>>>>>>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> infinispan-dev mailing list
>>>>>>>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>>>
>>>>>>>>>>> Navin Surtani
>>>>>>>>>>>
>>>>>>>>>>> Intern Infinispan
>>>>>>>>>>> Intern JBoss Cache Searchable
>>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> hibernate-dev mailing list
>>>>>>>>> hibernate-dev at lists.jboss.org
>>>>>>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ray Hilton
>>>>>> -
>>>>>>         email: ray at wirestorm.net
>>>>>> melbourne: +61 (0) 3 9077 0513
>>>>>>       mobile: +61 (0) 430 484 708

Navin Surtani

Intern Infinispan
Intern JBoss Cache Searchable