[infinispan-dev] [hibernate-dev] Distributed queries

Michael Neale michael.neale at gmail.com
Tue Sep 22 05:27:35 EDT 2009


Well, as part of a separate project to do with "cloud" stuff, we have (or
will have) appliances/images that can help set up reasonably large
clusters for testing. If we can do the testing within an hour, it should
only cost a couple of dollars at a time for hundreds of nodes.

On Tue, Sep 22, 2009 at 7:23 PM, Navin Surtani <nsurtani at redhat.com> wrote:
>
> On 22 Sep 2009, at 03:00, Michael Neale wrote:
>
>> I guess that could make sense for some cases - if the data changes are
>> small-ish, and the index calculation cost isn't huge... I guess if the
>> objects you are parking in infinispan are large, then it could end up
>> more efficient to only do the index once and then spread it around
>> (sticky to the data that it represents).
>
>
> I believe the plan is to build in a few different configs that would
> work for different use-cases. For example, a lot of "small" objects
> but not necessarily many nodes so they all share the same index or a
> lot of "big" objects sitting on disk on each individual node (where
> replication could be expensive).
>
> Or this is what I understood when speaking with Manik a couple of
> weeks ago that is :-).
>
>
>
>>
>> On Mon, Sep 21, 2009 at 11:09 PM, Ray Hilton <ray at wirestorm.net>
>> wrote:
>>> I'm guessing that something similar already happens so that Infinispan
>>> can re-jig the data around the grid.  Forgive my lack of intimate
>>> knowledge of how Infinispan works here, but at some point the data
>>> that was hosted by a bad node needs to be redistributed?
>>>
>>> On Mon, Sep 21, 2009 at 11:02 PM, Emmanuel Bernard
>>> <emmanuel at hibernate.org> wrote:
>>>> Could be possible. That would likely be chatty, though, each time a
>>>> node comes or goes.
>>>> Typically when a node goes down, potentially due to a network error,
>>>> you don't wanna be chatty, I imagine ;)
>>>>
>>>> On 21 sept. 09, at 14:59, Ray Hilton wrote:
>>>>
>>>>> Yes, point taken.
>>>>>
>>>>> Is there perhaps a way to only index an object on one node?  For
>>>>> example, if each node knew there were currently 3 copies, and it was
>>>>> the node with the lowest id, it would index the document.
>>>>> When a new node joins or a node fails, the strategy is re-applied
>>>>> and the node-local indices are updated accordingly.
>>>>>
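
A rough sketch of the lowest-id rule quoted above, in Java. The class and method names are made up for illustration and are not Infinispan API. It assumes node ids are comparable integers and that every node sees the same set of owners for a key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Infinispan: decides whether the local
// node is the one that should index a given entry.
final class LowestIdIndexer {

    // True only when the local node holds a copy and has the smallest id
    // among all nodes currently holding a copy.
    static boolean shouldIndex(int localId, Set<Integer> ownerIds) {
        if (ownerIds.isEmpty() || !ownerIds.contains(localId)) {
            return false;
        }
        return localId == Collections.min(ownerIds);
    }

    public static void main(String[] args) {
        Set<Integer> owners = new TreeSet<>(Arrays.asList(3, 7, 12));
        System.out.println(shouldIndex(3, owners)); // node 3 is the lowest owner
        owners.remove(3);                           // node 3 leaves the cluster
        System.out.println(shouldIndex(7, owners)); // rule re-applied: node 7 takes over
    }
}
```

Every membership change means re-evaluating this for each affected key, which is where the chattiness concern raised elsewhere in this thread comes from.
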
>>>>> On Mon, Sep 21, 2009 at 6:06 PM, Emmanuel Bernard
>>>>> <emmanuel at hibernate.org> wrote:
>>>>>> Hello
>>>>>> See inline
>>>>>>
>>>>>> On 20 sept. 09, at 06:01, Ray Hilton wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I've been following the distributed query stuff with interest, but
>>>>>>> this is the first time I'm posting, so please excuse the lack of
>>>>>>> intimate knowledge of Infinispan.  Basically, I have been working on
>>>>>>> a project that could really do with the Holy Grail of a distributed
>>>>>>> query-able cache and I really liked the look of using JBossCache +
>>>>>>> the Lucene Directory implementation that Manik wrote a while back.
>>>>>>> I then noticed Infinispan and talk of building querying directly
>>>>>>> into the project and figured that it would be worthwhile waiting to
>>>>>>> see how that panned out.
>>>>>>>
>>>>>>> I've thought a bit about how something like this might work. I'm not
>>>>>>> sure if this will be in any way helpful, but here goes.  I guess
>>>>>>> there are two approaches: 1) store the index (or partitioned
>>>>>>> indices) in the grid and sync it to a node to do a particular query,
>>>>>>> or 2) each node has an index for the data it currently caches.  We
>>>>>>> preferred the second idea as it offers a natural way to partition
>>>>>>> the indices (i.e. however Infinispan is configured to do it).  The
>>>>>>> first option would mean you end up with either a monolithic index in
>>>>>>> the grid, or partitions based on, say, date, that have to be sync'd
>>>>>>> en masse to whichever node(s) are doing a query.  I realise that the
>>>>>>> second technique would produce duplicates, but I'm sure there would
>>>>>>> be a way to eliminate dupes based on the object's uuid (something
>>>>>>> I'm pretty sure Infinispan already has a notion of).
>>>>>>
>>>>>> Well 2 looks nicer but I don't know an obvious way to solve the
>>>>>> duplication issues:
>>>>>>  - returning the same content several times alters the scoring of
>>>>>> other documents
>>>>>>  - it prevents efficient pagination, as you somehow need to skip
>>>>>> several results.
>>>>>>
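
For illustration, a naive client-side merge that drops duplicates by uuid might look like the sketch below. The `Hit` type and the `mergeDistinct` method are hypothetical, not Lucene or Infinispan API. It assumes each hit carries the object's uuid and a score, and keeps the best-scoring copy of each object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical merge step run on the node that issued the query.
final class HitMerger {

    // Minimal stand-in for a search hit: object uuid plus score.
    static final class Hit {
        final String uuid;
        final float score;

        Hit(String uuid, float score) {
            this.uuid = uuid;
            this.score = score;
        }
    }

    // Collapses the per-node result lists into one list with a single
    // entry per uuid (the highest-scoring copy), ordered by score descending.
    static List<Hit> mergeDistinct(List<List<Hit>> perNodeHits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (List<Hit> hits : perNodeHits) {
            for (Hit hit : hits) {
                best.merge(hit.uuid, hit, (a, b) -> a.score >= b.score ? a : b);
            }
        }
        List<Hit> merged = new ArrayList<>(best.values());
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> merged = mergeDistinct(Arrays.asList(
                Arrays.asList(new Hit("a", 0.9f), new Hit("b", 0.4f)),
                Arrays.asList(new Hit("a", 0.7f), new Hit("c", 0.5f))));
        for (Hit hit : merged) {
            System.out.println(hit.uuid + " " + hit.score);
        }
    }
}
```

This only removes the visible duplicates. It does nothing about the two issues listed above: the duplicated documents have already influenced each node's scoring, and a page of N results can only be filled by over-fetching from every node.
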
>>>>>>>
>>>>>>> We would also need to come up with a way of normalising the scoring
>>>>>>> across all partitions (regardless of which method is used).  I have
>>>>>>> seen this done before, and it would basically involve, per-query,
>>>>>>> finding out the term frequency of the various keywords across the
>>>>>>> entire index, or at least enough of it to produce a representative
>>>>>>> value.  This would be used to calculate the score for each hit when
>>>>>>> doing the actual search, and thus the ranking.
>>>>>>
>>>>>> I believe Lucene does normalize the score properly when using the
>>>>>> remote IndexSearcher as the normalization is done on the "client"
>>>>>> side.
>>>>>>
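
If the normalisation had to be done by hand, the per-query step described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `TermStats` shape is made up, and the formula is the 1 + ln(N / (df + 1)) form used by Lucene's classic TF-IDF similarity, which may not be what a given setup actually uses. The point is only that each partition reports its local counts, the counts are summed, and every partition then scores with the same global weight.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: derives one global term weight from per-partition counts.
final class GlobalIdf {

    // Statistics one partition reports for a single query term.
    static final class TermStats {
        final long numDocs; // documents in this partition's index
        final long docFreq; // documents in this partition containing the term

        TermStats(long numDocs, long docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    // Sums the counts over all partitions and computes
    // idf = 1 + ln(totalDocs / (totalDocFreq + 1)).
    static double idf(List<TermStats> partitions) {
        long numDocs = 0;
        long docFreq = 0;
        for (TermStats stats : partitions) {
            numDocs += stats.numDocs;
            docFreq += stats.docFreq;
        }
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // The term is common on the first partition and absent on the second;
        // scored locally, the two partitions would weight it very differently.
        List<TermStats> stats = Arrays.asList(new TermStats(100, 9), new TermStats(100, 0));
        System.out.println(idf(stats));
    }
}
```
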
>>>>>>>
>>>>>>> We have had issues with index corruption in the past as well
>>>>>>> (probably due to programming bugs rather than Lucene).  Making each
>>>>>>> node responsible for its own index will make it very easy to throw
>>>>>>> corrupt indices away and re-generate new ones.
>>>>>>>
>>>>>>> I did take a look at the visitor stuff in Infinispan before, but I
>>>>>>> wasn't really sure where the best place to hook into would be to
>>>>>>> find out which objects are being stored locally or evicted.  If
>>>>>>> someone has a good idea of where to start, I'd be happy to lend a
>>>>>>> hand to this effort!
>>>>>>>
>>>>>>> Ray
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
>>>>>>> <michael.neale at gmail.com> wrote:
>>>>>>>> I think you just stuck a pin in the bubble that normally says
>>>>>>>> "magic happens here" ;)
>>>>>>>>
>>>>>>>> How much of this did you tackle regarding Hibernate Search that
>>>>>>>> could be applied here?
>>>>>>>>
>>>>>>>> (Your final point re duplication may have some "flexibility", I
>>>>>>>> think?)
>>>>>>>>
>>>>>>>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>>>>>>>> <emmanuel at hibernate.org> wrote:
>>>>>>>>> Neither 1 nor 2 imply *distributed* queries.
>>>>>>>>>
>>>>>>>>> The hard parts with distributed queries (ie executed on a grid and
>>>>>>>>> recomposed) are:
>>>>>>>>>  - making sure you ask all the nodes where the index is distributed
>>>>>>>>> (you can't miss a node)
>>>>>>>>>  - finding a way to index only a subset of the data in a given index
>>>>>>>>> (on a given node). Applying the Infinispan distribution routine to
>>>>>>>>> the InfinispanDirectory does not do that, it chunks data arbitrarily.
>>>>>>>>>  - being able to rebuild a given index on a given node (ie remembering
>>>>>>>>> which elements were indexed)
>>>>>>>>>  - finding a way to distribute your data without duplication. If a key
>>>>>>>>> is indexed multiple times, then you end up with duplicated results
>>>>>>>>> that can't trivially be de-duplicated.
>>>>>>>>>
>>>>>>>>> Happy thinking.
>>>>>>>>>
>>>>>>>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>>>>>>>
>>>>>>>>>> 2009/9/17 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>>> I am still not entirely sure what I am asking, but look forward
>>>>>>>>>>> to your merged-in changes (they are in another branch right now,
>>>>>>>>>>> yes?).
>>>>>>>>>>>
>>>>>>>>>>> Yes I mean querying objects - I was under the impression that
>>>>>>>>>>> Lucene was used for the indexing of the data to service these
>>>>>>>>>>> queries?
>>>>>>>>>>
>>>>>>>>>> Sure, to clarify: there's work going on on two different aspects,
>>>>>>>>>> which complement each other in the ideal setup:
>>>>>>>>>>
>>>>>>>>>> 1) Be able to query a Lucene index (wherever you store that) to
>>>>>>>>>> find objects which are located inside Infinispan; this is about
>>>>>>>>>> how to search them and how to keep the index in sync with
>>>>>>>>>> Infinispan's content.
>>>>>>>>>>
>>>>>>>>>> 2) Store a Lucene index inside Infinispan, instead of, for example,
>>>>>>>>>> the filesystem. In this case we're not concerned about what you
>>>>>>>>>> index; the Lucene interface is the usual one and you should be able
>>>>>>>>>> to replace the Directory implementation in existing applications.
>>>>>>>>>>
>>>>>>>>>> So 1) is the branch you've found, and Navin is working on that;
>>>>>>>>>> 2) is not yet in Subversion. The latest patch is attached to the
>>>>>>>>>> other thread by Łukasz, and is to be applied on Hibernate Search's
>>>>>>>>>> trunk (and depends on Infinispan).
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>>>>>>>> <nsurtani at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> oh ok nice - could you point me at which branch to try to
>>>>>>>>>>>>> find
>>>>>>>>>>>>> some
>>>>>>>>>>>>> tests to play with?
>>>>>>>>>>>>
>>>>>>>>>>>> If you're talking about querying objects in Infinispan:
>>>>>>>>>>>>
>>>>>>>>>>>> The eventual goal is to be able to have different configurations
>>>>>>>>>>>> for how you want to index your data. Manik has given me the 'OK'
>>>>>>>>>>>> to push a simple query interface for CR1 for Monday/Tuesday.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm kind of pressed with getting the code working for this, and
>>>>>>>>>>>> between moving house and the lack of internet there I'll be a bit
>>>>>>>>>>>> quiet. However, I'll get a wiki up by the end of the week about
>>>>>>>>>>>> how this all works.
>>>>>>>>>>>>
>>>>>>>>>>>> However, if you're not, then I assume you're talking about using
>>>>>>>>>>>> Lucene to index into Infinispan?
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>>>>>>>> <sanne.grinovero at gmail.com> wrote:
>>>>>>>>>>>>>> 2009/9/16 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>>>>>>> Regarding indexing and queries - is the current aim to not
>>>>>>>>>>>>>>> require that the index for the entire data grid exist on a
>>>>>>>>>>>>>>> single node?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (Asking as a potential user who is wrestling with Lucene
>>>>>>>>>>>>>>> indexes at the moment and is curious.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, the concept is to store the Lucene index itself in the grid,
>>>>>>>>>>>>>> so it will be distributed, and the segments you use most get
>>>>>>>>>>>>>> cached locally.
>>>>>>>>>>>>>> At the moment you have to select only one node to write to the
>>>>>>>>>>>>>> index, but all other nodes should be able to read.
>>>>>>>>>>>>>> Feel free to test it, as we need feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Michael D Neale
>>>>>>>>>>>>>>> home: www.michaelneale.net
>>>>>>>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> infinispan-dev mailing list
>>>>>>>>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Navin Surtani
>>>>>>>>>>>>
>>>>>>>>>>>> Intern Infinispan
>>>>>>>>>>>> Intern JBoss Cache Searchable
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> hibernate-dev mailing list
>>>>>>>>>> hibernate-dev at lists.jboss.org
>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ray Hilton
>>>>>>> -
>>>>>>>         email: ray at wirestorm.net
>>>>>>> melbourne: +61 (0) 3 9077 0513
>>>>>>>       mobile: +61 (0) 430 484 708



-- 
Michael D Neale
home: www.michaelneale.net
blog: michaelneale.blogspot.com



