[infinispan-dev] [hibernate-dev] Distributed queries
Emmanuel Bernard
emmanuel at hibernate.org
Mon Sep 21 09:02:45 EDT 2009
Could be possible. That would likely be chatty, though, each time a node
comes or goes.
Typically, when a node goes down, potentially due to a network error, you
don't wanna be chatty, I imagine ;)
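
To make the trade-off concrete, here is a minimal sketch of the "lowest id indexes the document" strategy quoted below. It is plain Python rather than Infinispan code: the ownership table, the node names, and the helper functions are all hypothetical, and it assumes every node can see the same owner list for a key.

```python
from typing import Dict, List, Set


def indexing_node(owners: List[str]) -> str:
    """Pick the single node that indexes a key: the owner with the lowest id."""
    if not owners:
        raise ValueError("a key must have at least one owner")
    return min(owners)


def assign_indexers(ownership: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every key to the one node responsible for indexing it."""
    return {key: indexing_node(owners) for key, owners in ownership.items()}


def keys_to_reindex(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Keys whose indexing node changed after a membership change."""
    return {key for key, node in after.items() if before.get(key) != node}


# Hypothetical ownership table: key -> nodes currently holding a copy.
ownership = {
    "k1": ["node-a", "node-b", "node-c"],
    "k2": ["node-b", "node-c", "node-d"],
    "k3": ["node-a", "node-c", "node-d"],
}
before = assign_indexers(ownership)

# node-a fails: drop it from every owner list and re-apply the strategy.
survivors = {k: [n for n in owners if n != "node-a"] for k, owners in ownership.items()}
after = assign_indexers(survivors)

moved = keys_to_reindex(before, after)
```

Every key in `moved` has to be indexed again on its new node, which is where the chattiness on membership changes comes from.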
On 21 sept. 09, at 14:59, Ray Hilton wrote:
> Yes, point taken.
>
> Is there perhaps a way to only index an object on one node? For
> example, if each node knew there were currently 3 copies, and it was
> the node with the lowest id, it would index the document.
> When a new node joins or a node fails, the strategy is re-applied and
> the node-local indices are updated accordingly.
>
> On Mon, Sep 21, 2009 at 6:06 PM, Emmanuel Bernard
> <emmanuel at hibernate.org> wrote:
>> Hello
>> See inline
>>
>> On 20 sept. 09, at 06:01, Ray Hilton wrote:
>>
>>> Hi guys,
>>>
>>> I've been following the distributed query stuff with interest, but
>>> this is the first time I'm posting, so please excuse the lack of
>>> intimate knowledge of Infinispan. Basically, I have been working on a
>>> project that could really do with the Holy Grail of a distributed
>>> query-able cache, and I really liked the look of using JBossCache +
>>> the Lucene Directory implementation that Manik wrote a while back. I
>>> then noticed Infinispan and talk of building querying directly into
>>> the project and figured that it would be worthwhile waiting to see
>>> how that panned out.
>>>
>>> I've thought a bit about how something like this might work. I'm not
>>> sure if this will be in any way helpful, but here goes. I guess there
>>> are two approaches: 1) store the index (or partitioned indices) in
>>> the grid and sync it to a node to do a particular query, or 2) each
>>> node has an index for the data it currently caches. We preferred the
>>> second idea as it offers a natural way to partition the indices (i.e.
>>> however Infinispan is configured to do it). The first option would
>>> mean you end up with either a monolithic index in the grid, or
>>> partitions based on, say, date, that have to be sync'd en masse to
>>> whichever node(s) are doing a query. I realise that the second
>>> technique would produce duplicates, but I'm sure there would be a way
>>> to eliminate dupes based on the object's uuid (something I'm pretty
>>> sure Infinispan already has a notion of).
>>
>> Well, 2 looks nicer, but I don't know an obvious way to solve the
>> duplication issues:
>> - returning the same content several times alters the scoring of
>> other documents
>> - it prevents efficient pagination, as you somehow need to skip
>> several results.
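
As an illustration of the easy half of the duplication problem, the sketch below merges hit lists from two nodes and keeps one entry per key. The `(key, score)` tuple shape and the function names are assumptions made for this example, not part of Lucene or Infinispan.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (key, score)


def merge_hits(per_node_hits: List[List[Hit]]) -> List[Hit]:
    """Merge hit lists from several nodes, keeping one entry per key."""
    best: Dict[str, float] = {}
    for hits in per_node_hits:
        for key, score in hits:
            if key not in best or score > best[key]:
                best[key] = score
    # Highest score first; ties broken by key so the order is deterministic.
    return sorted(best.items(), key=lambda hit: (-hit[1], hit[0]))


def page(hits: List[Hit], offset: int, size: int) -> List[Hit]:
    """Return one page of an already merged and sorted hit list."""
    return hits[offset:offset + size]


node_1 = [("doc-a", 0.9), ("doc-b", 0.7), ("doc-c", 0.4)]
node_2 = [("doc-a", 0.9), ("doc-d", 0.6), ("doc-b", 0.5)]

merged = merge_hits([node_1, node_2])
first_page = page(merged, 0, 2)
```

This only works because every node's full hit list is pulled to one place before paging. It does nothing about the scores each node computed over an index that contains duplicates, and a node cannot be asked for "hits 20 to 30" without knowing how many of its earlier hits were duplicates held elsewhere.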
>>
>>>
>>> We would also need to come up with a way of normalising the scoring
>>> across all partitions (regardless of which method is used). I have
>>> seen this done before, and it would basically involve, per-query,
>>> finding out the term frequency of the various keywords across the
>>> entire index, or at least enough of it to produce a representative
>>> value. This would be used to calculate the score for each hit when
>>> doing the actual search, and thus the ranking.
>>
>> I believe Lucene does normalize the score properly when using the
>> remote IndexSearcher as the normalization is done on the "client"
>> side.
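
A rough sketch of the per-query normalisation idea described above: sum the document counts and per-term document frequencies over all partitions, then derive one weight per term that every node would use. The `NodeStats` class is hypothetical and the smoothed IDF formula is just one common choice; nothing here is taken from Lucene's own implementation.

```python
import math
from typing import Dict, List


class NodeStats:
    """Per-node index statistics (hypothetical shape, for illustration)."""

    def __init__(self, num_docs: int, doc_freq: Dict[str, int]) -> None:
        self.num_docs = num_docs
        self.doc_freq = doc_freq


def global_idf(terms: List[str], nodes: List[NodeStats]) -> Dict[str, float]:
    """Compute one IDF per query term from statistics summed over all nodes."""
    total_docs = sum(node.num_docs for node in nodes)
    idf = {}
    for term in terms:
        df = sum(node.doc_freq.get(term, 0) for node in nodes)
        # Smoothed inverse document frequency; the +1 avoids division by zero.
        idf[term] = 1.0 + math.log(total_docs / (df + 1))
    return idf


nodes = [
    NodeStats(num_docs=1000, doc_freq={"grid": 10, "cache": 400}),
    NodeStats(num_docs=1000, doc_freq={"grid": 300, "cache": 380}),
]
weights = global_idf(["grid", "cache"], nodes)
local_only = global_idf(["grid"], nodes[:1])
```

With these made-up numbers, "grid" looks much rarer on the first node alone than it is across the grid, so scores computed from local statistics would not be comparable between nodes. Note that the sums are only meaningful if each document is counted once, which ties back to the duplication issue.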
>>
>>>
>>> We have had issues with index corruption in the past as well
>>> (probably due to programming bugs rather than Lucene). Making each
>>> node responsible for its own index will make it very easy to throw
>>> corrupt indices away and re-generate new ones.
>>>
>>> I did take a look at the visitor stuff in Infinispan before, but I
>>> wasn't really sure where the best place to hook into would be to find
>>> out which objects are being stored locally or evicted. If someone has
>>> a good idea of where to start, I'd be happy to lend a hand to this
>>> effort!
>>>
>>> Ray
>>>
>>>
>>> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
>>> <michael.neale at gmail.com> wrote:
>>>> I think you just stuck a pin in the bubble that normally says "magic
>>>> happens here" ;)
>>>>
>>>> How much of this did you tackle regarding Hibernate Search that could
>>>> be applied here?
>>>>
>>>> (your final point re duplication may have some "flexibility" I think?)
>>>>
>>>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>>>> <emmanuel at hibernate.org> wrote:
>>>>> Neither 1 nor 2 imply *distributed* queries.
>>>>>
>>>>> The hard parts with distributed queries (i.e. executed on a grid and
>>>>> recomposed) are:
>>>>> - making sure you ask all the nodes where the index is distributed
>>>>> (you can't miss a node)
>>>>> - finding a way to index only a subset of the data in a given index
>>>>> (on a given node). Applying the Infinispan distribution routine to
>>>>> the InfinispanDirectory does not do that; it chunks data arbitrarily.
>>>>> - being able to rebuild a given index on a given node (i.e.
>>>>> remembering which elements were indexed)
>>>>> - finding a way to distribute your data without duplication. If a
>>>>> key is indexed multiple times, then you end up with duplicated
>>>>> results that can't trivially be de-duplicated.
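
The first and last points above can be sketched as a simple scatter-gather: ask every node, refuse to answer if any node is missing, and recompose one ranked list. This is a toy model under stated assumptions: nodes are plain callables, each key is indexed on exactly one node, and scores are already comparable across nodes. All names are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (key, score)
NodeSearch = Callable[[str], List[Hit]]


class IncompleteQueryError(Exception):
    """Raised when at least one index-holding node did not answer."""


def distributed_query(query: str, nodes: Dict[str, NodeSearch], limit: int) -> List[Hit]:
    """Run the query on every node and recompose one ranked result list."""
    gathered: List[Hit] = []
    failed: List[str] = []
    for name, search in nodes.items():
        try:
            gathered.extend(search(query))
        except ConnectionError:
            failed.append(name)
    if failed:
        # A missing node means missing documents, so refuse a partial answer.
        raise IncompleteQueryError(f"no answer from: {sorted(failed)}")
    gathered.sort(key=lambda hit: (-hit[1], hit[0]))
    return gathered[:limit]


def unreachable(query: str) -> List[Hit]:
    """Stand-in for a node that cannot be contacted."""
    raise ConnectionError("node is down")


nodes = {
    "node-a": lambda q: [("doc-1", 0.8), ("doc-2", 0.3)],
    "node-b": lambda q: [("doc-3", 0.5)],
}
top_two = distributed_query("infinispan", nodes, limit=2)
```

Adding `unreachable` to `nodes` makes the call raise `IncompleteQueryError` instead of silently returning results from a subset of the index.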
>>>>>
>>>>> Happy thinking.
>>>>>
>>>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>>>
>>>>>> 2009/9/17 Michael Neale <michael.neale at gmail.com>:
>>>>>>> I am still not entirely sure what I am asking, but look forward to
>>>>>>> your merged-in changes (they are in another branch right now, yes?).
>>>>>>>
>>>>>>> Yes, I mean querying objects - I was under the impression that
>>>>>>> Lucene was used for the indexing of the data to service these
>>>>>>> queries?
>>>>>>
>>>>>> Sure, to clarify: there's work going on on two different aspects,
>>>>>> which complement each other in the ideal setup:
>>>>>>
>>>>>> 1) Be able to query a Lucene index (wherever you store that) to find
>>>>>> objects which are located inside Infinispan; this is about how to
>>>>>> search them and how to keep the index in sync with Infinispan's
>>>>>> content.
>>>>>>
>>>>>> 2) Store a Lucene index inside Infinispan instead of, for example,
>>>>>> the filesystem. In this case we're not concerned about what you
>>>>>> index; the Lucene interface is the usual one and you should be able
>>>>>> to replace the Directory implementation in existing applications.
>>>>>>
>>>>>> So 1) is the branch you've found, and Navin is working on that; 2)
>>>>>> is not yet in Subversion. The latest patch is attached to the other
>>>>>> thread by Łukasz and is to be applied on Hibernate Search's trunk
>>>>>> (and depends on Infinispan).
>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>>>> <nsurtani at redhat.com> wrote:
>>>>>>>>
>>>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>>>
>>>>>>>>> oh ok nice - could you point me at which branch to try to find
>>>>>>>>> some
>>>>>>>>> tests to play with?
>>>>>>>>
>>>>>>>> If you're talking about querying objects in Infinispan:
>>>>>>>>
>>>>>>>> The eventual goal is to be able to have different configurations
>>>>>>>> for how you want to index your data. Manik has given me the 'OK'
>>>>>>>> to push a simple query interface for CR1 for Monday/Tuesday.
>>>>>>>>
>>>>>>>> I'm kind of pressed with getting the code working for this, and
>>>>>>>> between moving house and the lack of internet there I'll be a bit
>>>>>>>> quiet. However, I'll get a wiki up by the end of the week about
>>>>>>>> how this all works.
>>>>>>>>
>>>>>>>> If you're not, then I assume you're talking about using Lucene to
>>>>>>>> index into Infinispan?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>>>> <sanne.grinovero at gmail.com> wrote:
>>>>>>>>>> 2009/9/16 Michael Neale <michael.neale at gmail.com>:
>>>>>>>>>>> regarding indexing and queries - is the current aim to not
>>>>>>>>>>> require that the index for the entire data grid exist on a
>>>>>>>>>>> single node?
>>>>>>>>>>>
>>>>>>>>>>> (asking as a potential user who is wrestling with Lucene indexes
>>>>>>>>>>> at the moment and is curious).
>>>>>>>>>>
>>>>>>>>>> Yes, the concept is to store the Lucene index itself in the grid,
>>>>>>>>>> so it will be distributed, and the segments you use most get
>>>>>>>>>> cached locally. At the moment you have to select only one node to
>>>>>>>>>> write to the index, but all other nodes should be able to read.
>>>>>>>>>> Feel free to test it, as we need feedback.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Michael D Neale
>>>>>>>>>>> home: www.michaelneale.net
>>>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> infinispan-dev mailing list
>>>>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> Navin Surtani
>>>>>>>>
>>>>>>>> Intern Infinispan
>>>>>>>> Intern JBoss Cache Searchable
>>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> hibernate-dev mailing list
>>>>>> hibernate-dev at lists.jboss.org
>>>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>>>
>>>>>
>>>
>>>
>>>
>>> --
>>> Ray Hilton
>>> -
>>> email: ray at wirestorm.net
>>> melbourne: +61 (0) 3 9077 0513
>>> mobile: +61 (0) 430 484 708
>>>