Re: [infinispan-dev] [hibernate-dev] Distributed queries

Tuesday, 22 September 2009

On 22 Sep 2009, at 03:00, Michael Neale wrote:

...
 I guess that could make sense for some cases - if the data changes are
 small-ish, and the index calculation cost isn't huge... I guess if the
 objects you are parking in infinispan are large, then it could end up
 more efficient to only do the index once and then spread it around
 (sticky to the data that it represents). 

I believe the plan is to build in a few different configs to suit
different use-cases. For example: a lot of "small" objects on not
necessarily many nodes, where all nodes share the same index; or a
lot of "big" objects sitting on disk on each individual node, where
replication could be expensive.

Or at least that is what I understood when speaking with Manik a couple
of weeks ago :-).

...

 On Mon, Sep 21, 2009 at 11:09 PM, Ray Hilton <ray(a)wirestorm.net> wrote:
> I'm guessing that something similar already happens so that Infinispan
> can re-jig the data around the grid.  Forgive my lack of intimate
> knowledge of how Infinispan works here, but at some point the data
> that was hosted by a bad node needs to be re-distributed?
>
> On Mon, Sep 21, 2009 at 11:02 PM, Emmanuel Bernard
> <emmanuel(a)hibernate.org> wrote:
>> Could be possible. That would likely be chatty, though, each time a node
>> comes or goes.
>> Typically when a node goes down, potentially due to a network error, you
>> don't wanna be chatty I imagine ;)
>>
>> On 21 sept. 09, at 14:59, Ray Hilton wrote:
>>
>>> Yes, point taken.
>>>
>>> Is there perhaps a way to only index an object on one node?  For
>>> example, if each node knew there were currently 3 copies, and it was
>>> the node with the lowest id, it would index the document.
>>> When a new node joins or a node fails, the strategy is re-applied and
>>> the node-local indices are updated accordingly.
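
A rough Java sketch of the lowest-id rule Ray describes above, for illustration only. The class `SingleIndexerRule`, the method `shouldIndex` and the integer node ids are hypothetical names, not Infinispan API, and the sketch assumes every node sees the same list of owners for a key.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: decides whether the local node should index a key. */
final class SingleIndexerRule {
    private SingleIndexerRule() {}

    /**
     * @param localNodeId id of this node (assumed unique within the grid)
     * @param ownerIds    ids of all nodes currently holding a copy of the key
     * @return true only if this node holds a copy and has the lowest id
     */
    static boolean shouldIndex(int localNodeId, List<Integer> ownerIds) {
        // A node that holds no copy never indexes the key.
        if (ownerIds.isEmpty() || !ownerIds.contains(localNodeId)) {
            return false;
        }
        // Exactly one owner has the minimum id, so exactly one node indexes.
        return localNodeId == Collections.min(ownerIds);
    }
}
```

When a node joins or fails, each node would re-run `shouldIndex` for the keys it holds and add or drop the matching entries in its local index. That re-evaluation is the "chatty" part mentioned further up the thread.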
>>>
>>> On Mon, Sep 21, 2009 at 6:06 PM, Emmanuel Bernard
>>> <emmanuel(a)hibernate.org> wrote:
>>>> Hello
>>>> See inline
>>>>
>>>> On 20 sept. 09, at 06:01, Ray Hilton wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I've been following the distributed query stuff with interest, but
>>>>> this is the first time I'm posting, so please excuse the lack of
>>>>> intimate knowledge of Infinispan.  Basically, I have been working on a
>>>>> project that could really do with the Holy Grail of a distributed
>>>>> query-able cache and I really liked the look of using JBossCache + the
>>>>> Lucene Directory implementation that Manik wrote a while back.  I then
>>>>> noticed Infinispan and talk of building querying directly into the
>>>>> project and figured that it would be worthwhile waiting to see how
>>>>> that panned out.
>>>>>
>>>>> I've thought a bit about how something like this might work. I'm not
>>>>> sure if this will be in any way helpful, but here goes.  I guess there
>>>>> are two approaches:  1) store the index (or partitioned indices) in
>>>>> the grid and sync it to a node to do a particular query, or 2) each
>>>>> node has an index for the data it currently caches.  We preferred the
>>>>> second idea as it offers a natural way to partition the indices (i.e.
>>>>> however Infinispan is configured to do it).  The first option would
>>>>> mean you end up with either a monolithic index in the grid, or
>>>>> partitions based on, say, date, that have to be sync'd en masse to
>>>>> whichever node(s) are doing a query.  I realise that the second
>>>>> technique would produce duplicates, but I'm sure there would be a way
>>>>> to eliminate dupes based on the object's uuid (something I'm pretty
>>>>> sure Infinispan already has a notion of).
>>>>
>>>> Well, 2 looks nicer but I don't know an obvious way to solve the
>>>> duplication issues:
>>>>  - returning the same content several times does alter the scoring of
>>>>    other documents
>>>>  - it prevents efficient pagination, as somehow you need to skip
>>>>    several results.
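
To make the uuid-based de-duplication idea concrete, here is a minimal Java sketch that merges per-node hit lists and keeps the best-scoring hit per key. `HitMerger` and `Hit` are hypothetical types, not Lucene or Infinispan classes. Note that it only collapses duplicates after the fact; it does nothing about the two issues listed above (scores are already computed per node, and a page of N raw hits may shrink once duplicates are merged).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side merge of hits returned by several nodes. */
final class HitMerger {
    private HitMerger() {}

    /** One search hit: the cache key (e.g. a uuid) and the node-local score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /**
     * Merges per-node hit lists, keeping only the highest-scoring hit for
     * each key, and returns the result sorted by descending score.
     */
    static List<Hit> merge(List<List<Hit>> perNodeHits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (List<Hit> hits : perNodeHits) {
            for (Hit h : hits) {
                Hit seen = best.get(h.key);
                if (seen == null || h.score > seen.score) {
                    best.put(h.key, h);
                }
            }
        }
        List<Hit> merged = new ArrayList<>(best.values());
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }
}
```

Keeping the highest score per key is an arbitrary choice for the sketch; any deterministic rule would do as long as every client applies the same one.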
>>>>
>>>>>
>>>>> We would also need to come up with a way of normalising the scoring
>>>>> across all partitions (regardless of which method is used).  I have
>>>>> seen this done before, and it would basically involve, per-query,
>>>>> finding out the term frequency of the various keywords across the
>>>>> entire index, or at least enough of it to produce a representative
>>>>> value.  This would be used to calculate the score for each hit when
>>>>> doing the actual search, and thus the ranking.
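
A minimal Java sketch of the per-query statistics gathering described above, assuming each partition can report, for a given term, how many of its documents contain the term (the document frequency) and how many documents it holds in total. `GlobalIdf` and `PartitionStats` are hypothetical names, and the formula 1 + ln(N / (df + 1)) is the textbook tf-idf style weighting, used here as an assumption rather than as what any particular Lucene version computes. It also assumes each document is counted by exactly one partition.

```java
import java.util.List;

/** Hypothetical aggregation of per-partition term statistics. */
final class GlobalIdf {
    private GlobalIdf() {}

    /** Statistics one partition reports for a single term. */
    static final class PartitionStats {
        final long docFreq; // documents in this partition containing the term
        final long numDocs; // total documents in this partition

        PartitionStats(long docFreq, long numDocs) {
            this.docFreq = docFreq;
            this.numDocs = numDocs;
        }
    }

    /**
     * Sums the statistics of all partitions and derives one inverse
     * document frequency that every partition can then use for scoring.
     */
    static double idf(List<PartitionStats> partitions) {
        long docFreq = 0;
        long numDocs = 0;
        for (PartitionStats p : partitions) {
            docFreq += p.docFreq;
            numDocs += p.numDocs;
        }
        // The +1 in the denominator avoids division by zero for unseen terms.
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

The point of the sketch is only that the weight is computed once from grid-wide totals, so a term that is rare on one node but common overall does not get an inflated score there.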
>>>>
>>>> I believe Lucene does normalize the score properly when using the
>>>> remote IndexSearcher, as the normalization is done on the "client"
>>>> side.
>>>>
>>>>>
>>>>> We have had issues with index corruption in the past as well (probably
>>>>> due to programming bugs rather than Lucene).  Making each node
>>>>> responsible for its own index will make it very easy to throw corrupt
>>>>> indices away and re-generate new ones.
>>>>>
>>>>> I did take a look at the visitor stuff in Infinispan before, but I
>>>>> wasn't really sure where the best place to hook into would be to find
>>>>> out which objects are being stored locally or evicted.  If someone has
>>>>> a good idea of where to start, I'd be happy to lend a hand to this
>>>>> effort!
>>>>>
>>>>> Ray
>>>>>
>>>>>
>>>>> On Sat, Sep 19, 2009 at 8:43 PM, Michael Neale
>>>>> <michael.neale(a)gmail.com> wrote:
>>>>>> I think you just stuck a pin in the bubble that normally says "magic
>>>>>> happens here" ;)
>>>>>>
>>>>>> How much of this did you tackle regarding Hibernate Search that could
>>>>>> be applied here?
>>>>>>
>>>>>> (your final point re duplication may have some "flexibility" I
>>>>>> think?)
>>>>>>
>>>>>> On Fri, Sep 18, 2009 at 6:18 PM, Emmanuel Bernard
>>>>>> <emmanuel(a)hibernate.org> wrote:
>>>>>>> Neither 1 nor 2 imply *distributed* queries.
>>>>>>>
>>>>>>> The hard parts with distributed queries (i.e. executed on a grid and
>>>>>>> recomposed) are:
>>>>>>>  - making sure you ask all the nodes where the index is distributed
>>>>>>>    (you can't miss a node)
>>>>>>>  - finding a way to index only a subset of the data in a given index
>>>>>>>    (on a given node). Applying the Infinispan distribution routine to
>>>>>>>    the InfinispanDirectory does not do that; it chunks data
>>>>>>>    arbitrarily.
>>>>>>>  - being able to rebuild a given index on a given node (i.e.
>>>>>>>    remembering which elements were indexed)
>>>>>>>  - finding a way to distribute your data without duplication. If a
>>>>>>>    key is indexed multiple times, then you end up with duplicated
>>>>>>>    results that can't trivially be de-duplicated.
>>>>>>>
>>>>>>> Happy thinking.
>>>>>>>
>>>>>>> On 17 sept. 09, at 10:32, Sanne Grinovero wrote:
>>>>>>>
>>>>>>>> 2009/9/17 Michael Neale <michael.neale(a)gmail.com>:
>>>>>>>>> I am still not entirely sure what I am asking, but look forward to
>>>>>>>>> your merged-in changes (they are in another branch right now,
>>>>>>>>> yes?).
>>>>>>>>>
>>>>>>>>> Yes I mean querying objects - I was under the impression that
>>>>>>>>> Lucene was used for the indexing of the data to service these
>>>>>>>>> queries?
>>>>>>>>
>>>>>>>> Sure, to clarify: there's work going on on two different aspects,
>>>>>>>> which complement each other in the ideal setup:
>>>>>>>>
>>>>>>>> 1) Be able to query a Lucene index (wherever you store that) to find
>>>>>>>> objects which are located inside Infinispan; this is about how to
>>>>>>>> search them and how to keep the index in sync with Infinispan's
>>>>>>>> content.
>>>>>>>>
>>>>>>>> 2) Store a Lucene index inside Infinispan, instead of, for example,
>>>>>>>> the filesystem. In this case we're not concerned about what you
>>>>>>>> index; the Lucene interface is the usual one and you should be able
>>>>>>>> to replace the Directory implementation in existing applications.
>>>>>>>>
>>>>>>>> So 1) is the branch you've found, and Navin is working on that; 2)
>>>>>>>> is not yet in subversion. The latest patch is attached to the other
>>>>>>>> thread by Łukasz, and is to be applied on Hibernate Search's trunk
>>>>>>>> (and depends on Infinispan).
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 16, 2009 at 10:32 PM, Navin Surtani
>>>>>>>>> <nsurtani(a)redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 16 Sep 2009, at 12:25, Michael Neale wrote:
>>>>>>>>>>
>>>>>>>>>>> oh ok nice - could you point me at which branch to try to find
>>>>>>>>>>> some tests to play with?
>>>>>>>>>>
>>>>>>>>>> If you're talking about querying objects in Infinispan:
>>>>>>>>>>
>>>>>>>>>> The eventual goal is to be able to have different configurations on
>>>>>>>>>> how you want to index your data. Manik has given me the 'OK' to
>>>>>>>>>> push a simple query interface for CR1 for Monday/Tuesday.
>>>>>>>>>>
>>>>>>>>>> I'm kind of pressed with getting the code working for this and,
>>>>>>>>>> between moving house and the lack of internet there, I'll be a bit
>>>>>>>>>> quiet. However, I'll get a wiki up by the end of the week about how
>>>>>>>>>> this all works.
>>>>>>>>>>
>>>>>>>>>> However, if you're not, then I assume you're talking about using
>>>>>>>>>> Lucene to index into Infinispan?
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 16, 2009 at 6:05 PM, Sanne Grinovero
>>>>>>>>>>> <sanne.grinovero(a)gmail.com> wrote:
>>>>>>>>>>>> 2009/9/16 Michael Neale <michael.neale(a)gmail.com>:
>>>>>>>>>>>>> regarding indexing and queries - is the current aim to not
>>>>>>>>>>>>> require that the index for the entire data grid exist on a single
>>>>>>>>>>>>> node?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (asking as a potential user who is wrestling with Lucene indexes
>>>>>>>>>>>>> at the moment and is curious).
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, the concept is to store the Lucene index itself in the grid,
>>>>>>>>>>>> so it will be distributed, and the segments you use most get cached
>>>>>>>>>>>> locally.
>>>>>>>>>>>> At the moment you have to select only one node to write to the
>>>>>>>>>>>> index, but all other nodes should be able to read.
>>>>>>>>>>>> Feel free to test it as we need feedback.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Michael D Neale
>>>>>>>>>>>>> home: www.michaelneale.net
>>>>>>>>>>>>> blog: michaelneale.blogspot.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> infinispan-dev mailing list
>>>>>>>>>>>>> infinispan-dev(a)lists.jboss.org
>>>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Navin Surtani
>>>>>>>>>>
>>>>>>>>>> Intern Infinispan
>>>>>>>>>> Intern JBoss Cache Searchable
>>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> hibernate-dev mailing list
>>>>>>>> hibernate-dev(a)lists.jboss.org
>>>>>>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ray Hilton
>>>>> -
>>>>>         email: ray(a)wirestorm.net
>>>>> melbourne: +61 (0) 3 9077 0513
>>>>>       mobile: +61 (0) 430 484 708
>>>>>
>>
>>
Navin Surtani

Intern Infinispan
Intern JBoss Cache Searchable
