[infinispan-dev] inverted distributed query

Tue Jun 19 18:08:41 EDT 2012

Replying to myself: after an IRC chat the problem is clearer.
The scaling problem is not the amount of data to match against, but the
number of Queries registered in the system, against which each new
Document needs to be matched.

Assuming we can store the Queries in the grid as Lucene Query
instances (you'll need to figure out some way to serialize them, but that
should be easy since you control how they are created), you don't index the
Document in the usual Lucene index; instead you create an instance of
org.apache.lucene.index.memory.MemoryIndex holding just that Document.

There is a full example in the javadocs:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html

Keep in mind this is not included in the Lucene core jar; you'll need
the additional dependency "lucene-memory".

I guess the rest is trivial: wrap it in a Map/Reduce task and have
it run against all the Queries stored in the grid.
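
A minimal sketch of that last step, in plain Java with no Lucene or
Infinispan dependency. Everything named here is a hypothetical stand-in:
a Set of terms plays the role of the per-document MemoryIndex, a
Predicate plays the role of a stored Lucene Query, and a List of Maps
plays the role of the grid nodes each holding some of the Queries. It
only illustrates the shape of the task (match locally on each node,
then union the results), not the real APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

public class InvertedMatchSketch {

    // Stand-in for building a MemoryIndex: tokenize the single incoming
    // document into a set of lower-cased terms.
    public static Set<String> indexDocument(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Stand-in for a stored Query: matches when the document contains
    // every one of the required terms.
    public static Predicate<Set<String>> allOf(String... required) {
        List<String> terms = Arrays.asList(required);
        return docTerms -> docTerms.containsAll(terms);
    }

    // "Map" phase: one node matches the document against the queries it
    // holds locally and returns the ids of those that match.
    public static Set<String> matchLocal(
            Map<String, Predicate<Set<String>>> localQueries,
            Set<String> docTerms) {
        Set<String> matched = new TreeSet<>();
        for (Map.Entry<String, Predicate<Set<String>>> e : localQueries.entrySet()) {
            if (e.getValue().test(docTerms)) {
                matched.add(e.getKey());
            }
        }
        return matched;
    }

    // "Reduce" phase: union of the per-node results.
    public static Set<String> matchAll(
            List<Map<String, Predicate<Set<String>>>> nodes,
            String documentText) {
        Set<String> docTerms = indexDocument(documentText);
        Set<String> result = new TreeSet<>();
        for (Map<String, Predicate<Set<String>>> node : nodes) {
            result.addAll(matchLocal(node, docTerms));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Set<String>>> nodeA = new HashMap<>();
        nodeA.put("q1", allOf("lucene", "memory"));
        nodeA.put("q2", allOf("hadoop"));
        Map<String, Predicate<Set<String>>> nodeB = new HashMap<>();
        nodeB.put("q3", allOf("index"));

        // Prints the ids of the stored queries matching the new document.
        System.out.println(
                matchAll(Arrays.asList(nodeA, nodeB), "Lucene memory index"));
    }
}
```

The cost per document grows with the number of stored Queries, but each
node only evaluates the ones it holds, which is the point of pushing the
work into the grid.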

Sorry for the confusion in my first answer.

Cheers,
Sanne

On 19 June 2012 20:40, Sanne Grinovero <sanne at infinispan.org> wrote:
> Hi Ales,
>
> there are several strategies; which one works best depends on several
> factors, not least the number of queries, the index size, how much memory
> we can dedicate to query caches, and what the ratio of updates is.
>
> A Lucene Query produces a sparse BitSet; you can think of it as an
> ordered list of matching ids. A common use case is to wrap this
> BitSet as a Filter so that it can be cached, reused and applied as a
> mask on other queries.
>
> Assuming your set of predefined queries is rather limited, you can
> cache all these BitSets. When you deal with a specific document,
> you search for it by "primary key" in the index (which is a very
> efficient query) to get its identifier (its index in the BitSet),
> and then you just check which queries' BitSets have that bit set.
>
> The good news is that reusing those BitSets is very efficient; the bad
> news is that you have to rebuild some part of each BitSet (10% on
> average with default configurations) every time an index update is applied.
> As a consequence, if what you need is to list which queries match
> every document you *insert* (as opposed to documents you only read),
> this is going to be an expensive approach.
>
> Are you going to need this both for a Map/Reduce Query and a Lucene
> Query, or are you just implying that both approaches would be fine for
> you?
>
> Do you have a practical example of such a Query? I'm wondering if
> you're looking for features like MoreLikeThis or tagging suggestions,
> which can be implemented more efficiently in different ways.
>
> Sanne
>
> On 19 June 2012 18:58, Ales Justin <ales.justin at gmail.com> wrote:
>> @Sanne, Vladimir: a think-task for you two :)
>>
>> With CapeDwarf we need the following feature -- just the opposite from query results.
>> A user has a document, and a set of pre-defined queries.
>> Now we need to see which queries match the given document.
>>
>> A naive implementation is to iterate over the queries and find the ones that match.
>> But this is of course not scalable.
>>
>> Any idea / suggestion on how to prepare Infinispan Query together with the Distributed Execution framework to handle such a feature?
>>
>> -Ales