Replying to myself: after an IRC chat the problem got clarified.
The scale problem is not the amount of data to match against but the
number of Queries registered in the system, against which the new
Document needs to be matched.
Assuming we can store the Queries as Lucene Query instances in the
grid (you'll need to figure out some way to serialize them, but that
should be easy since you control how they are created), you don't index
the Document in the usual Lucene index, but in a new instance of
org.apache.lucene.index.memory.MemoryIndex.
There is a full example in the javadocs:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/memor...
Keep in mind that this class is not included in the Lucene core jar;
you'll need the additional dependency "lucene-memory".
I guess the rest is trivial: wrap it into a Map/Reduce task and have
it fire against all stored queries in the grid.
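To make the shape of that task concrete, here is a rough sketch in plain Java. It is only an illustration under loud assumptions: ReverseMatchSketch, mapOnNode, reduce and terms are hypothetical names, the Predicate stands in for a stored Lucene Query, and the set of terms stands in for the MemoryIndex built from the Document. It uses neither the Lucene nor the Infinispan API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: a Predicate stands in for a stored Lucene Query,
// a Set of terms stands in for the in-memory index of the single Document.
class ReverseMatchSketch {

    // Stand-in for analysing the Document: lowercase, split on non-word characters.
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Map phase: one node checks the Document against the queries it holds locally.
    static Set<String> mapOnNode(Map<String, Predicate<Set<String>>> localQueries,
                                 Set<String> docTerms) {
        return localQueries.entrySet().stream()
                .filter(e -> e.getValue().test(docTerms))
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Reduce phase: merge the ids of matching queries reported by every node.
    static Set<String> reduce(List<Set<String>> perNodeResults) {
        Set<String> all = new TreeSet<>();
        perNodeResults.forEach(all::addAll);
        return all;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Set<String>>> nodeA = new HashMap<>();
        nodeA.put("q-infinispan", t -> t.contains("infinispan"));
        nodeA.put("q-hadoop", t -> t.contains("hadoop"));

        Map<String, Predicate<Set<String>>> nodeB = new HashMap<>();
        nodeB.put("q-lucene-and-grid", t -> t.contains("lucene") && t.contains("grid"));

        Set<String> doc = terms("Lucene queries stored in the Infinispan grid");
        Set<String> matches = reduce(List.of(mapOnNode(nodeA, doc), mapOnNode(nodeB, doc)));
        System.out.println(matches);
    }
}
```

In a real setup the map step would presumably build the MemoryIndex once per node and run each locally stored Query against it, and the reduce step would be done by the grid's distributed execution rather than by a local list.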
Sorry for the confusion I had in the first answer.
Cheers,
Sanne
On 19 June 2012 20:40, Sanne Grinovero <sanne(a)infinispan.org> wrote:
Hi Ales,
there are several strategies; which one works best depends on several
factors, not least on how many queries there are, the index size, how
much memory we can dedicate to query caches, and what the ratio of
updates is.
A Lucene Query produces a sparse BitSet, which you can think of as an
ordered list of matching ids. A common use case is to wrap this
BitSet as a Filter so that it can be cached, reused and applied as
a mask on other queries.
Assuming your set of predefined queries is rather limited, you can
cache all these BitSets. When you deal with a specific document,
you search for it by "primary key" in the index (which is a very
efficient query) to get its identifier (its index in the
BitSet), and then you just check which queries have a match.
The good news is that reusing those BitSets is very efficient; the bad
news is that you have to rebuild some part of each BitSet (on average
10% with default configurations) every time an index update is applied.
As a consequence, if what you need to do is list which queries match
for every document you *insert* (as opposed to documents you just
read), this is going to be an expensive approach.
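The cached-BitSet lookup described above can be sketched with java.util.BitSet alone. This is a minimal illustration, not Lucene code: QueryBitSetCache and its methods are hypothetical names, and the document ids passed to cache() are assumed to be the ids a real query run would have produced.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps one BitSet of matching document ids per predefined query.
class QueryBitSetCache {
    private final Map<String, BitSet> matchesByQuery = new LinkedHashMap<>();

    // Caches the ids matched by a query (assumed to come from running it on the index).
    void cache(String queryName, int... matchingDocIds) {
        BitSet bits = new BitSet();
        for (int id : matchingDocIds) {
            bits.set(id);
        }
        matchesByQuery.put(queryName, bits);
    }

    // Returns the names of all cached queries whose BitSet has the bit for docId set.
    List<String> matchingQueries(int docId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : matchesByQuery.entrySet()) {
            if (e.getValue().get(docId)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        QueryBitSetCache cache = new QueryBitSetCache();
        cache.cache("title:lucene", 0, 3, 7);
        cache.cache("author:sanne", 3, 4);
        cache.cache("tag:grid", 1, 7);

        // docId would come from the "primary key" lookup in the index.
        int docId = 3;
        System.out.println(cache.matchingQueries(docId));
    }
}
```

Each lookup is one bit test per cached query, which is why reads are cheap; the cost sits in cache(), which would have to be redone for the affected part of every BitSet after an index update.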
Are you going to need this both for a Map/Reduce Query and a Lucene
Query, or are you just implying that both approaches would be fine for
you?
Do you have a practical example of such a Query? I'm wondering if
you're looking for features like MoreLikeThis or tagging suggestions,
which can be implemented more efficiently in different ways.
Sanne
On 19 June 2012 18:58, Ales Justin <ales.justin(a)gmail.com> wrote:
> @Sanne, Vladimir: a think-task for you two :)
>
> With CapeDwarf we need the following feature -- just the opposite from query results.
> A user has a document, and a set of pre-defined queries.
> Now we need to see which queries match the given document.
>
> A dummy impl is to iterate over queries and find the ones that match.
> But, this is of course not scalable.
>
> Any idea / suggestion on how to prepare Infinispan Query together with Distributed Execution framework to handle such feature?
>
> -Ales
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev