[infinispan-dev] Re: Infinispan and search APIs

Manik Surtani manik at jboss.org
Wed May 6 06:34:39 EDT 2009


Emmanuel and I started discussing this offline; bringing it onto the  
ML now.

On 5 May 2009, at 16:49, Emmanuel Bernard wrote:

>
> On 1 May 2009, at 16:13, Manik Surtani wrote:
>
>> Hey dude
>>
>> When you have some time, could we chat about this?  Since we have a  
>> chance to build any hooks into Infinispan that we may need in  
>> future for the query API, it would be good to get an idea of this  
>> now.
>
> you need:
> - change notification (create, update, delete)
> - a way to mark a change notification as belonging to a context (tx  
> usually)
> - a way to execute things at context end (typically tx.commit)
> - start / stop notifications, including the ability to pass  
> properties the Infinispan way (probably materialized as a  
> Map<String, Object>) and the list of object types being indexed  
> (relaxing this constraint leads to a reduction of concurrency in  
> HSearch)
> - the ability to access the HSearch change interceptor to retrieve  
> the SearchFactory from the main Infinispan API (could be an SPI);  
> i.e. from a Hibernate SessionImpl we can access the SessionFactory,  
> the interceptor and then the SearchFactory
> - might need a way to derive an object's key from the object (not  
> sure)
> - the ability to load a set of objects by id (in batch)

Yes, this is all stuff we figured out in JBC-Searchable.  All  
straightforward enough to do.  And rather than implementing this as a  
listener as we did in JBC, we ought to make it an interceptor - it  
will perform better and gives greater access to the transaction  
context.
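To make the interceptor idea concrete, here is a minimal sketch of queuing change notifications per transaction context and flushing them to the index only at commit.  All names here (IndexingInterceptor, onWrite, ChangeEvent, the String tx id) are illustrative stand-ins, not actual Infinispan or HSearch API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexingInterceptor {
    public enum Op { CREATE, UPDATE, DELETE }

    public static final class ChangeEvent {
        final Op op; final Object key; final Object value;
        ChangeEvent(Op op, Object key, Object value) {
            this.op = op; this.key = key; this.value = value;
        }
    }

    // Changes are queued per transaction context and only hit the index at commit.
    private final Map<String, List<ChangeEvent>> pendingByTx = new HashMap<>();
    private final List<ChangeEvent> index = new ArrayList<>();  // stands in for the Lucene index

    // Called from the write path (put / replace / remove) inside a tx.
    public void onWrite(String txId, Op op, Object key, Object value) {
        pendingByTx.computeIfAbsent(txId, t -> new ArrayList<>())
                   .add(new ChangeEvent(op, key, value));
    }

    // Flush queued changes to the index when the context ends (tx.commit).
    public void commit(String txId) {
        List<ChangeEvent> changes = pendingByTx.remove(txId);
        if (changes != null) index.addAll(changes);
    }

    // Discard queued changes on rollback; the index never sees them.
    public void rollback(String txId) { pendingByTx.remove(txId); }

    public int indexedChangeCount() { return index.size(); }
}
```

Sitting in the interceptor chain rather than a listener means the tx context is right there when the commit command passes through.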

> I think those are the most important hooks.
> Also the ability to put a Lucene index on infinispan

Well, there should be two options for storing indexes.  The first, a  
simplistic approach where indexes are replicated everywhere, is  
achieved by using a separate cache for indexes which uses REPL  
regardless of the cache mode the data cache uses.  Any query on any  
instance uses the cached indexes to retrieve matching keys, and then  
does a get() on the data cache for each key.  This load may retrieve  
entries from across the network if the cache is in DIST mode, load  
off a cache store, etc.
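A rough sketch of this first option, with plain Maps standing in for the two caches (ReplicatedIndexQuery and its term-splitting "indexing" are illustrative, not Infinispan API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReplicatedIndexQuery {
    // Index cache: replicated (REPL) everywhere, maps a term to matching keys.
    private final Map<String, Set<String>> indexCache = new HashMap<>();
    // Data cache: may run in DIST mode or sit over a cache store; gets can go remote.
    private final Map<String, String> dataCache = new HashMap<>();

    public void put(String key, String value) {
        dataCache.put(key, value);
        for (String term : value.toLowerCase().split("\\s+"))
            indexCache.computeIfAbsent(term, t -> new HashSet<>()).add(key);
    }

    // Any instance queries its local copy of the replicated index for keys,
    // then does a get() per key on the data cache to load the entries.
    public List<String> query(String term) {
        List<String> results = new ArrayList<>();
        for (String key : indexCache.getOrDefault(term.toLowerCase(),
                                                  Collections.emptySet()))
            results.add(dataCache.get(key));  // may fetch across the network in DIST mode
        return results;
    }
}
```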

The second approach to storing indexes is more interesting IMO.  Each  
node only maintains LOCAL (*) indexes for the keys mapped to it.   
E.g., in a cluster of {A, B, C, D, E}, with {K1, K2, K3} mapped to A  
and {K4, K5, K6} mapped to C, A stores indexes for K1 - K3 locally  
and C stores indexes for K4 - K6.  Running a query on,  
say, E, would result in a query RPC call broadcast around the cluster  
and each node runs the query on their local indexes only, returning  
results to E.  E then collates these results.  I think this is much  
more scalable as a) the indexes themselves are fragmented and the  
system will be able to cope with more indexes as you add more nodes,  
and b) processing time to search through the indexes is divided up  
between the processors.

Naturally there is overhead in broadcasting the query and collating  
results, which is why we need efficient result set implementations  
that retrieve results lazily, etc.  E.g., your result iterator  
should not wait for all results to arrive before starting to iterate  
through what you have, and nodes should first send back result counts  
before actually loading and sending back results.  But this is part  
of what we need to design.
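The broadcast-and-collate shape above can be sketched as follows.  The Node class, the in-memory "local index" and the synchronous loop standing in for the query RPC are all assumptions for illustration; a real implementation would go over the transport and stream results lazily:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistributedQuery {
    // A node holds only the entries (and hence indexes) for keys mapped to it.
    public static final class Node {
        final Map<String, String> localData = new HashMap<>();
        public void put(String k, String v) { localData.put(k, v); }

        // Each node runs the query against its LOCAL index only.
        public List<String> localMatches(String term) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, String> e : localData.entrySet())
                if (e.getValue().contains(term)) hits.add(e.getKey());
            return hits;
        }
    }

    // The querying node "broadcasts" the query, gathers per-node counts first
    // (so an iterator could start before all payloads arrive), then collates.
    public static List<String> broadcastQuery(List<Node> cluster, String term) {
        int expected = 0;
        List<List<String>> perNode = new ArrayList<>();
        for (Node n : cluster) {              // stands in for a query RPC to each node
            List<String> hits = n.localMatches(term);
            expected += hits.size();          // result counts come back cheaply first
            perNode.add(hits);
        }
        List<String> collated = new ArrayList<>(expected);
        for (List<String> hits : perNode)     // a real impl would collate lazily
            collated.addAll(hits);
        return collated;
    }
}
```

Index size and search CPU both scale out with the node count, which is the point of the exercise.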

(*) We could use DIST for the indexes as well, provided the data  
indexed and the indexes are hashed to the same nodes for storage.   
This is, IMO, tricky since indexes would span several entries, not all  
of which are guaranteed to be hashed to the same nodes.  Hence my  
thoughts on indexes being stored in a LOCAL cache.

>> Like I was saying, one of the things we would need is the ability  
>> to do aggregate queries.  The "nice to have" would be to offer an  
>> EJBQL style interface rather than a Lucene one, but the second part  
>> is not crucial IMO.
>
> aggregate query is kinda possible:
> - count(*) is really just getResultSize()
> - distinct count(property) is harder, needs some aggregation in  
> memory, but is likely possible
> - sum() / avg() would need some actual aggregation in memory from  
> all matching results; we could store some values in the Lucene  
> index and sum them
> - group by / having is harder and will probably have to either be  
> done in memory or done by calling n queries, one per "group"  
> (likely doable)

If you feel this can be achieved easily enough using Lucene queries,  
I'm fine with this being the basis of our impl.
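Folding the matches in memory, as Emmanuel describes, might look something like this sketch, where a List<Double> stands in for a numeric value stored alongside each matching index entry (the class and its methods are illustrative, not Lucene or HSearch API):

```java
import java.util.HashSet;
import java.util.List;

public class AggregateQueries {
    // count(*) is just the size of the match set (getResultSize() in HSearch terms).
    public static int count(List<Double> matchedValues) {
        return matchedValues.size();
    }

    // sum()/avg() fold over a value stored with each matching index entry.
    public static double sum(List<Double> matchedValues) {
        double s = 0;
        for (double v : matchedValues) s += v;
        return s;
    }

    public static double avg(List<Double> matchedValues) {
        return matchedValues.isEmpty() ? 0
                                       : sum(matchedValues) / count(matchedValues);
    }

    // distinct count(property) needs in-memory de-duplication of the matches.
    public static int distinctCount(List<Double> matchedValues) {
        return new HashSet<>(matchedValues).size();
    }
}
```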

> JPA-QL query support is more of a fantasy, just like GAE has  
> "support" for JPA-QL. Every time a join is used, you would be fucked.

It's just a familiarity thing - I don't necessarily need JPA-QL in  
itself, just that it has decent and familiar support for aggregation  
and query writing in general.  Joins would obviously not be supported  
and the query parser would barf on encountering one.

The other plus with JPA-QL is that it could offer easy migration off  
JPA and onto a data grid, to some degree (certain caveats exist, such  
as joins).  Like I said, this isn't important, could even be a  
plugin for later on that translates JPA-QL to Lucene queries.
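A toy sketch of what such a plugin could do, handling only a single equality predicate and barfing on joins; the grammar handled here, the "*:*" match-all fallback and the class name are all assumptions for illustration, nothing resembling a real parser:

```java
public class JpaQlToLucene {
    // e.g. "select e from Entry e where e.name = 'foo'"  ->  "name:foo"
    public static String translate(String jpaql) {
        String q = jpaql.toLowerCase();
        if (q.contains(" join "))
            throw new UnsupportedOperationException(
                "joins are not supported on a data grid");
        int where = q.indexOf(" where ");
        if (where < 0) return "*:*";                      // no predicate: match all
        String predicate = jpaql.substring(where + 7).trim();
        String[] parts = predicate.split("=");            // only a single equality
        String field = parts[0].trim();
        field = field.substring(field.indexOf('.') + 1);  // strip the alias, e.g. "e."
        String value = parts[1].trim().replaceAll("'", "");
        return field + ":" + value;
    }
}
```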

Cheers
--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
