[hibernate-dev] [Search] Sharding and access to (subsets of) index readers and Lucene directories in HS 4.0

Thu Aug 25 06:09:50 EDT 2011

Hi all,

Yesterday I had a discussion with Sanne on IRC [3] about the new API for
accessing index readers in HS 4.0. We couldn't finish it then, so let's
continue here.
As explained in the forum [1], there is currently no good solution for
getting a reader with a subset of the indexes in a sharded environment.

So far, two basic ideas have come up:
A - Have a SearchFactory.openIndexReader(Class<?> c,
FullTextFilterImplementor...): This is similar to how the IndexManagers
are gathered at query time, and is therefore probably easy to understand.
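
To make idea A a bit more concrete, here is a minimal sketch of how a
varargs call could narrow the set of shards. Everything in it is
hypothetical: ShardFilter is only a stand-in for
FullTextFilterImplementor, ShardReader for an IndexReader, and the
"shard"/"name" selection rule is a toy assumption, not how Hibernate
Search resolves filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FullTextFilterImplementor: a named filter with parameters.
final class ShardFilter {
    final String name;
    final Map<String, String> params = new LinkedHashMap<>();

    ShardFilter(String name) { this.name = name; }

    ShardFilter with(String key, String value) {
        params.put(key, value);
        return this;
    }
}

// Hypothetical stand-in for an IndexReader spanning a set of shards.
final class ShardReader {
    final List<String> shards;

    ShardReader(List<String> shards) { this.shards = shards; }
}

// Illustration of idea A only; this is not the Hibernate Search API.
final class OptionAFactory {
    private final Map<Class<?>, List<String>> shardsByType = new LinkedHashMap<>();

    void register(Class<?> type, String... shardNames) {
        shardsByType.put(type, Arrays.asList(shardNames));
    }

    // Varargs as proposed: passing no filters selects every shard of the type.
    ShardReader openIndexReader(Class<?> type, ShardFilter... filters) {
        List<String> selected = new ArrayList<>();
        for (String shard : shardsByType.getOrDefault(type, new ArrayList<>())) {
            if (matches(shard, filters)) {
                selected.add(shard);
            }
        }
        return new ShardReader(selected);
    }

    // Toy rule (assumption): a filter named "shard" keeps only the shard
    // whose name equals its "name" parameter; other filters are ignored.
    private static boolean matches(String shard, ShardFilter... filters) {
        for (ShardFilter f : filters) {
            if ("shard".equals(f.name) && !shard.equals(f.params.get("name"))) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, openIndexReader(Foo.class) keeps today's behaviour (all
shards), and adding a filter only narrows the selection.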

B - (to be further reviewed) Have something like
searchFactory.indexReaders().withShardingOptions( X, Y
).includeType(Class<?> z).openIndexReader(). This also adds the ability
to get an IndexReader for multiple classes. But we need to think about
.withShardingOptions (or something similar): what input should we
support here? Sharding is mostly based on one or more entity
properties, which are probably easy to encode as a String. The (custom)
sharding strategy can then use that String to select the proper index
managers. Using a String to identify which index managers to use looks
fine to me. It would be compatible with current implementations of
custom sharding strategies, where one might use the Lucene document at
addition time; and if an entity instance is also passed in (see
discussion [2]), the properties of that entity can probably be encoded
as a String too. And if HS took over the mapping, i.e. supported Strings
as shard identifiers instead of a user-defined mapping to the (integer)
index in the array of IndexManagers, that would be awesome :)
(It would relieve the pain of having to store such a mapping somewhere,
which I currently do.)
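
A rough sketch of what idea B could look like with String shard keys
follows. All names other than the three chained method names are
assumptions made for illustration (OptionBFactory, IndexReaderContext,
the "Type.key" naming of index managers), and openIndexReader() returns
index manager names instead of a real IndexReader to keep the example
self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical per-invocation context: each indexReaders() call creates a
// fresh instance, so one chain never affects another chain or thread.
final class IndexReaderContext {
    private final List<String> knownShardKeys;
    private final Set<String> selectedKeys = new LinkedHashSet<>();
    private final Set<Class<?>> types = new LinkedHashSet<>();

    IndexReaderContext(List<String> knownShardKeys) {
        this.knownShardKeys = knownShardKeys;
    }

    // Shard keys are plain Strings, e.g. an entity property encoded as text.
    IndexReaderContext withShardingOptions(String... shardKeys) {
        selectedKeys.addAll(Arrays.asList(shardKeys));
        return this;
    }

    IndexReaderContext includeType(Class<?> type) {
        types.add(type);
        return this;
    }

    // Returns the names of the index managers a reader would span.
    // Assumption: no sharding options means "all shards of each type".
    List<String> openIndexReader() {
        List<String> names = new ArrayList<>();
        for (Class<?> type : types) {
            for (String key : knownShardKeys) {
                if (selectedKeys.isEmpty() || selectedKeys.contains(key)) {
                    names.add(type.getSimpleName() + "." + key);
                }
            }
        }
        return names;
    }
}

// Illustration of idea B only; this is not the Hibernate Search API.
final class OptionBFactory {
    private final List<String> knownShardKeys;

    OptionBFactory(String... knownShardKeys) {
        this.knownShardKeys = Arrays.asList(knownShardKeys);
    }

    IndexReaderContext indexReaders() {
        return new IndexReaderContext(knownShardKeys);
    }
}
```

The open question from above stays open in this sketch: whether Strings
are expressive enough as input to withShardingOptions depends on the use
cases we collect.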

Still, we need to know which use cases there might be, i.e. how much
flexibility the API should offer.

As is also mentioned in [1], there is currently no direct access to the
index managers, so getting an FSDirectory is not possible in
4.0alpha1. I think HS should support this, to offer the flexibility to
work on the Lucene indexes directly (for example, to build an
auto-completion/spell-check index from an existing index).

Let's start by setting up some requirements:
---------
*1 Have access to an IndexReader for one class
*2 Have access to an IndexReader over a subset of IndexManagers,
selected via the sharding strategy. Sharding strategies are mostly based
on one or more properties of an entity instance, which can likely be
encoded as a String.
*3 Have access to the index directories (FSDirectory/...). Unlike in
previous versions (< HS 4.0), it would be nice if this went through the
ShardingStrategy instance in use, so that the mapping is done completely
and exclusively in the ShardingStrategy.
* ...
---------
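
Requirement *3 could be read as in the following sketch, where one
String-keyed strategy owns the mapping and answers both the addition-time
and the directory-lookup question. The class and method names are
hypothetical, and java.nio.file.Path merely stands in for a Lucene
directory so that the example runs without Lucene on the classpath.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical String-keyed strategy: the key-to-directory mapping lives
// here and nowhere else, so callers never track array indexes themselves.
final class StringKeyedShardingStrategy {
    private final Path indexBase;
    private final Map<String, Path> directoryByKey = new TreeMap<>();

    StringKeyedShardingStrategy(Path indexBase) {
        this.indexBase = indexBase;
    }

    // Addition time: the shard key picks the directory and registers the
    // mapping on first use.
    Path directoryForAddition(String shardKey) {
        return directoryByKey.computeIfAbsent(shardKey, key -> indexBase.resolve(key));
    }

    // Lookup for readers or direct index access: unknown keys are skipped.
    List<Path> directoriesFor(String... shardKeys) {
        List<Path> result = new ArrayList<>();
        for (String key : shardKeys) {
            Path dir = directoryByKey.get(key);
            if (dir != null) {
                result.add(dir);
            }
        }
        return result;
    }

    // All directories known so far, ordered by shard key.
    List<Path> allDirectories() {
        return new ArrayList<>(directoryByKey.values());
    }
}
```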

Please extend/modify the list of requirements if you think something is
missing or incorrect, and share your thoughts on the ideas above.

Elmer

[1] https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
[2]
http://www.mailinglistarchive.com/html/hibernate-dev@lists.jboss.org/2011-08/msg00091.html
[3] IRC log:

<elmervc> sannegrinovero, have you read/did you have time to think about
https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
<sannegrinovero> hi elmervc , yes I've read it. my next thing on the
todo is to make some prototype, as I'm not happy with the current ideas:
<sannegrinovero> elmervc, are you blocked by this? the workaround is
very simple
<sannegrinovero> generally, I'm wondering if we can avoid having to
expose the DirectoryProviders. I would want them gone from the public
API, but of course limitations like this are not acceptable.
<elmervc> sannegrinovero, I'm branching this migration, so it's not
really blocking. But I would like to try the new H core/search, so for
that to work I need access to the subset of indices
<elmervc> What workaround were you thinking about ?
<elmervc> Just construct an index reader/FSDirs myself using 'hardcoded'
paths ?
<sannegrinovero> nono that's ugly..
<sannegrinovero> elmervc, all logic to open this IR is in
org.hibernate.search.impl.ImmutableSearchFactory.openIndexReader(Class<?>...)
<sannegrinovero> elmervc, and it's just  a couple of lines to change ;)
<sannegrinovero> the problem is more how to make it easy to consume
<elmervc> Ok, I'll look into that :)
<elmervc> Using filters is not a good idea?
<sannegrinovero> yes I liked your suggestion. but is it enough ?
<sannegrinovero> and what would the methods look like?
<sannegrinovero> (i.e. the signature)
<elmervc> SearchFactory.openIndexReader(Class<?> c,
FullTextFilterImplementor[] filters) , or what do you mean?
<sannegrinovero> I'd prefer SearchFactory.openIndexReader(Class<?> c,
FullTextFilterImplementor... filters)
<elmervc> But I'm not sure if this covers all use cases of sharding
<sannegrinovero> elmervc, the methods don't necessarily need to be
defined on the SearchFactory. We can think of something like
searchFactory.indexReaders().withShardingOptions( X, Y
).includeType(Class<?> z).openIndexReader() .. how does that look?
<sannegrinovero> I'm just tossing out some ideas, but then we should
bring this up to the mailing list.
<elmervc> the .includeType , do you mean that multiple classes can be
included?
<sannegrinovero> yes
<sannegrinovero> basically the indexReaders() method would open a
context, private to this invocation chain only. (i.e. not affecting
other threads invoking .indexReaders() )
<elmervc> Sounds cool. But then we need to think about
the .withShardingOptions, or something similar. For transparency it's
best to have something similar to the methods in the ShardingStrategy
interface
<elmervc> Or something similar to what is done @ querytime, i.e.
FullTextFilterImplementors
<elmervc> The point is, we need to know what other use cases one might
have
<elmervc> That's related to how sharding is done, i.e. ... might be a
field in the doc , full text filter, ...
<elmervc> (doc = doc to be added)
<sannegrinovero> yes exactly I need use cases to understand this, that's
why your feedback is very much appreciated :)
<elmervc> sannegrinovero, For example, our sharding strategy is based on
some field in an entity that is added to the Lucene Document (actually,
it has a @Field anno, and this field is removed from the Lucene Document
in the shardingstrategy.getDirectoryProviderForAddition(...))
<sannegrinovero> elmervc, lol that proves another discussion I had
recently in proposing that we should pass the entity instance and not
the document to the sharding strategy.
<elmervc> It might be useful indeed, but in our case it's easier to use
a Field in the doc, because that field will always have the same name,
i.e. we can reuse the same sharding strategy.
<sannegrinovero> elmervc, this discussion is very interesting but I'm
busy in other chats now which I can't postpone. Could you please
synthesize this and send a mail to the developer list?