[hibernate-dev] [Search] Sharding and access to (subsets) of index readers and Lucene directories in HS 4.0

Thu Aug 25 08:48:28 EDT 2011

> Current signature is not accepting the FullTextFilterImplementor, but
> accepts multiple classes:
> SearchFactory.openIndexReader(Class<?>... entities);
>
> Since we can't use two varargs on the same method, this won't work
> unless you're suggesting that we should support a single type only.
No, that's not what I meant. Maybe a ShardingOptions argument would be
useful?
That is actually another requirement: access to an IndexReader for
multiple classes/index names, applying the same 'sharding options' to all
of them.
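
To make the ShardingOptions idea concrete, here is a rough sketch of how a
single options argument would sidestep the two-varargs restriction. All
names in it (ShardingOptions, ReaderRequest, describeReaderRequest) are
hypothetical and not part of Hibernate Search. It only records what a
reader request would cover and opens no real IndexReader:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Hibernate Search. */
final class ShardingOptionsSketch {

    /** Hypothetical value object holding the String keys a sharding strategy would interpret. */
    static final class ShardingOptions {
        private final List<String> keys;

        private ShardingOptions(List<String> keys) {
            this.keys = keys;
        }

        // The varargs for the sharding keys lives here, not on the reader method.
        static ShardingOptions of(String... keys) {
            return new ShardingOptions(
                    Collections.unmodifiableList(Arrays.asList(keys.clone())));
        }

        List<String> keys() {
            return keys;
        }
    }

    /** Hypothetical record of what a reader request covers. */
    static final class ReaderRequest {
        final ShardingOptions options;
        final List<Class<?>> entities;

        ReaderRequest(ShardingOptions options, List<Class<?>> entities) {
            this.options = options;
            this.entities = entities;
        }
    }

    // One fixed parameter followed by a single varargs parameter is legal Java,
    // and the same options apply to every entity type that is passed in.
    static ReaderRequest describeReaderRequest(ShardingOptions options, Class<?>... entities) {
        return new ReaderRequest(options, Arrays.asList(entities.clone()));
    }
}
```

A call would then look like
describeReaderRequest(ShardingOptions.of("customer-42"), A.class, B.class),
where the key "customer-42" is an invented example value.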

> I'd add another option:
> 
> C - SearchFactory.openIndexReader(String... indexName);
> 
> This is simple, but it is in no way delegating to the ShardingStrategy
> to make the index names choice which I think would be way more usable.

+1. Q: Is there a difference between the readers returned by:
SearchFactory.openIndexReader("A");
SearchFactory.openIndexReader(SubA.class);
when A and SubA (a subclass of A) both have the bare @Indexed annotation
and thus share the same index name? That is, will
SearchFactory.openIndexReader(SubA.class) perform some filtering to
return only docs from SubA entities?

> Why would you need direct access to a Directory? isn't it enough to
> provide access to the IndexReader ?
No need for that, my mistake.

So requirements become:

---------
*1 Have access to IndexReader for one class
*2 Have access to IndexReader with a subset of IndexManagers based on
sharding strategy.
*3 Have access to IndexReader for multiple classes/indexnames applying
the same 'sharding options' for all of them.
---------
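
As a rough illustration of how the chained style (idea B in the quoted mail
below) could cover these three requirements, here is a minimal sketch.
Everything in it is an assumption for discussion only: IndexNameResolver,
Selection, selectedIndexNames() and the "<SimpleName>.<key>" naming scheme
are invented, and the sketch returns index names instead of opening a real
IndexReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the chained style; not the Hibernate Search API. */
final class IndexReaderSelectionSketch {

    /** Hypothetical strategy hook: maps an entity type plus String sharding keys to index names. */
    interface IndexNameResolver {
        Set<String> resolve(Class<?> entityType, List<String> shardingKeys);
    }

    /** Context private to one invocation chain, so concurrent callers share no state. */
    static final class Selection {
        private final IndexNameResolver resolver;
        private final List<String> shardingKeys = new ArrayList<>();
        private final List<Class<?>> types = new ArrayList<>();

        Selection(IndexNameResolver resolver) {
            this.resolver = resolver;
        }

        Selection withShardingOptions(String... keys) {
            shardingKeys.addAll(Arrays.asList(keys));
            return this;
        }

        Selection includeType(Class<?> type) {
            types.add(type);
            return this;
        }

        /** Stands in for openIndexReader(): the index names a reader would span. */
        Set<String> selectedIndexNames() {
            Set<String> names = new LinkedHashSet<>();
            for (Class<?> type : types) {
                // The same sharding keys are applied to every included type.
                names.addAll(resolver.resolve(type, shardingKeys));
            }
            return names;
        }
    }

    static Selection indexReaders(IndexNameResolver resolver) {
        return new Selection(resolver);
    }

    /** Example resolver: one shard per key, named "<SimpleName>.<key>"; no keys means one index. */
    static final IndexNameResolver PER_KEY = (type, keys) -> {
        Set<String> names = new LinkedHashSet<>();
        if (keys.isEmpty()) {
            names.add(type.getSimpleName());
        }
        for (String key : keys) {
            names.add(type.getSimpleName() + "." + key);
        }
        return names;
    };
}
```

Requirement 1 is the chain with a single includeType(...) and no sharding
options, requirement 2 adds withShardingOptions(...), and requirement 3 is
the same chain with several includeType(...) calls.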

As Emmanuel mentioned, can we think of use cases where we would like to
have access to Lucene Directories (/IndexManagers)? Such access is
currently mentioned in the docs:
http://docs.jboss.org/hibernate/search/4.0/reference/en-US/html_single/#d0e6658

Elmer

On Thu, 2011-08-25 at 13:37 +0200, Sanne Grinovero wrote:
> 2011/8/25 Elmer van Chastelet <evanchastelet at gmail.com>:
> > Hi all,
> >
> > Yesterday I had a discussion with Sanne on irc [3] about the new api to
> > access index readers in HS4.0. We couldn't complete our discussion
> > yesterday, so let's continue here.
> > As explained in the forum [1], there is currently no good solution for
> > getting a reader with a subset of the indexes in a sharded environment.
> >
> > Currently two basic ideas came to mind:
> > A - Have a SearchFactory.openIndexReader(Class<?> c,
> > FullTextFilterImplementor...): This is similar to how the IndexManagers
> > are gathered at query time, and is therefore probably easy to understand.
> 
> Current signature is not accepting the FullTextFilterImplementor, but
> accepts multiple classes:
> SearchFactory.openIndexReader(Class<?>... entities);
> 
> Since we can't use two varargs on the same method, this won't work
> unless you're suggesting that we should support a single type only.
> 
> >
> > B - (to be further reviewed) Have something like
> > searchFactory.indexReaders().withShardingOptions( X, Y
> > ).includeType(Class<?> z).openIndexReader(). This also adds the ability
> > to get an IndexReader for multiple classes. But we need to think about
> > the .withShardingOptions (or something similar), what input should we
> > support here? Sharding properties are mostly based on some entity
> > property(/ies), probably easy to encode as a String. The (custom)
> > sharding strategy may use such String to select the proper index
> > managers. Using a String object for identifying which index managers to
> > use looks fine to me. It will be compatible with current implementation
> > of custom sharding strategies where one might use the Lucene document at
> > addition time, or if an entity instance will also be passed (see
> > discussion [2]), the properties of that entity can probably be encoded
> > to some String. And if HS will cover the mapping/have support for Strings
> > as identifiers for sharding instead of a user defined mapping of the
> > index (integer) in the array of IndexManagers, that would be awesome :)
> > (Relieves the pain of having some mapping that should be stored
> > somewhere, which I currently do).
> 
> FullTextFilterImplementor could work as a withShardingOptions
> parameter, but while it's true that in the end it's going to filter
> some values, I wonder if the concept of Filters is misleading in this
> case.
> 
> > Still, we need to know the use cases there might be, i.e. which
> > flexibility the API should offer.
> 
> I'd add another option:
> 
> C - SearchFactory.openIndexReader(String... indexName);
> 
> This is simple, but it is in no way delegating to the ShardingStrategy
> to make the index names choice which I think would be way more usable.
> 
> > As is also mentioned in [1], there is currently no direct access to the
> > index managers, so getting a FSDirectory is currently not possible in
> > 4.0alpha1. I think HS should support this to offer the flexibility to
> > work on the Lucene indexes directly (for example, to build an auto
> > completion/spell check index from an existing index)
> 
> Why would you need direct access to a Directory? isn't it enough to
> provide access to the IndexReader ?
> 
> >
> > Let's start by setting up some requirements?
> > ---------
> > *1 Have access to IndexReader for one class
> > *2 Have access to IndexReader with a subset of IndexManagers based on
> > sharding strategy. Sharding strategies are mostly based on some
> > propert(y/ies) of an entity instance, which can likely be encoded to
> > some String.
> > *3 Have access to index directories (FSDirectory/...). Unlike previous
> > versions (< HS4.0) it would be nice if this uses the ShardingStrategy
> > instance in use, so mapping is completely and exclusively done in a
> > ShardingStrategy
> 
> We can't provide access to a "virtual" Directory exposing the contents
> of multiple Directories, that's possible with an IndexReader only.
> 
> > * ...
> > ---------
> >
> > Please extend/modify the list of requirements if you think something is
> > missing/incorrect and drop your ideas/thoughts about the mentioned
> > ideas.
> >
> >
> > Elmer
> >
> >
> >
> > [1] https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
> > [2]
> > http://www.mailinglistarchive.com/html/hibernate-dev@lists.jboss.org/2011-08/msg00091.html
> > [3] IRC log:
> >
> > <elmervc> sannegrinovero, have you read/did you have time to think about
> > https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
> > <sannegrinovero> hi elmervc , yes I've read it. my next thing on the
> > todo is to make some prototype, as I'm not happy with the current ideas:
> > <sannegrinovero> elmervc, are you blocked by this? the workaround is
> > very simple
> > <sannegrinovero> generally, I'm wondering if we can avoid having to
> > expose the DirectoryProviders. I would want them gone from the public
> > API, but of course limitations like this are not acceptable.
> > <elmervc> sannegrinovero, I'm branching this migration, so it's not
> > really blocking. But I would like to try the new H core/search, so for
> > that to work I need access to the subset of indices
> > <elmervc> What workaround were you thinking about ?
> > <elmervc> Just construct an index reader/FSDirs myself using 'hardcoded'
> > paths ?
> > <sannegrinovero> nono that's ugly..
> > <sannegrinovero> elmervc, all logic to open this IR is in
> > org.hibernate.search.impl.ImmutableSearchFactory.openIndexReader(Class<?>...)
> > <sannegrinovero> elmervc, and it's just  a couple of lines to change ;)
> > <sannegrinovero> the problem is more how to make it easy to consume
> > <elmervc> Ok, I'll look into that :)
> > <elmervc> Using filters is not a good idea?
> > <sannegrinovero> yes I liked your suggestion. but is it enough ?
> > <sannegrinovero> and how would the methods look like?
> > <sannegrinovero> (i.e. the signature)
> > <elmervc> SearchFactory.openIndexReader(Class<?> c,
> > FullTextFilterImplementor[] filters) , or what do you mean?
> > <sannegrinovero> I'd prefer SearchFactory.openIndexReader(Class<?> c,
> > FullTextFilterImplementor... filters)
> > <elmervc> But I'm not sure if this covers all use cases of sharding
> > <sannegrinovero> elmervc, the methods don't need necessarily be defined
> > on the SearchFactory. We can think of something like
> > searchFactory.indexReaders().withShardingOptions( X, Y
> > ).includeType(Class<?> z).openIndexReader() .. how does that look like?
> > <sannegrinovero> I'm just tossing out some ideas, but then we should
> > bring this up to the mailing list.
> > <elmervc> the .includeType , do you mean that multiple classes can be
> > included?
> > <sannegrinovero> yes
> > <sannegrinovero> basically the indexReaders() method would open a
> > context, private to this invocation chain only. (i.e. not affecting
> > other threads invoking .indexReaders() )
> > <elmervc> Sounds cool. But then we need to think about
> > the .withShardingOptions, or something similar. For transparency it's
> > best to have something similar to the methods in the ShardingStrategy
> > interface
> > <elmervc> Or something similar to what is done @ querytime, i.e.
> > FullTextFilterImplementors
> > <elmervc> The point is, we need to know what other use cases one might
> > have
> > <elmervc> That's related to how sharding is done, i.e. ... might be a
> > field in the doc , full text filter, ...
> > <elmervc> (doc = doc to be added)
> > <sannegrinovero> yes exactly I need use cases to understand this, that's
> > why your feedback is very much appreciated :)
> > <elmervc> sannegrinovero, For example, our sharding strategy is based on
> > some field in an entity that is added to the Lucene Document (actually,
> > it has a @Field anno, and this field is removed from the Lucene Document
> > in the shardingstrategy.getDirectoryProviderForAddition(...))
> > <sannegrinovero> elmervc, lol that proves another discussion I had
> > recently in proposing that we should pass the entity instance and not
> > the document to the sharding strategy.
> > <elmervc> It might be useful indeed, but in our case it's easier to use
> > a Field in the doc, because that field will always have the same name,
> > i.e. we can reuse the same sharding strategy.
> > <sannegrinovero> elmervc, this discussion is very interesting but I'm
> > busy in other chats now which I can't postpone. Could you please
> > synthesize this and send a mail to the developer list?
> >