[hibernate-dev] [Search] Sharding and access to (subsets) of index readers and Lucene directories in HS 4.0

Sanne Grinovero sanne at hibernate.org
Thu Aug 25 07:37:41 EDT 2011


2011/8/25 Elmer van Chastelet <evanchastelet at gmail.com>:
> Hi all,
>
> Yesterday I had a discussion with Sanne on irc [3] about the new api to
> access index readers in HS4.0. We couldn't complete our discussion
> yesterday, so let's continue here.
> As explained in the forum [1], there is currently no good solution for
> getting a reader with a subset of the indexes in a sharded environment.
>
> Currently two basic ideas came to mind:
> A - Have a SearchFactory.openIndexReader(Class<?> c,
> FullTextFilterImplementor...): This is similar to how the IndexManagers
> are gathered at query time, and is therefore probably easy to understand

The current signature does not accept a FullTextFilterImplementor, but
it does accept multiple classes:
SearchFactory.openIndexReader(Class<?>... entities);

Since Java allows only one varargs parameter per method, and it must
come last, this won't work unless you're suggesting that we support a
single type only.
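
To make the restriction concrete, here is a minimal sketch in plain
Java. ShardFilter and ReaderSignatures are made-up stand-ins (they are
not the Hibernate Search types), and the methods only return a
description of what they were asked for instead of opening a reader.
The two legal shapes I can think of are a single type plus varargs
filters, or an explicit array of types plus varargs filters:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FullTextFilterImplementor; not the real type.
interface ShardFilter {
    String name();
}

// Hypothetical holder for the candidate signatures.
class ReaderSignatures {

    // Not legal Java: a method may declare only one varargs parameter,
    // and it must be the last one.
    // static String open(Class<?>... entities, ShardFilter... filters) { ... }

    // Shape 1: a single entity type followed by varargs filters.
    static String open(Class<?> entity, ShardFilter... filters) {
        return open(new Class<?>[] { entity }, filters);
    }

    // Shape 2: an explicit array of entity types followed by varargs filters.
    static String open(Class<?>[] entities, ShardFilter... filters) {
        List<String> types = new ArrayList<>();
        for (Class<?> c : entities) {
            types.add(c.getSimpleName());
        }
        List<String> names = new ArrayList<>();
        for (ShardFilter f : filters) {
            names.add(f.name());
        }
        // A real implementation would select IndexManagers here.
        return "types=" + types + " filters=" + names;
    }
}
```

With this, open(String.class, someFilter) and
open(new Class<?>[] { A.class, B.class }, someFilter) both compile, at
the cost of a clumsier call site when several types are involved.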

>
> B - (to be further reviewed) Have something like
> searchFactory.indexReaders().withShardingOptions( X, Y
> ).includeType(Class<?> z).openIndexReader(). This also adds the ability
> to get an IndexReader for multiple classes. But we need to think about
> the .withShardingOptions (or something similar), what input should we
> support here? Sharding properties are mostly based on some entity
> property(/ies), probably easy to encode as a String. The (custom)
> sharding strategy may use such String to select the proper index
> managers. Using a String object for identifying which index managers to
> use looks fine to me. It will be compatible with current implementation
> of custom sharding strategies where one might use the Lucene document at
> addition time, or if an entity instance will also be passed (see
> discussion [2]), the properties of that entity can probably be encoded to
> some String. And if HS will cover the mapping/have support for Strings
> as identifiers for sharding instead of a user defined mapping of the
> index (integer) in the array of IndexManagers, that would be awesome :)
> (Relieves the pain of having some mapping that should be stored
> somewhere, which I currently do).

FullTextFilterImplementor could work as a withShardingOptions
parameter, but while it's true that in the end it's going to filter
some values, I wonder if the concept of Filters is misleading in this
case.
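
For discussion, a rough sketch of how the chain in B could hang
together, using String options as Elmer suggests. All names here
(ShardResolver, IndexReaderContext, SketchSearchFactory,
selectedIndexNames) are invented for illustration and nothing like this
exists in the code base today. Instead of opening a reader, the last
step just returns the index names that a real implementation would
open. Each call to indexReaders() creates a fresh context object, so
the state is private to that invocation chain:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical: maps an entity type plus opaque String options to index names.
interface ShardResolver {
    List<String> indexNamesFor(Class<?> type, List<String> options);
}

// Hypothetical per-invocation context; every chain gets its own instance.
class IndexReaderContext {
    private final ShardResolver resolver;
    private final List<String> options = new ArrayList<>();
    private final List<Class<?>> types = new ArrayList<>();

    IndexReaderContext(ShardResolver resolver) {
        this.resolver = resolver;
    }

    IndexReaderContext withShardingOptions(String... opts) {
        options.addAll(Arrays.asList(opts));
        return this;
    }

    IndexReaderContext includeType(Class<?> type) {
        types.add(type);
        return this;
    }

    // Stand-in for openIndexReader(): collects the index names chosen
    // by the resolver for every included type.
    List<String> selectedIndexNames() {
        List<String> names = new ArrayList<>();
        for (Class<?> type : types) {
            names.addAll(resolver.indexNamesFor(type, options));
        }
        return names;
    }
}

// Hypothetical entry point playing the role of the SearchFactory.
class SketchSearchFactory {
    private final ShardResolver resolver;

    SketchSearchFactory(ShardResolver resolver) {
        this.resolver = resolver;
    }

    IndexReaderContext indexReaders() {
        return new IndexReaderContext(resolver);
    }
}
```

A call would then read
factory.indexReaders().withShardingOptions("X", "Y").includeType(Book.class).selectedIndexNames(),
where Book is just an example entity. The open question remains what
the options should be: plain Strings as above, filters, or something
else.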

> Still, we need to know the use cases there might be, i.e. which
> flexibility the API should offer.

I'd add another option:

C - SearchFactory.openIndexReader(String... indexName);

This is simple, but it does not delegate the choice of index names to
the ShardingStrategy, which I think would be far more usable.
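
A small sketch of the difference, again with invented names
(NamedIndexes, IndexNameStrategy) and with plain Strings standing in
for the readers. With C alone the caller has to know the exact index
names; a strategy hook that turns a shard key into names would keep
that mapping in one place:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry of index names; the String values stand in for readers.
class NamedIndexes {
    private final Map<String, String> readersByName = new LinkedHashMap<>();

    void register(String indexName) {
        readersByName.put(indexName, "reader(" + indexName + ")");
    }

    // Option C: the caller must already know the exact index names.
    List<String> openIndexReader(String... indexNames) {
        List<String> opened = new ArrayList<>();
        for (String name : indexNames) {
            String reader = readersByName.get(name);
            if (reader == null) {
                throw new IllegalArgumentException("Unknown index: " + name);
            }
            opened.add(reader);
        }
        return opened;
    }
}

// Hypothetical hook: the sharding strategy, not the caller, picks the names.
interface IndexNameStrategy {
    String[] indexNamesFor(String shardKey);
}
```

A caller could then write
indexes.openIndexReader(strategy.indexNamesFor("nl")) and never spell
out an index name, which is the delegation I find missing from C as
stated.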

> As is also mentioned in [1], there is currently no direct access to the
> index managers, so getting a FSDirectory is currently not possible in
> 4.0alpha1. I think HS should support this to offer the flexibility to
> work on the Lucene indexes directly (for example, to build an auto
> completion/spell check index from an existing index)

Why would you need direct access to a Directory? Isn't it enough to
provide access to the IndexReader?

>
> Let's start by setting up some requirements?
> ---------
> *1 Have access to IndexReader for one class
> *2 Have access to IndexReader with a subset of IndexManagers based on
> sharding strategy. Sharding strategies are mostly based on some
> propert(y/ies) of an entity instance, which can likely be encoded to
> some String.
> *3 Have access to index directories (FSDirectory/...). Unlike previous
> versions (< HS4.0) it would be nice if this uses the ShardingStrategy
> instance in use, so mapping is completely and exclusively done in a
> ShardingStrategy

We can't provide access to a "virtual" Directory exposing the contents
of multiple Directories; that's possible with an IndexReader only.

> * ...
> ---------
>
> Please extend/modify the list of requirements if you think something is
> missing/incorrect and drop your ideas/thoughts about the mentioned
> ideas.
>
>
> Elmer
>
>
>
> [1] https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
> [2]
> http://www.mailinglistarchive.com/html/hibernate-dev@lists.jboss.org/2011-08/msg00091.html
> [3] IRC log:
>
> <elmervc> sannegrinovero, have you read/did you have time to think about
> https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
> <sannegrinovero> hi elmervc , yes I've read it. my next thing on the
> todo is to make some prototype, as I'm not happy with the current ideas:
> <sannegrinovero> elmervc, are you blocked by this? the workaround is
> very simple
> <sannegrinovero> generally, I'm wondering if we can avoid having to
> expose the DirectoryProviders. I would want them gone from the public
> API, but of course limitations like this are not acceptable.
> <elmervc> sannegrinovero, I'm branching this migration, so it's not
> really blocking. But I would like to try the new H core/search, so for
> that to work I need access to the subset of indices
> <elmervc> What workaround were you thinking about ?
> <elmervc> Just construct an index reader/FSDirs myself using 'hardcoded'
> paths ?
> <sannegrinovero> nono that's ugly..
> <sannegrinovero> elmervc, all logic to open this IR is in
> org.hibernate.search.impl.ImmutableSearchFactory.openIndexReader(Class<?>...)
> <sannegrinovero> elmervc, and it's just  a couple of lines to change ;)
> <sannegrinovero> the problem is more how to make it easy to consume
> <elmervc> Ok, I'll look into that :)
> <elmervc> Using filters is not a good idea?
> <sannegrinovero> yes I liked your suggestion. but is it enough ?
> <sannegrinovero> and what would the methods look like?
> <sannegrinovero> (i.e. the signature)
> <elmervc> SearchFactory.openIndexReader(Class<?> c,
> FullTextFilterImplementor[] filters) , or what do you mean?
> <sannegrinovero> I'd prefer SearchFactory.openIndexReader(Class<?> c,
> FullTextFilterImplementor... filters)
> <elmervc> But I'm not sure if this covers all use cases of sharding
> <sannegrinovero> elmervc, the methods don't necessarily need to be defined
> on the SearchFactory. We can think of something like
> searchFactory.indexReaders().withShardingOptions( X, Y
> ).includeType(Class<?> z).openIndexReader() .. how does that look?
> <sannegrinovero> I'm just tossing out some ideas, but then we should
> bring this up to the mailing list.
> <elmervc> the .includeType , do you mean that multiple classes can be
> included?
> <sannegrinovero> yes
> <sannegrinovero> basically the indexReaders() method would open a
> context, private to this invocation chain only. (i.e. not affecting
> other threads invoking .indexReaders() )
> <elmervc> Sounds cool. But then we need to think about
> the .withShardingOptions, or something similar. For transparency it's
> best to have something similar to the methods in the ShardingStrategy
> interface
> <elmervc> Or something similar to what is done @ querytime, i.e.
> FullTextFilterImplementors
> <elmervc> The point is, we need to know what other use cases one might
> have
> <elmervc> That's related to how sharding is done, i.e. ... might be a
> field in the doc , full text filter, ...
> <elmervc> (doc = doc to be added)
> <sannegrinovero> yes exactly I need use cases to understand this, that's
> why your feedback is very much appreciated :)
> <elmervc> sannegrinovero, For example, our sharding strategy is based on
> some field in an entity that is added to the Lucene Document (actually,
> it has a @Field anno, and this field is removed from the Lucene Document
> in the shardingstrategy.getDirectoryProviderForAddition(...))
> <sannegrinovero> elmervc, lol that proves another discussion I had
> recently in proposing that we should pass the entity instance and not
> the document to the sharding strategy.
> <elmervc> It might be useful indeed, but in our case it's easier to use
> a Field in the doc, because that field will always have the same name,
> i.e. we can reuse the same sharding strategy.
> <sannegrinovero> elmervc, this discussion is very interesting but I'm
> busy in other chats now which I can't postpone. Could you please
> synthesize this and send a mail to the developer list?
>



