Re: [hibernate-dev] [Search] Sharding and access to (subsets) of index readers and Lucene directories in HS 4.0

Thursday, 25 August 2011

2011/8/25 Elmer van Chastelet <evanchastelet(a)gmail.com>:
...
 Hi all,

 Yesterday I had a discussion with Sanne on irc [3] about the new api to
 access index readers in HS4.0. We couldn't complete our discussion
 yesterday, so let's continue here.
 As explained in the forum [1], there is currently no good solution for
 getting a reader with a subset of the indexes in a sharded environment.

 Currently two basic ideas came to mind:
 A - Have a SearchFactory.openIndexReader(Class<?> c,
 FullTextFilterImplementor...): This is similar to how the IndexManager's
 are gathered at query time, and is probably therefore easy to understand 
The current signature does not accept a FullTextFilterImplementor; it
accepts multiple classes instead:
SearchFactory.openIndexReader(Class<?>... entities);

Since we can't use two varargs on the same method, this won't work
unless you're suggesting that we should support a single type only.
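
To illustrate the constraint: Java allows only one variable-arity parameter per method, and it has to be the last one. Below is a minimal sketch of two signatures that would compile. The names are hypothetical: `FilterStub` merely stands in for FullTextFilterImplementor, and neither `open` method exists in the current API. The methods return a description of what was requested rather than a real IndexReader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FullTextFilterImplementor; for illustration only.
final class FilterStub {
    final String name;

    FilterStub(String name) {
        this.name = name;
    }
}

final class VarargsSketch {
    // Would not compile: only one varargs parameter is allowed, and it must be last.
    // static List<String> open(Class<?>... entities, FilterStub... filters)

    // Variant 1: a single entity type, with the filters as varargs.
    static List<String> open(Class<?> entity, FilterStub... filters) {
        return open(new Class<?>[] { entity }, filters);
    }

    // Variant 2: entity types as an explicit array, with the filters as varargs.
    static List<String> open(Class<?>[] entities, FilterStub... filters) {
        List<String> description = new ArrayList<>();
        for (Class<?> entity : entities) {
            description.add("type:" + entity.getSimpleName());
        }
        for (FilterStub filter : filters) {
            description.add("filter:" + filter.name);
        }
        return description;
    }
}
```

Variant 2 keeps support for multiple types, at the cost of a clumsier call site (`new Class<?>[] { A.class, B.class }`).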

...

 B - (to be further reviewed) Have something like
 searchFactory.indexReaders().withShardingOptions( X, Y
 ).includeType(Class<?> z).openIndexReader(). This also adds the ability
 to get an IndexReader for multiple classes. But we need to think about
 the .withShardingOptions (or something similar), what input should we
 support here? Sharding properties are mostly based on some entity
 property(/ies), probably easy to encode as a String. The (custom)
 sharding strategy may use such String to select the proper index
 managers. Using a String object for identifying which index managers to
 use looks fine to me. It will be compatible with current implementation
 of custom sharding strategies where one might use the Lucene document at
 addition time, or if an entity instance will also be passed (see
 discussion [2]), the properties of that entity can probably be encoded to
 some String. And if HS will cover the mapping/have support for Strings
 as identifiers for sharding instead of a user defined mapping of the
 index (integer) in the array of IndexManagers, that would be awesome :)
 (Relieves the pain of having some mapping that should be stored
 somewhere, which I currently do). 
FullTextFilterImplementor could work as a withShardingOptions
parameter, but while it's true that in the end it's going to filter
some values, I wonder if the concept of Filters is misleading in this
case.
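
To make option B easier to picture, here is a minimal, self-contained sketch of such a fluent chain. Everything in it is an assumption for illustration: `IndexReaderContext`, `StringShardingStrategy` and `SearchFactorySketch` are hypothetical names, not existing Hibernate Search types, and the terminal method returns the selected index names where a real implementation would open an IndexReader over them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical: maps an entity type plus String shard identifiers to index names.
interface StringShardingStrategy {
    List<String> indexNamesFor(Class<?> type, Set<String> shardIds);
}

// Hypothetical context, private to one invocation chain.
final class IndexReaderContext {
    private final StringShardingStrategy strategy;
    private final Set<String> shardIds = new LinkedHashSet<>();
    private final Set<Class<?>> types = new LinkedHashSet<>();

    IndexReaderContext(StringShardingStrategy strategy) {
        this.strategy = strategy;
    }

    IndexReaderContext withShardingOptions(String... ids) {
        for (String id : ids) {
            shardIds.add(id);
        }
        return this;
    }

    IndexReaderContext includeType(Class<?> type) {
        types.add(type);
        return this;
    }

    // A real implementation would open an IndexReader here; the sketch only
    // reports which indexes the strategy selected.
    List<String> selectedIndexNames() {
        List<String> names = new ArrayList<>();
        for (Class<?> type : types) {
            names.addAll(strategy.indexNamesFor(type, shardIds));
        }
        return names;
    }
}

final class SearchFactorySketch {
    private final StringShardingStrategy strategy;

    SearchFactorySketch(StringShardingStrategy strategy) {
        this.strategy = strategy;
    }

    // Each call returns a fresh context, so chains on other threads are unaffected.
    IndexReaderContext indexReaders() {
        return new IndexReaderContext(strategy);
    }
}
```

In this sketch the String identifiers are handed to the strategy unchanged, so the mapping from identifier to index stays entirely inside the sharding strategy.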

...
 Still, we need to know the use cases there might be, i.e. which
 flexibility the API should offer. 
I'd add another option:

C - SearchFactory.openIndexReader(String... indexName);

This is simple, but it does not delegate the choice of index names to
the ShardingStrategy, which I think would be far more usable.
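
A rough sketch of what option C implies for the caller, under the assumption that indexes are addressed by plain String names. `NamedIndexSketch` is a hypothetical class, and `resolve` returns the validated names where the real method would return an IndexReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical registry standing in for the set of configured indexes.
final class NamedIndexSketch {
    private final Set<String> configured;

    NamedIndexSketch(String... configuredIndexNames) {
        this.configured = new LinkedHashSet<>(Arrays.asList(configuredIndexNames));
    }

    // Same shape as openIndexReader(String... indexName): the caller must
    // already know the exact index names; no sharding strategy is consulted.
    List<String> resolve(String... indexNames) {
        List<String> resolved = new ArrayList<>();
        for (String name : indexNames) {
            if (!configured.contains(name)) {
                throw new IllegalArgumentException("Unknown index: " + name);
            }
            resolved.add(name);
        }
        return resolved;
    }
}
```

The knowledge of which shard lives under which name ends up in application code, which is the usability concern raised above.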

...
 As is also mentioned in [1], there is currently no direct access to the
 index managers, so getting a FSDirectory is currently not possible in
 4.0alpha1. I think HS should support this to offer the flexibility to
 work on the Lucene indexes directly (for example, to build an auto
 completion/spell check index from an existing index) 
Why would you need direct access to a Directory? Isn't it enough to
provide access to the IndexReader?

...

 Let's start by setting up some requirements?
 ---------
 *1 Have access to IndexReader for one class
 *2 Have access to IndexReader with a subset of IndexManagers based on
 sharding strategy. Sharding strategies are mostly based on some
 propert(y/ies) of an entity instance, which can likely be encoded to
 some String.
 *3 Have access to index directories (FSDirectory/...). Unlike previous
 versions (< HS4.0) it would be nice if this uses the ShardingStrategy
 instance in use, so mapping is completely and exclusively done in a
 ShardingStrategy 
We can't provide access to a "virtual" Directory exposing the contents
of multiple Directories; that is only possible with an IndexReader.

...
 * ...
 ---------

 Please extend/modify the list of requirements if you think something is
 missing/incorrect and drop your ideas/thoughts about the mentioned
 ideas.

 Elmer

 [1] https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
 [2] http://www.mailinglistarchive.com/html/hibernate-dev@lists.jboss.org/2011...
 [3] IRC log:

 <elmervc> sannegrinovero, have you read/did you have time to think about
 https://forum.hibernate.org/viewtopic.php?p=2448000#p2448000
 <sannegrinovero> hi elmervc , yes I've read it. my next thing on the
 todo is to make some prototype, as I'm not happy with the current ideas:
 <sannegrinovero> elmervc, are you blocked by this? the workaround is
 very simple
 <sannegrinovero> generally, I'm wondering if we can avoid having to
 expose the DirectoryProviders. I would want them gone from the public
 API, but of course limitations like this are not acceptable.
 <elmervc> sannegrinovero, I'm branching this migration, so it's not
 really blocking. But I would like to try the new H core/search, so for
 that to work I need access to the subset of indices
 <elmervc> What workaround were you thinking about ?
 <elmervc> Just construct an index reader/FSDirs myself using 'hardcoded'
 paths ?
 <sannegrinovero> nono that's ugly..
 <sannegrinovero> elmervc, all logic to open this IR is in
 org.hibernate.search.impl.ImmutableSearchFactory.openIndexReader(Class<?>...)
 <sannegrinovero> elmervc, and it's just  a couple of lines to change ;)
 <sannegrinovero> the problem is more how to make it easy to consume
 <elmervc> Ok, I'll look into that :)
 <elmervc> Using filters is not a good idea?
 <sannegrinovero> yes I liked your suggestion. but is it enough ?
 <sannegrinovero> and how would the methods look like?
 <sannegrinovero> (i.e. the signature)
 <elmervc> SearchFactory.openIndexReader(Class<?> c,
 FullTextFilterImplementor[] filters) , or what do you mean?
 <sannegrinovero> I'd prefer SearchFactory.openIndexReader(Class<?> c,
 FullTextFilterImplementor... filters)
 <elmervc> But I'm not sure if this covers all use cases of sharding
 <sannegrinovero> elmervc, the methods don't need necessarily be defined
 on the SearchFactory. We can think of something like
 searchFactory.indexReaders().withShardingOptions( X, Y
 ).includeType(Class<?> z).openIndexReader() .. how does that look like?
 <sannegrinovero> I'm just tossing out some ideas, but then we should
 bring this up to the mailing list.
 <elmervc> the .includeType , do you mean that multiple classes can be
 included?
 <sannegrinovero> yes
 <sannegrinovero> basically the indexReaders() method would open a
 context, private to this invocation chain only. (i.e. not affecting
 other threads invoking .indexReaders() )
 <elmervc> Sounds cool. But then we need to think about
 the .withShardingOptions, or something similar. For transparency it's
 best to have something similar to the methods in the ShardingStrategy
 interface
 <elmervc> Or something similar to what is done @ querytime, i.e.
 FullTextFilterImplementors
 <elmervc> The point is, we need to know what other use cases one might
 have
 <elmervc> That's related to how sharding is done, i.e. ... might be a
 field in the doc , full text filter, ...
 <elmervc> (doc = doc to be added)
 <sannegrinovero> yes exactly I need use cases to understand this, that's
 why your feedback is very much appreciated :)
 <elmervc> sannegrinovero, For example, our sharding strategy is based on
 some field in an entity that is added to the Lucene Document (actually,
 it has a @Field anno, and this field is removed from the Lucene Document
 in the shardingstrategy.getDirectoryProviderForAddition(...))
 <sannegrinovero> elmervc, lol that proves another discussion I had
 recently in proposing that we should pass the entity instance and not
 the document to the sharding strategy.
 <elmervc> It might be useful indeed, but in our case it's easier to use
 a Field in the doc, because that field will always have the same name,
 i.e. we can reuse the same sharding strategy.
 <sannegrinovero> elmervc, this discussion is very interesting but I'm
 busy in other chats now which I can't postpone. Could you please
 synthesize this and send a mail to the developer list?

 _______________________________________________
 hibernate-dev mailing list
 hibernate-dev(a)lists.jboss.org
 https://lists.jboss.org/mailman/listinfo/hibernate-dev
