[hibernate-dev] HSearch: Using sharding and avoiding query on multiple shards

Sun Aug 3 13:04:51 EDT 2008

--
Emmanuel Bernard
http://in.relation.to/Bloggers/Emmanuel | http://blog.emmanuelbernard.com 
  | http://twitter.com/emmanuelbernard
Hibernate Search in Action (http://is.gd/Dl1)

On  Aug 3, 2008, at 09:15, Sanne Grinovero wrote:

> 2008/8/1 Emmanuel Bernard <emmanuel at hibernate.org>:
>>
>> On  Aug 1, 2008, at 13:42, Sanne Grinovero wrote:
>>
>>> Hello Emmanuel,
>>>
>>> 2008/7/31 Emmanuel Bernard <emmanuel at hibernate.org>:
>>>>
>>>> On  Jul 31, 2008, at 09:22, Sanne Grinovero wrote:
>>>>>
>>>>> about the API, wouldn't it make more sense to have it look like a
>>>>> filter?
>>>>
>>>> can you give more details?
>>>
>>> I was just thinking about the name
>>> fullTextQuery.setShardHint("Sony");
>>> I wouldn't call it a "hint" but a filter, as it could affect the
>>> results. A "hint" sounds like you are trying to improve performance
>>> in a way that shouldn't change the result, so:
>>>
>>> fullTextQuery.enableFullTextFilter("Sony")
>>>
>>> and it could differ from a normal FullTextFilter only by its
>>> concrete implementation.
>>> Just my 2 cents, as I think the effect is the same.
>>
>> Interesting concept and much more transparent. Not sure how easy it is
>> to do that though. A typical filter is cached per IndexReader. We
>> cannot do that for the "special" filter, as opening the index defeats
>> the purpose. Lucene filters are applied per IndexReader, so too late
>> in the game.
>
> You don't need to cache this, as it doesn't really contain the
> filtered data, so we can just avoid that. When opening the readers we
> could look at the enabled filters, and if there's one of this type we
> just affect the selection of indexes to really open (delegating to the
> sharding impl to make the right choice); no need to apply a real Lucene
> filter afterwards.
> (It should perform like a cached filter which survives even an index
> reopening, nice!)
> We could look at the filter types by name, and put them in separate
> containers at startup to avoid the type-checking at runtime.

It's worth trying a prototype. We should open a JIRA issue to capture
that.
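
A minimal sketch of what such a prototype could look like, to make the
proposal concrete. Everything here is hypothetical: ShardFilter and
ShardSelector are illustration-only names, not Hibernate Search or Lucene
types, and index names are plain strings standing in for whatever the
sharding implementation really opens. The point it shows is that the
"filter" only carries a key and is consulted before any reader is opened,
so there is nothing to cache per IndexReader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: a shard-selecting "filter" holds only a key, no filtered data.
final class ShardFilter {
    final String shardKey;

    ShardFilter(String shardKey) {
        this.shardKey = shardKey;
    }
}

// Hypothetical stand-in for the sharding implementation that decides
// which indexes get opened for a query.
final class ShardSelector {
    // Registration order is kept so the "open everything" case is stable.
    private final Map<String, String> indexByKey = new LinkedHashMap<String, String>();

    void register(String shardKey, String indexName) {
        indexByKey.put(shardKey, indexName);
    }

    // No filter enabled: open every shard, as a normal query would.
    // Filter enabled: open only the matching shard, or nothing at all if
    // the key is unknown (an assumption; no Lucene filter runs afterwards,
    // so falling back to all shards would return unfiltered results).
    List<String> indexesToOpen(ShardFilter enabled) {
        if (enabled == null) {
            return new ArrayList<String>(indexByKey.values());
        }
        List<String> selected = new ArrayList<String>();
        String index = indexByKey.get(enabled.shardKey);
        if (index != null) {
            selected.add(index);
        }
        return selected;
    }
}
```

With shards registered for "Sony" and "Apple", enabling a ShardFilter for
"Sony" selects only the Sony index, while passing no filter selects both.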

>
>
>>
>>>
>>> the feature looks great, but in my case I would need the ability in
>>> the ShardingStrategy to create new indexes; what do you think about
>>> that? I mean the size of the arrays could need to grow.
>>
>> Yes, that's a feature I thought about, but it means we will run into a
>> lot of concurrency issues (the HSearch config is all done at init time
>> today). If we do that, this needs to be well thought out, and I am not
>> sure how feasible it is.
> Yes, that's why I think we should move away from identifying the shards
> with an index number, and give them "names" or some other way to
> identify them.
> Nothing stops your default sharding strategies from expecting names
> such as "1" and "2", but other implementations could prefer a different
> naming scheme, and it could be more readable in the configuration files
> when selecting different indexing parameters per shard.

I don't see how a different naming scheme helps solve the concurrency
issues.
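
To separate the two concerns, here is a rough sketch, under stated
assumptions, of the conversion from a category name to a shard index
mentioned further down the thread, extended to hand out a new index when
an unknown category shows up. NamedShardRegistry is a hypothetical class,
not part of Hibernate Search. The naming part is just a map lookup; the
concurrency part is that allocating a new shard number has to be atomic,
and even then this covers only the numbering. Creating and registering the
actual index at runtime, while the HSearch config is fixed at init time, is
the hard part and is not addressed here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical: maps shard names (e.g. category names) to array indexes
// and grows on demand.
final class NamedShardRegistry {
    private final ConcurrentMap<String, Integer> shardByName =
            new ConcurrentHashMap<String, Integer>();
    private int nextShard = 0; // guarded by "this"

    // Returns the shard index for a name, allocating the next free index
    // the first time the name is seen.
    int shardFor(String name) {
        Integer existing = shardByName.get(name); // lock-free fast path
        if (existing != null) {
            return existing;
        }
        synchronized (this) {
            // Re-check under the lock: another thread may have allocated it.
            existing = shardByName.get(name);
            if (existing == null) {
                existing = nextShard++;
                shardByName.put(name, existing);
            }
            return existing;
        }
    }

    // Number of shards allocated so far.
    synchronized int shardCount() {
        return nextShard;
    }
}
```

Whether shards are called "1" and "2" or "cameras" and "phones" changes
nothing in the synchronized block, which is the part that matters for
concurrency.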

>
>>
>>>
>>> Basically all my content is "clustered" in some macrocategories, and
>>> usually the search is done after having selected the category, so it
>>> would be perfect to actually have different indexes per category.
>>> But eventually someone could need to add a new category, and the
>>> ShardingStrategy would then need to write a new empty index.
>>> I would also like the possibility to move away from array indexes to
>>> some other identifier for the shards;
>>
>> I am not sure what you gain from that. In any case, your
>> ShardingStrategy can do the conversion from your category name to the
>> shard index.
>>
>>>
>>> in my specific case I would love to use something like the PK of the
>>> category: this could enable an easy filter selection (the category
>>> could be the parameter of the filter) and enable something like
>>> "cascade delete the index" on category removal.
>>> This could become a special implementation of ShardingStrategy, to be
>>> mandatory when using this kind of filtering?
>>>
>>> btw, I've committed some more fixes for HSEARCH-241
>>>
>>> kind regards,
>>> Sanne
>>
>>