[hibernate-dev] [Search] Dynamic sharding configuration

Tue Sep 24 08:12:21 EDT 2013

On 24 Jan 2013, at 12:54 PM, Sanne Grinovero <sanne at hibernate.org> wrote:

>> We should for sure try to keep APIs stable. On the other hand I don't see why we should not be
>> able to change SPI contracts. With this super restrictive behaviour we are seriously limiting
>> our ability to move the software forward.
> 
> It's a consequence of having decided that this version is going to be
> very conservative.
> It won't be always like that, but the more we waste time fighting this decision
> the more we delay 5+ where you can do these things.

I don't see why we should treat this release differently from others regarding our policies on what
we do and don't do between versions. AFAIU an SPI change between 4.3 and 4.4 is within our policies.

>>> We then had some brainstorming on IRC which concluded that it would be
>>> probably more user friendly to have the __Strategy to:
>>> a) not return arrays but simple collections
>> 
>> I am proposing sets. They also have a semantic meaning.
> 
> Looks good. Devil's advocate: we don't want to allow multiple
> additions on the same index, do we?

What would that even mean? So instead of a set you are saying it would make sense
to have a list (or array) in which a given shard appears more than once? I don't see
a use for this.

>>> but that seems very poor in terms of flexibility, it doesn't even
>>> allow access to the basic properties of the indexed entity. So that
>>> degenerated in the proposal #c,
>> 
>> Right, something like
>> 
>> Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document);
>> 
>> would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go
> 
> ah interesting, you're thinking of a strategy which has access to
> entity && document.
> I initially had thought only of one class of sharding-strategies
> working on document only, + one working on entity only.

No, I thought indeed that we could make both available.

>>> Also, even if we move the focus from the Document to the Entity, we
>>> still don't have the fully loaded entity during a delete operation, so
>>> we would still need the second method returning multiple indexes.
>> 
>> Do we? Why? Would that not depend on the delete operation?
> 
> You often don't have the fully loaded entity during a delete. It's
> potentially a proxy, having only the Id available.
> You could have a whole collection of these, and you might also be
> processing them in a phase where it's not allowed to trigger lazy
> loading... even disregarding performance concerns I'd rather not enter
> this arena, not in this version at least.

Ok, fair enough. That's of course a bit of a letdown. However, it also shows
that no matter how we twist and turn it, for deletion we will have to target all shards
(except maybe for some contrived border cases).
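
To make the consequence concrete, here is a minimal sketch of a delete that fans out to
every shard. It is a hypothetical in-memory model, not Hibernate Search code: the class
InMemoryShards and its methods are made up for illustration, and a shard is reduced to a
set of document ids.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model: shard identifier -> ids of the documents stored in it.
// It only illustrates the delete fan-out and is not part of any real API.
final class InMemoryShards {
    private final Map<String, Set<String>> shards = new LinkedHashMap<>();

    // On addition the entity is available, so one of its properties
    // (here a language code) can select exactly one shard.
    void add(String id, String languageCode) {
        shards.computeIfAbsent(languageCode, k -> new LinkedHashSet<>()).add(id);
    }

    // On deletion only the id is known, so the delete is sent to every shard.
    // Returns the number of shards that had to be visited.
    int delete(String id) {
        int visited = 0;
        for (Set<String> shard : shards.values()) {
            shard.remove(id);
            visited++;
        }
        return visited;
    }

    // True if any shard still holds a document with this id.
    boolean contains(String id) {
        for (Set<String> shard : shards.values()) {
            if (shard.contains(id)) {
                return true;
            }
        }
        return false;
    }
}
```

With two language shards, deleting a single id visits both of them, which is the cost we
are accepting for this version.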

>> Here is another proposal. We remove
>> 
>> String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);
>> 
>> altogether. Here is my reasoning. AFAIU, the method is there for the deletion of
>> documents. In this case we don't have the Lucene document nor the entity and we need
>> to know in which shard the document to delete is. The assumption behind this method is
>> that somehow given the type and id I am able to provide this shard or a subset of the shards.
>> I doubt, however, that this is practically ever possible. In the end most implementations will
>> have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or
>> any other case where I shard depending on a given property of the entity. In this case I will
>> never be able to make any use of #getShardIdentifiers(Class<?> , Serializable , String).
> 
> Very good point.
> It's sad to admit that - considering sharding as a scalability tool -
> it doesn't scale on delete throughput.
> (I'm stressing "throughput" as technically people using sharding do it
> primarily to keep index size reasonable)
> 
> Loving the idea, let's remove that method from the dynamic sharding options.

:-) Now we are getting somewhere.

>> In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case
>> for that really? In which use case can the set of targeted shards be limited based on knowing the
>> type of filters we apply?
> 
> This is a very useful feature:
> http://docs.jboss.org/hibernate/search/4.4/reference/en-US/html_single/#query-filter-shard

ok
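
For reference, a rough sketch of the kind of use case the linked section describes: when
a filter identifies one customer and the index is sharded per customer, the query only
needs that customer's shard. Everything here is a stand-in: CustomerShardSelector is a
hypothetical class, the filter name "customer" is an assumption, and a filter is reduced
to a name/value pair instead of the real filter objects.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: the real method receives the enabled full-text filters;
// here an enabled filter is reduced to its name plus a single parameter value.
final class CustomerShardSelector {
    private final Set<String> allShards;

    CustomerShardSelector(Set<String> allShards) {
        this.allShards = Collections.unmodifiableSet(new LinkedHashSet<>(allShards));
    }

    // enabledFilters maps filter name -> parameter value.
    // If a "customer" filter names a known shard, only that shard is queried;
    // otherwise the query falls back to all shards.
    Set<String> getShardIdentifiersForQuery(Map<String, String> enabledFilters) {
        String customerId = enabledFilters.get("customer");
        if (customerId != null && allShards.contains(customerId)) {
            return Collections.singleton(customerId);
        }
        return allShards;
    }
}
```

Returning a Set here also fits the point made earlier: a shard either is or is not a
target of the query.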

>> So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start off with a much
>> simpler interface. We can indeed mark it as experimental and if the need arises (based on a true use case)
>> think about optimisations.
> 
> +1/2 : let's remove the one for deletions, which I think is the main
> pain point, but keep the one for queries as it is actually important.

ok, but it would return a Set<String>, right? Also, should we still rename getShardIdentifier to getShardIdentifierForAddition?

Let's sum up what is needed to finish dynamic sharding:

1) deprecate IndexShardingStrategy
2) remove 'String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString)' from ShardIdentifierProvider
3) change return types for getShardIdentifiersForQuery and getAllShardIdentifiers to Set<String>
4) remove the sub-property 'shard_identity_provider'. ShardIdentifierProvider is configured using the existing 'hibernate.search.sharding_strategy';
     we will need to make the appropriate type checks to instantiate the right things (is this really a good idea?)
5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition'
6) reflect all this in the docs
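
As a rough sketch of where steps 1 to 5 would leave us (not the final code): the
parameter types below are assumptions, with JDK maps standing in for the Lucene document
and the enabled filters, and ShardingConfig with its kindOf method is a made-up helper
that only illustrates the type check from step 4.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the two contracts. Only the shapes discussed in this
// thread are kept; Hibernate Search and Lucene types are replaced by JDK types.

// Step 1: the static strategy stays for existing users but is deprecated.
@Deprecated
interface IndexShardingStrategy {
}

interface ShardIdentifierProvider {
    // Step 5: renamed from getShardIdentifier. The Map stands in for the Document.
    String getShardIdentifierForAddition(Class<?> entityType, Serializable id,
            String idInString, Map<String, String> document);

    // Step 3: Set<String> instead of String[]. The Map stands in for the filters.
    Set<String> getShardIdentifiersForQuery(Map<String, String> enabledFilters);

    // Step 3: Set<String> instead of String[]. Also used for deletions, since
    // step 2 removes getShardIdentifiers(Class<?>, Serializable, String).
    Set<String> getAllShardIdentifiers();
}

// Step 4: one property names the class; its type decides what gets instantiated.
final class ShardingConfig {
    enum Kind { STATIC, DYNAMIC }

    static Kind kindOf(Class<?> configured) {
        if (ShardIdentifierProvider.class.isAssignableFrom(configured)) {
            return Kind.DYNAMIC;
        }
        if (IndexShardingStrategy.class.isAssignableFrom(configured)) {
            return Kind.STATIC;
        }
        throw new IllegalArgumentException(
                configured.getName() + " implements neither sharding contract");
    }
}
```

The type check itself is small; the open question in step 4 is rather whether one
property carrying two unrelated contracts is clear enough for users.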

Agreed?

--Hardy