[hibernate-dev] [Search] Dynamic sharding configuration

Tue Sep 24 08:30:41 EDT 2013

On 24 September 2013 14:12, Hardy Ferentschik <hardy at hibernate.org> wrote:
>
> On 24 Jan 2013, at 12:54 PM, Sanne Grinovero <sanne at hibernate.org> wrote:
>
>>> We should for sure try to keep API's stable. On the other hand I don't see why we should not be
>>> able to change SPI contracts. With this super restrictive behaviour we are seriously limiting
>>> our ability to move the software forward.
>>
>> It's a consequence of having decided that this version is going to be
>> very conservative.
>> It won't be always like that, but the more we waste time fighting this decision
>> the more we delay 5+ where you can do these things.
>
> I don't see why we should treat this release different from others regarding policies on do and
> don't do between versions. AFAIU an SPI change between 4.3 and 4.4 is within our policies.
>
>
>>>> We then had some brainstorming on IRC which concluded that it would be
>>>> probably more user friendly to have the __Strategy to:
>>>> a) not return arrays but simple collections
>>>
>>> I am proposing sets. They also have a semantic meaning.
>>
>> Looks good. Devil's advocate: we don't want to allow multiple
>> additions on the same index don't we?
>
> What would that even mean? So instead of a set you are saying it would make sense
> to have a list (or array) in which a given shard appears more than once? I don't see
> a use for this?
>
>
>>>> but that seems very poor in terms of flexibility, it doesn't even
>>>> allow access to the basic properties of the indexed entity. So that
>>>> degenerated in the proposal #c,
>>>
>>> Right, something like
>>>
>>> Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document);
>>>
>>> would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go
>>
>> ah interesting you're thinking of a strategy whish has access to
>> entity && document .
>> I initially had thought only of one class of sharding-strategies
>> working on document only, + one working on entity only.
>
> No, I thought indeed that we could make both available.
>
>>>> Also, even if we move the focus from the Document to the Entity, we
>>>> still don't have the fully loaded entity during a delete operation, so
>>>> we would still need the second method returning multiple indexes.
>>>
>>> Do we? Why? Would that not depend on the delete operation?
>>
>> You often don't have the fully loaded entity during a delete. It's
>> potentially a proxy, having only the Id available.
>> You could have a whole collection of these, and you might also be
>> processing them in a phase where it's not allowed to trigger lazy
>> loading... even disregarding performance concerns I'd rather not enter
>> this arena, not in this version at least.
>
> Ok, fair enough. That's of course a bit of a down turn. However, it also shows
> that no matter how we turn and twist it, for deletion we will have to target all shards
> (except maybe for some contrived border cases)
>
>>> Here is another proposal. We remove
>>>
>>> String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);
>>>
>>> all together. Here is my reasoning. AFAIU, the method is there for the deletion of
>>> documents. In this case we don't have the Lucene document nor the entity and we need
>>> to know in which shard the document to delete is. The assumptions behind this method is
>>> that somehow given the type and id I am able to provide this shard or a subset of the shards.
>>> I doubt, however, that this is practically ever possible. In the end most implementations will
>>> have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or
>>> any other case where I shard depending on a given property of the entity. In this case I will
>>> never be able to make any use of #getShardIdentifiers(Class<?> , Serializable , String)
>>
>> Very good point.
>> It's sad to admit that - considering sharding as a scalability tool -
>> it doesn't scale on delete throughput.
>> (I'm stressing "throughput" as technically people using sharding do it
>> primarily to keep index size reasonable)
>>
>> Loving the idea, let's remove that method from the dynamic sharding options.
>
> :-) Now we are getting somewhere.
>
>>> In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case
>>> for that really? In which use case can the set of targeted shards be limited based on knowing the
>>> type of filers we apply?
>>
>> This is a very usefull feature:
>> http://docs.jboss.org/hibernate/search/4.4/reference/en-US/html_single/#query-filter-shard
>
> ok
>
>>> So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start of with a much
>>> simpler interface. We can indeed mark it as experimental and if the need arises (based on a true use case)
>>> think about optimisations.
>>
>> +1/2 : let's remove the one for deletions, which I think is the main
>> pain point, but keep the one for queries as it is actually important.
>
> ok, but it would return a Set<String> right? Also should we still rename getShardIdentifier to getShardIdentifierForAddition?
>
> Let's sum the things up which are needed for finishing dynamic sharding
>
> 1) deprecate IndexShardingStrategy

+1
I'm  not fully sure if the static one doesn't deserve existing
anymore, especially since we just decided that the dynamic sharding
will not allow fine-level control on deletetions for example.
But let's deprecate it and move on, worst case we can change our mind,
or at least it might generate the feedback we need.

> 2) remove 'String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString)' from ShardIdentifierProvider

+1 we're automatically assuming a deletion needs to be routed to all
identifiers.

> 3) change return types for getShardIdentifiersForQuery and getAllShardIdentifiers to Set<String>

+1
I have to admit I was having bad feelings about this Set<> because of
this being a hotspot and I would love to get rid of object
allocations, *especially* heavy objects such as Set.
The good thing about the other method removal is that it basically
removes the need for dynamic construction of such a Set contents: this
one can easily be cached by an implementor.

> 4) remove the sub-property 'shard_identity_provider'. ShardIdentiferProvider is configured using the existing 'hibernate.search.sharding_strategy'
>      we will need to make the appropriate type checks to instantiate the right things (is this really a good idea?)

+1
Doesn't look hard, let's try.

> 5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition'

Is that needed? I thought that by removing the conflicting method
there would be no further need to clarify the method. I'd propose to
keep the method name as is, we still have the javadocs asset to
clarify how this is all being used for the apprehensive user.

> 6) reflect all this in the docs

+1

>
> Agreed?
>
> --Hardy
>
>
> _______________________________________________
> hibernate-dev mailing list
> hibernate-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hibernate-dev