[hibernate-dev] [Search] Dynamic sharding configuration

Tue Sep 24 06:54:02 EDT 2013

On 24 September 2013 09:51, Hardy Ferentschik <hardy at hibernate.org> wrote:
>
> On 24 Jan 2013, at 12:04 AM, Sanne Grinovero <sanne at hibernate.org> wrote:
>
>> Correct me if I'm wrong, but trying to synthesize this discussion I
>> think that we're fundamentally agreeing that dynamic sharding is a
>> "better replacement" for static sharding.
>
> It has the potential for a replacement I think, but I don't think we have found the best solution yet.
> Also I want to highlight "replacement". If we go for a ShardIdentifierProvider interface of sorts
> we should definitely deprecate IndexShardingStrategy.
>
>> Still, let's keep in mind that this needs to be a backwards compatible
>> patch, so we're not looking for something disruptive of the current
>> static sharding feature.
>
> We should for sure try to keep API's stable. On the other hand I don't see why we should not be
> able to change SPI contracts. With this super restrictive behaviour we are seriously limiting
> our ability to move the software forward.

It's a consequence of having decided that this version is going to be
very conservative.
It won't be always like that, but the more we waste time fighting this decision
the more we delay 5+ where you can do these things.

>> If we end up agreeing that the better API
>> needs disruption, we should still make an incremental change available
>> in this version - as experimental - so that people can play with it,
>> then we're free to reset the API as we wish in 5.0 but at least
>> including the newly acquired experience.
>
> I keep hearing 5.0 as a some sort of catch all release. I think this thinking is very dangerous.
> Changing more than necessary is not good for the user and trying to attempt to fix all
> in one release will defer a potential 5.0 more and more, because there will be so many
> things we want to address. 5.0 should be about switching to Lucene 4. Once that is
> done we should release.

I hear you. Just saying that - on top of Lucene 4 - we will be able to
re-discuss API/SPI.
It's not mandatory we do that :-)

>
>> We then had some brainstorming on IRC which concluded that it would be
>> probably more user friendly to have the __Strategy to:
>> a) not return arrays but simple collections
>
> I am proposing sets. They also have a semantic meaning.

Looks good. Devil's advocate: we don't want to allow multiple
additions on the same index don't we?

>
>> b) avoid the confusion on the two too similar methods on the proposal
>> for ShardIdentifierProvider (next paragraph)
>
> +1
>
>> c) not apply the sharding logic based on the Document (fields) but on
>> the entity (the user type triggering the index event)
>
> +1 I think there is a big potential in this idea.
>
>> #a
>> Ok we can consider that but let's see how the code turns out.
>> Iterable<String> perhaps?
>
> Why not Set<String>?

+1

>> #b
>> we explored converging the two methods into the essential one:
>>
>>   String getShardIdentifier(Class<?> entity, Serializable id, String
>> idInString);
>
> The single method could also look like this:
>
> Set<String> getShardIdentifiers(Class<?> entity, Serializable id, String idInString, Document document);
>
> where document could potentially be null.

Having a parameter as "potentially null" is imo horrible. Especially
since we expect a Set with a single element only when the Document is
not null.

Sorry I wish we had a better idea, but given the alternatives I'm much
preferring the status quo version.
I don't think it's that bad, especially as I think we all agree it's
better to code two short methods than a single more complex one.

>> but that seems very poor in terms of flexibility, it doesn't even
>> allow access to the basic properties of the indexed entity. So that
>> degenerated in the proposal #c,
>
> Right, something like
>
> Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document);
>
> would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go

ah interesting you're thinking of a strategy whish has access to
entity && document .
I initially had thought only of one class of sharding-strategies
working on document only, + one working on entity only.

Your proposal is likely the most flexible one user-wise but we need to
see if we can find an appropriate point in the transformation chain
for this to be plugged; my only concern would be that it limits our
internal design options significantly in the longer term.

+1 to consider it in future. Will you open an issue for this?

> String getShardIdentifier( Object entity, Serializable id, String idInString, Document document);
>
> If I understand correctly the reason for returning a set of shard identifiers for the case of deletion of documents
> is that in this case the user might not be able to determine the exact shard in which the document to delete is
> (since he does not have a Lucene document). If we pass the entity itself we should be fine for addition and deletion.
> Provided of course that we have the entity as deletion as well. However, that might not be the case (see below).

No we don't have the entity at deletion, you would have it only by
forcing ORM to load entities which are deleted by id (including
collections), which is not nice in terms of database roundtrips.

>> but before moving to #c I'd venture
>> that these methods aren't that bad, they just need good documentation.
>
> Is this what we are aiming for our days? "Not bad"? Mediocre? I want to be able to say that we have good interfaces and
> we did what we could to find the best solution possible.

Since we don't have better options I indeed think it's the best
solution possible. There are technical limitations for sure.
[had not read mail until the bottom yet]

>> Also, even if we move the focus from the Document to the Entity, we
>> still don't have the fully loaded entity during a delete operation, so
>> we would still need the second method returning multiple indexes.
>
> Do we? Why? Would that not depend on the delete operation?

You often don't have the fully loaded entity during a delete. It's
potentially a proxy, having only the Id available.
You could have a whole collection of these, and you might also be
processing them in a phase where it's not allowed to trigger lazy
loading... even disregarding performance concerns I'd rather not enter
this arena, not in this version at least.

>
>> #c
>> drawbacks first:
>> = to expose the entity directly has probably some risk (the user
>> making changes to it), but that's no different to what the user can
>> already do from a FieldBridge / ClassBridge.
>
> I don't see a drawback here. This is exactly what we do with class bridges. Giving the user
> access to the entity gives him most flexibility. Even using plain Hibernate / JPA APIs you
> have to comply with serval rules in order to make things work.

Ok, right. Being overzealous listing all drawbacks :)

>
>> = won't be able to consider the output of FieldBridge / ClassBridge
>> instances as you won't have the Document
>
> We could pass the Document as well. We could call ShardIdentifierProvider just after building the
> document when we still have the entity itself and the document (in some cases it might be null of course).

Yes that sounds interesting, but :
 = how long will it take you to propose such a patch?
 = is it worth your time, considering that people have never
complained on the limitations of the static sharding SPI (other than
being static) ?

>
>> = for deletion you still don't have the entity
>
> Again, does this not depend on type of deletion?
>
>> We could list some conceptually interesting advantages here, but I'd
>> like to shut down this feature for the time being because it's
>> significantly different than the goal of providing Dynamic Sharding as
>> a feature.
>
> Why is this "significantly different"? Dynamic sharding is a new feature one way or another.

you can try building it, but as above I'm concerned that this
refactoring is more complex than you expect and not particularly
useful for the end user.
All what we're asked is to make the static sharding service support
"dynamic shards".
And additional requirement is that static sharding is working very
well today and so we're not going to break it.

>
>> I'm not saying that it doesn't have merit: its probably
>> worth exploring for 5+ but it rather seems like an _additional_ level
>> of sharding that we might want to add in future as an alternative to
>> the one focusing on the Document approach.
>
> Again, it could be combination of both.
+1

>
>> My position is that it worked well so far on static sharding, and that
>> the proposal is quite consistent with it so wouldn't be much of a pain
>> for people to adapt the new model.
>
> Again, are we looking for a mediocre or a good solution?
>
>> Let's try polishing the method
>> names (and use your imagination for a well written javadoc):
>>
>>    String getShardIdFromDocument(Class<?> entity, Serializable id,
>> String idInString, Document document);
>>
>>    Iterable<String> getShardIdsFromId(Class<?> entity, Serializable
>> id, String idInString);
>>
>> or maybe to highlight what's fundamentally different:
>>
>>    String getShardIdFromContext(Class<?> entity, Serializable id,
>> String idInString, Document document);
>>
>>    Iterable<String> getShardIdsFromReducedContext(Class<?> entity,
>> Serializable id, String idInString);
>
> -1 to both.
>
> Here is another proposal. We remove
>
> String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);
>
> all together. Here is my reasoning. AFAIU, the method is there for the deletion of
> documents. In this case we don't have the Lucene document nor the entity and we need
> to know in which shard the document to delete is. The assumptions behind this method is
> that somehow given the type and id I am able to provide this shard or a subset of the shards.
> I doubt, however, that this is practically ever possible. In the end most implementations will
> have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or
> any other case where I shard depending on a given property of the entity. In this case I will
> never be able to make any use of #getShardIdentifiers(Class<?> , Serializable , String)

Very good point.
It's sad to admit that - considering sharding as a scalability tool -
it doesn't scale on delete throughput.
(I'm stressing "throughput" as technically people using sharding do it
primarily to keep index size reasonable)

Loving the idea, let's remove that method from the dynamic sharding options.

>
> In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case
> for that really? In which use case can the set of targeted shards be limited based on knowing the
> type of filers we apply?

This is a very usefull feature:
http://docs.jboss.org/hibernate/search/4.4/reference/en-US/html_single/#query-filter-shard

>
> So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start of with a much
> simpler interface. We can indeed mark it as experimental and if the need arises (based on a true use case)
> think about optimisations.

+1/2 : let's remove the one for deletions, which I think is the main
pain point, but keep the one for queries as it is actually important.

>
> The more I think about it, the more I like this more minimalistic approach.

+1

>
>> Bonus dilemma: should we stay away from String and define some
>> "IndexIdentifier" interface ?
>
> Interesting idea, but that for sure is a disruptive change, since it will affect many APIs, unless you only
> introduce IndexIdentifier for the sharding case in which case I vote with a big -1

Agreed.

>
> --Hardy
>
>
> _______________________________________________
> hibernate-dev mailing list
> hibernate-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hibernate-dev