[hibernate-dev] [Search] Dynamic sharding configuration
hardy at hibernate.org
Tue Sep 24 04:51:06 EDT 2013
On 24 Jan 2013, at 12:04 AM, Sanne Grinovero <sanne at hibernate.org> wrote:
> Correct me if I'm wrong, but trying to synthesize this discussion I
> think that we're fundamentally agreeing that dynamic sharding is a
> "better replacement" for static sharding.
It has the potential for a replacement I think, but I don't think we have found the best solution yet.
Also I want to highlight "replacement". If we go for a ShardIdentifierProvider interface of sorts
we should definitely deprecate IndexShardingStrategy.
> Still, let's keep in mind that this needs to be a backwards compatible
> patch, so we're not looking for something disruptive of the current
> static sharding feature.
We should for sure try to keep API's stable. On the other hand I don't see why we should not be
able to change SPI contracts. With this super restrictive behaviour we are seriously limiting
our ability to move the software forward.
> If we end up agreeing that the better API
> needs disruption, we should still make an incremental change available
> in this version - as experimental - so that people can play with it,
> then we're free to reset the API as we wish in 5.0 but at least
> including the newly acquired experience.
I keep hearing 5.0 as a some sort of catch all release. I think this thinking is very dangerous.
Changing more than necessary is not good for the user and trying to attempt to fix all
in one release will defer a potential 5.0 more and more, because there will be so many
things we want to address. 5.0 should be about switching to Lucene 4. Once that is
done we should release.
> We then had some brainstorming on IRC which concluded that it would be
> probably more user friendly to have the __Strategy to:
> a) not return arrays but simple collections
I am proposing sets. They also have a semantic meaning.
> b) avoid the confusion on the two too similar methods on the proposal
> for ShardIdentifierProvider (next paragraph)
> c) not apply the sharding logic based on the Document (fields) but on
> the entity (the user type triggering the index event)
+1 I think there is a big potential in this idea.
> Ok we can consider that but let's see how the code turns out.
> Iterable<String> perhaps?
Why not Set<String>?
> we explored converging the two methods into the essential one:
> String getShardIdentifier(Class<?> entity, Serializable id, String
The single method could also look like this:
Set<String> getShardIdentifiers(Class<?> entity, Serializable id, String idInString, Document document);
where document could potentially be null.
> but that seems very poor in terms of flexibility, it doesn't even
> allow access to the basic properties of the indexed entity. So that
> degenerated in the proposal #c,
Right, something like
Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document);
would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go
String getShardIdentifier( Object entity, Serializable id, String idInString, Document document);
If I understand correctly the reason for returning a set of shard identifiers for the case of deletion of documents
is that in this case the user might not be able to determine the exact shard in which the document to delete is
(since he does not have a Lucene document). If we pass the entity itself we should be fine for addition and deletion.
Provided of course that we have the entity as deletion as well. However, that might not be the case (see below).
> but before moving to #c I'd venture
> that these methods aren't that bad, they just need good documentation.
Is this what we are aiming for our days? "Not bad"? Mediocre? I want to be able to say that we have good interfaces and
we did what we could to find the best solution possible.
> Also, even if we move the focus from the Document to the Entity, we
> still don't have the fully loaded entity during a delete operation, so
> we would still need the second method returning multiple indexes.
Do we? Why? Would that not depend on the delete operation?
> drawbacks first:
> = to expose the entity directly has probably some risk (the user
> making changes to it), but that's no different to what the user can
> already do from a FieldBridge / ClassBridge.
I don't see a drawback here. This is exactly what we do with class bridges. Giving the user
access to the entity gives him most flexibility. Even using plain Hibernate / JPA APIs you
have to comply with serval rules in order to make things work.
> = won't be able to consider the output of FieldBridge / ClassBridge
> instances as you won't have the Document
We could pass the Document as well. We could call ShardIdentifierProvider just after building the
document when we still have the entity itself and the document (in some cases it might be null of course).
> = for deletion you still don't have the entity
Again, does this not depend on type of deletion?
> We could list some conceptually interesting advantages here, but I'd
> like to shut down this feature for the time being because it's
> significantly different than the goal of providing Dynamic Sharding as
> a feature.
Why is this "significantly different"? Dynamic sharding is a new feature one way or another.
> I'm not saying that it doesn't have merit: its probably
> worth exploring for 5+ but it rather seems like an _additional_ level
> of sharding that we might want to add in future as an alternative to
> the one focusing on the Document approach.
Again, it could be combination of both.
> My position is that it worked well so far on static sharding, and that
> the proposal is quite consistent with it so wouldn't be much of a pain
> for people to adapt the new model.
Again, are we looking for a mediocre or a good solution?
> Let's try polishing the method
> names (and use your imagination for a well written javadoc):
> String getShardIdFromDocument(Class<?> entity, Serializable id,
> String idInString, Document document);
> Iterable<String> getShardIdsFromId(Class<?> entity, Serializable
> id, String idInString);
> or maybe to highlight what's fundamentally different:
> String getShardIdFromContext(Class<?> entity, Serializable id,
> String idInString, Document document);
> Iterable<String> getShardIdsFromReducedContext(Class<?> entity,
> Serializable id, String idInString);
-1 to both.
Here is another proposal. We remove
String getShardIdentifiers(Class<?> entity, Serializable id, String idInString);
all together. Here is my reasoning. AFAIU, the method is there for the deletion of
documents. In this case we don't have the Lucene document nor the entity and we need
to know in which shard the document to delete is. The assumptions behind this method is
that somehow given the type and id I am able to provide this shard or a subset of the shards.
I doubt, however, that this is practically ever possible. In the end most implementations will
have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or
any other case where I shard depending on a given property of the entity. In this case I will
never be able to make any use of #getShardIdentifiers(Class<?> , Serializable , String)
In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case
for that really? In which use case can the set of targeted shards be limited based on knowing the
type of filers we apply?
So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start of with a much
simpler interface. We can indeed mark it as experimental and if the need arises (based on a true use case)
think about optimisations.
The more I think about it, the more I like this more minimalistic approach.
> Bonus dilemma: should we stay away from String and define some
> "IndexIdentifier" interface ?
Interesting idea, but that for sure is a disruptive change, since it will affect many APIs, unless you only
introduce IndexIdentifier for the sharding case in which case I vote with a big -1
More information about the hibernate-dev