On 24 Jan 2013, at 12:04 AM, Sanne Grinovero <sanne(a)hibernate.org> wrote:
Correct me if I'm wrong, but trying to synthesize this discussion
I
think that we're fundamentally agreeing that dynamic sharding is a
"better replacement" for static sharding.
It has the potential for a replacement I think, but I don't think we have found the
best solution yet.
Also I want to highlight "replacement". If we go for a ShardIdentifierProvider
interface of sorts
we should definitely deprecate IndexShardingStrategy.
Still, let's keep in mind that this needs to be a backwards
compatible
patch, so we're not looking for something disruptive of the current
static sharding feature.
We should for sure try to keep API's stable. On the other hand I don't see why we
should not be
able to change SPI contracts. With this super restrictive behaviour we are seriously
limiting
our ability to move the software forward.
If we end up agreeing that the better API
needs disruption, we should still make an incremental change available
in this version - as experimental - so that people can play with it,
then we're free to reset the API as we wish in 5.0 but at least
including the newly acquired experience.
I keep hearing 5.0 as a some sort of catch all release. I think this thinking is very
dangerous.
Changing more than necessary is not good for the user and trying to attempt to fix all
in one release will defer a potential 5.0 more and more, because there will be so many
things we want to address. 5.0 should be about switching to Lucene 4. Once that is
done we should release.
We then had some brainstorming on IRC which concluded that it would
be
probably more user friendly to have the __Strategy to:
a) not return arrays but simple collections
I am proposing sets. They also have a semantic meaning.
b) avoid the confusion on the two too similar methods on the
proposal
for ShardIdentifierProvider (next paragraph)
+1
c) not apply the sharding logic based on the Document (fields) but
on
the entity (the user type triggering the index event)
+1 I think there is a big potential in this idea.
#a
Ok we can consider that but let's see how the code turns out.
Iterable<String> perhaps?
Why not Set<String>?
#b
we explored converging the two methods into the essential one:
String getShardIdentifier(Class<?> entity, Serializable id, String
idInString);
The single method could also look like this:
Set<String> getShardIdentifiers(Class<?> entity, Serializable id, String
idInString, Document document);
where document could potentially be null.
but that seems very poor in terms of flexibility, it doesn't
even
allow access to the basic properties of the indexed entity. So that
degenerated in the proposal #c,
Right, something like
Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString,
Document document);
would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even
go
String getShardIdentifier( Object entity, Serializable id, String idInString, Document
document);
If I understand correctly the reason for returning a set of shard identifiers for the case
of deletion of documents
is that in this case the user might not be able to determine the exact shard in which the
document to delete is
(since he does not have a Lucene document). If we pass the entity itself we should be fine
for addition and deletion.
Provided of course that we have the entity as deletion as well. However, that might not be
the case (see below).
but before moving to #c I'd venture
that these methods aren't that bad, they just need good documentation.
Is this what we are aiming for our days? "Not bad"? Mediocre? I want to be able
to say that we have good interfaces and
we did what we could to find the best solution possible.
Also, even if we move the focus from the Document to the Entity, we
still don't have the fully loaded entity during a delete operation, so
we would still need the second method returning multiple indexes.
Do we? Why? Would that not depend on the delete operation?
#c
drawbacks first:
= to expose the entity directly has probably some risk (the user
making changes to it), but that's no different to what the user can
already do from a FieldBridge / ClassBridge.
I don't see a drawback here. This is exactly what we do with class bridges. Giving the
user
access to the entity gives him most flexibility. Even using plain Hibernate / JPA APIs you
have to comply with serval rules in order to make things work.
= won't be able to consider the output of FieldBridge /
ClassBridge
instances as you won't have the Document
We could pass the Document as well. We could call ShardIdentifierProvider just after
building the
document when we still have the entity itself and the document (in some cases it might be
null of course).
= for deletion you still don't have the entity
Again, does this not depend on type of deletion?
We could list some conceptually interesting advantages here, but
I'd
like to shut down this feature for the time being because it's
significantly different than the goal of providing Dynamic Sharding as
a feature.
Why is this "significantly different"? Dynamic sharding is a new feature one way
or another.
I'm not saying that it doesn't have merit: its probably
worth exploring for 5+ but it rather seems like an _additional_ level
of sharding that we might want to add in future as an alternative to
the one focusing on the Document approach.
Again, it could be combination of both.
My position is that it worked well so far on static sharding, and
that
the proposal is quite consistent with it so wouldn't be much of a pain
for people to adapt the new model.
Again, are we looking for a mediocre or a good solution?
Let's try polishing the method
names (and use your imagination for a well written javadoc):
String getShardIdFromDocument(Class<?> entity, Serializable id,
String idInString, Document document);
Iterable<String> getShardIdsFromId(Class<?> entity, Serializable
id, String idInString);
or maybe to highlight what's fundamentally different:
String getShardIdFromContext(Class<?> entity, Serializable id,
String idInString, Document document);
Iterable<String> getShardIdsFromReducedContext(Class<?> entity,
Serializable id, String idInString);
-1 to both.
Here is another proposal. We remove
String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);
all together. Here is my reasoning. AFAIU, the method is there for the deletion of
documents. In this case we don't have the Lucene document nor the entity and we need
to know in which shard the document to delete is. The assumptions behind this method is
that somehow given the type and id I am able to provide this shard or a subset of the
shards.
I doubt, however, that this is practically ever possible. In the end most implementations
will
have to just delegate to getAllShardIdentifiers() anyways. Take the language code example
or
any other case where I shard depending on a given property of the entity. In this case I
will
never be able to make any use of #getShardIdentifiers(Class<?> , Serializable ,
String)
In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use
case
for that really? In which use case can the set of targeted shards be limited based on
knowing the
type of filers we apply?
So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start of with
a much
simpler interface. We can indeed mark it as experimental and if the need arises (based on
a true use case)
think about optimisations.
The more I think about it, the more I like this more minimalistic approach.
Bonus dilemma: should we stay away from String and define some
"IndexIdentifier" interface ?
Interesting idea, but that for sure is a disruptive change, since it will affect many
APIs, unless you only
introduce IndexIdentifier for the sharding case in which case I vote with a big -1
--Hardy