[hibernate-dev] [Search] Dynamic sharding configuration

Mon Sep 23 18:04:51 EDT 2013

Correct me if I'm wrong, but trying to synthesize this discussion I
think that we're fundamentally agreeing that dynamic sharding is a
"better replacement" for static sharding.
Still, let's keep in mind that this needs to be a backwards compatible
patch, so we're not looking for something disruptive of the current
static sharding feature. If we end up agreeing that the better API
needs disruption, we should still make an incremental change available
in this version - as experimental - so that people can play with it,
then we're free to reset the API as we wish in 5.0 but at least
including the newly acquired experience.

We then had some brainstorming on IRC which concluded that it would be
probably more user friendly to have the __Strategy to:
 a) not return arrays but simple collections
 b) avoid the confusion on the two too similar methods on the proposal
for ShardIdentifierProvider (next paragraph)
 c) not apply the sharding logic based on the Document (fields) but on
the entity (the user type triggering the index event)

As a reminder, the proposal for ShardIdentifierProvider is - omitting
initialize - :

    /**
     * Returns the shard identifier upon addition.
     */
    String getShardIdentifier(Class<?> entity, Serializable id, String
idInString, Document document);

    /**
     * Returns the set of shard identifiers upon deletion.
     */
    String[] getShardIdentifiers(Class<?> entity, Serializable id,
String idInString);

    /**
     * Returns the set of shard identifiers for a query.
     */
    String[] getShardIdentifiersForQuery(FullTextFilterImplementor[]
fullTextFilters);

    /**
     * Returns the list of all known shard identifiers.
     * The list can vary between calls.
     */
    String[] getAllShardIdentifiers();

#a
Ok we can consider that but let's see how the code turns out.
Iterable<String> perhaps?

#b
we explored converging the two methods into the essential one:

   String getShardIdentifier(Class<?> entity, Serializable id, String
idInString);

but that seems very poor in terms of flexibility, it doesn't even
allow access to the basic properties of the indexed entity. So that
degenerated in the proposal #c, but before moving to #c I'd venture
that these methods aren't that bad, they just need good documentation.
Also, even if we move the focus from the Document to the Entity, we
still don't have the fully loaded entity during a delete operation, so
we would still need the second method returning multiple indexes.

#c
drawbacks first:
 = to expose the entity directly has probably some risk (the user
making changes to it), but that's no different to what the user can
already do from a FieldBridge / ClassBridge.
 = won't be able to consider the output of FieldBridge / ClassBridge
instances as you won't have the Document
 = for deletion you still don't have the entity

We could list some conceptually interesting advantages here, but I'd
like to shut down this feature for the time being because it's
significantly different than the goal of providing Dynamic Sharding as
a feature. I'm not saying that it doesn't have merit: its probably
worth exploring for 5+ but it rather seems like an _additional_ level
of sharding that we might want to add in future as an alternative to
the one focusing on the Document approach.

So assuming that sounds reasonable, and that we're on the same page
for a#, let's focus on b#: the duality of the methods for add/remove
during sharding.
My position is that it worked well so far on static sharding, and that
the proposal is quite consistent with it so wouldn't be much of a pain
for people to adapt the new model. Let's try polishing the method
names (and use your imagination for a well written javadoc):

    String getShardIdFromDocument(Class<?> entity, Serializable id,
String idInString, Document document);

    Iterable<String> getShardIdsFromId(Class<?> entity, Serializable
id, String idInString);

or maybe to highlight what's fundamentally different:

    String getShardIdFromContext(Class<?> entity, Serializable id,
String idInString, Document document);

    Iterable<String> getShardIdsFromReducedContext(Class<?> entity,
Serializable id, String idInString);

I especially like the javadoc! well done.
WFYT?

Bonus dilemma: should we stay away from String and define some
"IndexIdentifier" interface ? Note that while design wise it might
look good, this is on a very hot path so while I'd like proposing such
an SPI it would be great if we could find a way in which this would
not require to allocate objects at runtime just to replace a mere
string with a safer type.

Sanne

On 23 September 2013 13:58, Hardy Ferentschik <hardy at hibernate.org> wrote:
>
> On 23 Jan 2013, at 1:55 PM, Sanne Grinovero <sanne at hibernate.org> wrote:
>
>>> Or I set a custom sharding strategy which does not care about the number of shards?
>>
>> I think that's far fetched. The NBR_OF_SHARDS option defines the size
>> of the array of indexes passed to the IndexShardingStrategy so it's
>> hard to ignore. Sure it's possible, we could throw a
>> log.userIsAnIdiotException() but someone might not see the humor :-).
>> Worst case it degenerates in a case similar to your example of
>> "NotShardedStrategy and nbr_of_shards >1", or the user would notice
>> with an ArrayOutOfBounds.. usercode problem.
>
> I think I am thinking about a custom dynamic sharding strategy in this case.
>
>>> IMO the important factor is to set the right sharding strategy and nbr_of_shards should just be a (optional) parameter to the sharding strategy.
>>
>> Note that so far we don't expect users to explicitly set the
>> NotShardedStrategy: it's simply a consequence of not having set any
>> option; if the user sets only the number of shards but omits picking a
>> specific strategy, we automatically assume he's going for the
>> IdHashShardingStrategy
>
> I get that, but as I said, I see it as a bit of misguided "ease of use", since it can introduce
> problems (see above) and is not consistent.
>
>> As soon as a different IndexShardingStrategy is chosen, then I think
>> it's quite self-explanatory that setting NBR_OF_SHARDS is quite
>> useful: the user will have coded an explicit IndexShardingStrategy and
>> consequentially have a clear idea of how many shards he wants, at
>> least for the static sharding so far.
>
> Right, for static sharding maybe, but we are going to make this dynamic now.
>
>>> With dynamic sharding things get more complicated. Right now you configure dynamic sharding by setting 'nbr_of_shards' to the literal 'dynamic'. This selects under the hood the
>>> right IndexShardingStrategy (DynamicShardingStrategy). I find it misleading on multiple levels. First 'dynamic' is not a number and secondly I want to configure a strategy
>>> not the number of shards. It is also inconsistent with how we select/configure other pluggable components in Search. For that reason I suggest:
>>>
>>> - The type of sharding is configured via setting hibernate.search.[indexName].sharding_strategy. 'nbr_of_shards' is a parameter which gets passed to the strategy and which
>>>   might get ignored depending on the sharding implementation. Implementations are free to check the property and e.g. print out a warning if the settings does not apply to them
>>
>> Conceptually it sounds nice.
>> I see two downsides:
>> - it pushes complexity to the IndexShardingStrategy implementor (the
>> user) as he needs to parse it and somehow he needs to request those
>> indexes from the SearchFactory to be built. Pushing both these
>> responsibilities to the end user in exchange for a one-liner in the
>> configuration file seems like an odd choice? I would agree if it was
>> us to write the code, but I really expect most people to plug their
>> own strategy as IdHashShardingStrategy isn't very useful in a real
>> world app.
>
> I think that if the API to create an index manager is simple it basically unifies static
> and dynamic sharding.
>
>> - today we pre-initialize the indexes (IndexManagers) before they are
>> passed to the IndexShardingStrategy # initialize method. We would need
>> to pass instead some lifecycle-controlling objects which allows the
>> user to trigger index initialization. Again I essentially agree but
>> that sounds much like dynamic sharding?
>
> With the right API both would be possible in a simple to understand way.
>
>> I don't think we can change these in the scope of 4.4 as it affects
>> current API. Shall we take this inconsistency point as
>> yet-another-reason to migrate to Dynamic Sharding? While the new
>> feature matures, I suspect it could completely replace the static one.
>
> I think so as well, in which case we need to make sure that we get the API right. A
> new/updated initialise contract might be exactly what we need instead of yet another patch.
> See also my email regarding ShardIdentifierProvider.
>
>>> - We introduce short names for the provided sharding strategies - 'none', 'id-hash', 'dynamic'. This will avoid the need to reference concrete implementation classes
>>
>> -1 : as I reminded above, I don't expect id-hash to be of practical
>> use, people want to plug their own strategies which implies we need
>> the concrete implementation classes. I'd rather see the
>> IdHashShardingStrategy as a concrete example we're providing (not just
>> an example, I guess someone might find it useful in production, I just
>> think it's a minority of the IndexShardingStrategy users).
>
> IdHashShardingStrategy is in use right now, at least if you enable sharding without any other specific
> IndexShardingStrategy implementation. Providing a short name of it is inline with configuration options
> like 'ram' or 'filesystem' for directory provides. We could use 'default' to hide the fact which impl
> we are using. This way we could even replace the impl in case we find a better one.
>
>>> - For dynamic sharding we have the additional sub-property 'shard_identity_provider' which specifies the ShardIdentifierProvider (new contract needed for dynamic sharding).
>>>  This property is only relevant for dynamic sharding and will be handled in the same way as 'nbr_of_shards'
>>
>> To recap today we have
>>    hibernate.search.[indexName].sharding_strategy = [implementation
>> of IndexShardingStrategy]
>>
>> Would it not be nice if I could either specify an implementation of
>> IndexShardingStrategy or a ShardIdentifierProvider ?
>>    hibernate.search.[indexName].sharding_strategy = [implementation
>> of IndexShardingStrategy | ShardIdentifierProvider]
>
> hmm, I have not thought about it this way. So far I was more thinking along the
> lines of removing ShardIdentifierProvider. But you are proposing to keep it
> and maybe in the long run remove IndexShardingStrategy?
>
>> -  being the same property keeps it clear that you can either specify
>> one OR the other.
>
> Except that we are talking about to different interfaces. Hardly good practice to offer this type of
> confguration.
>
> --Hardy