[hibernate-dev] [Search] Dynamic sharding configuration

Emmanuel Bernard emmanuel at hibernate.org
Wed Oct 2 09:21:28 EDT 2013


What does Iterable<String> give you over String[]?

On Mon 2013-09-23 23:04, Sanne Grinovero wrote:
> Correct me if I'm wrong, but trying to synthesize this discussion I
> think that we're fundamentally agreeing that dynamic sharding is a
> "better replacement" for static sharding.
> Still, let's keep in mind that this needs to be a backwards compatible
> patch, so we're not looking for something disruptive of the current
> static sharding feature. If we end up agreeing that the better API
> needs disruption, we should still make an incremental change available
> in this version - as experimental - so that people can play with it,
> then we're free to reset the API as we wish in 5.0 but at least
> including the newly acquired experience.
> 
> We then had some brainstorming on IRC which concluded that it would be
> probably more user friendly to have the __Strategy to:
>  a) not return arrays but simple collections
>  b) avoid the confusion on the two too similar methods on the proposal
> for ShardIdentifierProvider (next paragraph)
>  c) not apply the sharding logic based on the Document (fields) but on
> the entity (the user type triggering the index event)
> 
> As a reminder, the proposal for ShardIdentifierProvider is - omitting
> initialize - :
> 
>     /**
>      * Returns the shard identifier upon addition.
>      */
>     String getShardIdentifier(Class<?> entity, Serializable id, String
> idInString, Document document);
> 
>     /**
>      * Returns the set of shard identifiers upon deletion.
>      */
>     String[] getShardIdentifiers(Class<?> entity, Serializable id,
> String idInString);
> 
>     /**
>      * Returns the set of shard identifiers for a query.
>      */
>     String[] getShardIdentifiersForQuery(FullTextFilterImplementor[]
> fullTextFilters);
> 
>     /**
>      * Returns the list of all known shard identifiers.
>      * The list can vary between calls.
>      */
>     String[] getAllShardIdentifiers();
> 
> #a
> Ok we can consider that but let's see how the code turns out.
> Iterable<String> perhaps?
> 
> #b
> we explored converging the two methods into the essential one:
> 
>    String getShardIdentifier(Class<?> entity, Serializable id, String
> idInString);
> 
> but that seems very poor in terms of flexibility, it doesn't even
> allow access to the basic properties of the indexed entity. So that
> degenerated in the proposal #c, but before moving to #c I'd venture
> that these methods aren't that bad, they just need good documentation.
> Also, even if we move the focus from the Document to the Entity, we
> still don't have the fully loaded entity during a delete operation, so
> we would still need the second method returning multiple indexes.
> 
> #c
> drawbacks first:
>  = to expose the entity directly has probably some risk (the user
> making changes to it), but that's no different to what the user can
> already do from a FieldBridge / ClassBridge.
>  = won't be able to consider the output of FieldBridge / ClassBridge
> instances as you won't have the Document
>  = for deletion you still don't have the entity
> 
> We could list some conceptually interesting advantages here, but I'd
> like to shut down this feature for the time being because it's
> significantly different than the goal of providing Dynamic Sharding as
> a feature. I'm not saying that it doesn't have merit: its probably
> worth exploring for 5+ but it rather seems like an _additional_ level
> of sharding that we might want to add in future as an alternative to
> the one focusing on the Document approach.
> 
> So assuming that sounds reasonable, and that we're on the same page
> for a#, let's focus on b#: the duality of the methods for add/remove
> during sharding.
> My position is that it worked well so far on static sharding, and that
> the proposal is quite consistent with it so wouldn't be much of a pain
> for people to adapt the new model. Let's try polishing the method
> names (and use your imagination for a well written javadoc):
> 
>     String getShardIdFromDocument(Class<?> entity, Serializable id,
> String idInString, Document document);
> 
>     Iterable<String> getShardIdsFromId(Class<?> entity, Serializable
> id, String idInString);
> 
> or maybe to highlight what's fundamentally different:
> 
>     String getShardIdFromContext(Class<?> entity, Serializable id,
> String idInString, Document document);
> 
>     Iterable<String> getShardIdsFromReducedContext(Class<?> entity,
> Serializable id, String idInString);
> 
> 
> I especially like the javadoc! well done.
> WFYT?
> 
> 
> Bonus dilemma: should we stay away from String and define some
> "IndexIdentifier" interface ? Note that while design wise it might
> look good, this is on a very hot path so while I'd like proposing such
> an SPI it would be great if we could find a way in which this would
> not require to allocate objects at runtime just to replace a mere
> string with a safer type.
> 
> Sanne
> 
> 
> 
> On 23 September 2013 13:58, Hardy Ferentschik <hardy at hibernate.org> wrote:
> >
> > On 23 Jan 2013, at 1:55 PM, Sanne Grinovero <sanne at hibernate.org> wrote:
> >
> >>> Or I set a custom sharding strategy which does not care about the number of shards?
> >>
> >> I think that's far fetched. The NBR_OF_SHARDS option defines the size
> >> of the array of indexes passed to the IndexShardingStrategy so it's
> >> hard to ignore. Sure it's possible, we could throw a
> >> log.userIsAnIdiotException() but someone might not see the humor :-).
> >> Worst case it degenerates in a case similar to your example of
> >> "NotShardedStrategy and nbr_of_shards >1", or the user would notice
> >> with an ArrayOutOfBounds.. usercode problem.
> >
> > I think I am thinking about a custom dynamic sharding strategy in this case.
> >
> >>> IMO the important factor is to set the right sharding strategy and nbr_of_shards should just be a (optional) parameter to the sharding strategy.
> >>
> >> Note that so far we don't expect users to explicitly set the
> >> NotShardedStrategy: it's simply a consequence of not having set any
> >> option; if the user sets only the number of shards but omits picking a
> >> specific strategy, we automatically assume he's going for the
> >> IdHashShardingStrategy
> >
> > I get that, but as I said, I see it as a bit of misguided "ease of use", since it can introduce
> > problems (see above) and is not consistent.
> >
> >> As soon as a different IndexShardingStrategy is chosen, then I think
> >> it's quite self-explanatory that setting NBR_OF_SHARDS is quite
> >> useful: the user will have coded an explicit IndexShardingStrategy and
> >> consequentially have a clear idea of how many shards he wants, at
> >> least for the static sharding so far.
> >
> > Right, for static sharding maybe, but we are going to make this dynamic now.
> >
> >>> With dynamic sharding things get more complicated. Right now you configure dynamic sharding by setting 'nbr_of_shards' to the literal 'dynamic'. This selects under the hood the
> >>> right IndexShardingStrategy (DynamicShardingStrategy). I find it misleading on multiple levels. First 'dynamic' is not a number and secondly I want to configure a strategy
> >>> not the number of shards. It is also inconsistent with how we select/configure other pluggable components in Search. For that reason I suggest:
> >>>
> >>> - The type of sharding is configured via setting hibernate.search.[indexName].sharding_strategy. 'nbr_of_shards' is a parameter which gets passed to the strategy and which
> >>>   might get ignored depending on the sharding implementation. Implementations are free to check the property and e.g. print out a warning if the settings does not apply to them
> >>
> >> Conceptually it sounds nice.
> >> I see two downsides:
> >> - it pushes complexity to the IndexShardingStrategy implementor (the
> >> user) as he needs to parse it and somehow he needs to request those
> >> indexes from the SearchFactory to be built. Pushing both these
> >> responsibilities to the end user in exchange for a one-liner in the
> >> configuration file seems like an odd choice? I would agree if it was
> >> us to write the code, but I really expect most people to plug their
> >> own strategy as IdHashShardingStrategy isn't very useful in a real
> >> world app.
> >
> > I think that if the API to create an index manager is simple it basically unifies static
> > and dynamic sharding.
> >
> >> - today we pre-initialize the indexes (IndexManagers) before they are
> >> passed to the IndexShardingStrategy # initialize method. We would need
> >> to pass instead some lifecycle-controlling objects which allows the
> >> user to trigger index initialization. Again I essentially agree but
> >> that sounds much like dynamic sharding?
> >
> > With the right API both would be possible in a simple to understand way.
> >
> >> I don't think we can change these in the scope of 4.4 as it affects
> >> current API. Shall we take this inconsistency point as
> >> yet-another-reason to migrate to Dynamic Sharding? While the new
> >> feature matures, I suspect it could completely replace the static one.
> >
> > I think so as well, in which case we need to make sure that we get the API right. A
> > new/updated initialise contract might be exactly what we need instead of yet another patch.
> > See also my email regarding ShardIdentifierProvider.
> >
> >>> - We introduce short names for the provided sharding strategies - 'none', 'id-hash', 'dynamic'. This will avoid the need to reference concrete implementation classes
> >>
> >> -1 : as I reminded above, I don't expect id-hash to be of practical
> >> use, people want to plug their own strategies which implies we need
> >> the concrete implementation classes. I'd rather see the
> >> IdHashShardingStrategy as a concrete example we're providing (not just
> >> an example, I guess someone might find it useful in production, I just
> >> think it's a minority of the IndexShardingStrategy users).
> >
> > IdHashShardingStrategy is in use right now, at least if you enable sharding without any other specific
> > IndexShardingStrategy implementation. Providing a short name of it is inline with configuration options
> > like 'ram' or 'filesystem' for directory provides. We could use 'default' to hide the fact which impl
> > we are using. This way we could even replace the impl in case we find a better one.
> >
> >>> - For dynamic sharding we have the additional sub-property 'shard_identity_provider' which specifies the ShardIdentifierProvider (new contract needed for dynamic sharding).
> >>>  This property is only relevant for dynamic sharding and will be handled in the same way as 'nbr_of_shards'
> >>
> >> To recap today we have
> >>    hibernate.search.[indexName].sharding_strategy = [implementation
> >> of IndexShardingStrategy]
> >>
> >> Would it not be nice if I could either specify an implementation of
> >> IndexShardingStrategy or a ShardIdentifierProvider ?
> >>    hibernate.search.[indexName].sharding_strategy = [implementation
> >> of IndexShardingStrategy | ShardIdentifierProvider]
> >
> > hmm, I have not thought about it this way. So far I was more thinking along the
> > lines of removing ShardIdentifierProvider. But you are proposing to keep it
> > and maybe in the long run remove IndexShardingStrategy?
> >
> >> -  being the same property keeps it clear that you can either specify
> >> one OR the other.
> >
> > Except that we are talking about to different interfaces. Hardly good practice to offer this type of
> > confguration.
> >
> > --Hardy
> _______________________________________________
> hibernate-dev mailing list
> hibernate-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hibernate-dev


More information about the hibernate-dev mailing list