[Search] Dynamic sharding configuration


Hardy Ferentschik

Friday, 20 September 2013

5:37 a.m.

Hi,

I am currently working on HSEARCH-472 [1] - dynamic sharding. The work is built on Emmanuel's prototype and you can find the current code on my fork [2]. Right now I am wondering how to configure (dynamic) sharding.

Here is how things worked prior to dynamic sharding. Basically there are two properties driving the shard configuration:

- hibernate.search.[indexName].sharding_strategy
- hibernate.search.[indexName].sharding_strategy.nbr_of_shards

The first property determines the implementation class of IndexShardingStrategy and the second the number of shards to create. So far we had two implementations of IndexShardingStrategy, namely NotShardedStrategy and IdHashShardingStrategy.

To configure sharding it was enough to set nbr_of_shards to a value > 1. This would automatically select IdHashShardingStrategy and shard depending on the configured number of shards. The idea was to make it simple for the user and only require a single configuration change to enable sharding. However, it creates inconsistencies. For example, what if I select NotShardedStrategy and nbr_of_shards > 1? Or I set a custom sharding strategy which does not care about the number of shards? IMO the important factor is to set the right sharding strategy, and nbr_of_shards should just be an (optional) parameter to the sharding strategy.

With dynamic sharding things get more complicated. Right now you configure dynamic sharding by setting 'nbr_of_shards' to the literal 'dynamic'. This selects under the hood the right IndexShardingStrategy (DynamicShardingStrategy). I find it misleading on multiple levels. First, 'dynamic' is not a number, and secondly I want to configure a strategy, not the number of shards. It is also inconsistent with how we select/configure other pluggable components in Search. For that reason I suggest:

- The type of sharding is configured via setting hibernate.search.[indexName].sharding_strategy. 'nbr_of_shards' is a parameter which gets passed to the strategy and which might get ignored depending on the sharding implementation. Implementations are free to check the property and e.g. print out a warning if the setting does not apply to them
- We introduce short names for the provided sharding strategies - 'none', 'id-hash', 'dynamic'. This will avoid the need to reference concrete implementation classes
- For dynamic sharding we have the additional sub-property 'shard_identity_provider' which specifies the ShardIdentifierProvider (new contract needed for dynamic sharding). This property is only relevant for dynamic sharding and will be handled in the same way as 'nbr_of_shards'

Thoughts?

--Hardy

[1] https://hibernate.atlassian.net/browse/HSEARCH-472
[2] https://github.com/hferentschik/hibernate-search/compare/HSEARCH-472
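To make the short-name suggestion concrete, here is a minimal sketch of how a configured sharding_strategy value could be resolved. The class ShardingStrategyNames and its resolve method are hypothetical names invented for this illustration, the strategies are referred to by simple class name only, and none of this is taken from the Hibernate Search code base.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardingStrategyNames {

    // Short names as proposed in the mail. The lookup table itself is a
    // hypothetical illustration, not Hibernate Search code.
    private static final Map<String, String> SHORT_NAMES = new HashMap<>();
    static {
        SHORT_NAMES.put("none", "NotShardedStrategy");
        SHORT_NAMES.put("id-hash", "IdHashShardingStrategy");
        SHORT_NAMES.put("dynamic", "DynamicShardingStrategy");
    }

    /** Resolves a configured sharding_strategy value to an implementation name. */
    public static String resolve(String configuredValue) {
        if (configuredValue == null || configuredValue.trim().isEmpty()) {
            // Nothing configured: assume no sharding.
            return SHORT_NAMES.get("none");
        }
        String value = configuredValue.trim();
        String known = SHORT_NAMES.get(value);
        // Anything that is not a short name is treated as a custom class name.
        return known != null ? known : value;
    }
}
```

With such a lookup, a custom strategy is still configured by its class name, so short names and pluggable implementations would not exclude each other.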


Sanne Grinovero

Monday, 23 September

6:55 a.m.

On 20 September 2013 11:37, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

As a reminder to other readers: those are the two implementations included in Hibernate Search but the general expectation is that you plug in your own IndexShardingStrategy to make most of it.

...

To configure sharding it was enough to set nbr_of_shards to a value > 1. This would automatically select IdHashShardingStrategy and shard depending on the configured number of shards. The idea was to make it simple to for the user and only require a single configuration change to enable sharding. However, it creates inconsistencies. For example what if I select NotShardedStrategy and nbr_of_shards >1?

+1, should throw an exception. Currently it will initialize multiple shards but just use the first one; not too bad.

...

Or I set a custom sharding strategy which does not care about the number of shards?

I think that's far fetched. The NBR_OF_SHARDS option defines the size of the array of indexes passed to the IndexShardingStrategy so it's hard to ignore. Sure it's possible, we could throw a log.userIsAnIdiotException() but someone might not see the humor :-). Worst case it degenerates in a case similar to your example of "NotShardedStrategy and nbr_of_shards >1", or the user would notice with an ArrayOutOfBounds.. usercode problem.
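The point that the number of shards is hard to ignore can be illustrated with a small sketch: a static strategy has to map every id onto one of the pre-built indexes, so it needs the size of that array. This is a generic hash-modulo example under my own assumptions; IdHashSketch and shardIndex are hypothetical names, and the actual hashing used by IdHashShardingStrategy may well differ.

```java
public class IdHashSketch {

    /**
     * Maps a string id onto one of nbrOfShards slots.
     * The result is always in the range [0, nbrOfShards).
     */
    public static int shardIndex(String idInString, int nbrOfShards) {
        if (nbrOfShards < 1) {
            throw new IllegalArgumentException("nbrOfShards must be >= 1");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(idInString.hashCode(), nbrOfShards);
    }
}
```

A strategy written against a different count than the configured one would index outside the array it was given, which is the ArrayOutOfBounds case mentioned above.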

...

IMO the important factor is to set the right sharding strategy and nbr_of_shards should just be a (optional) parameter to the sharding strategy.

Note that so far we don't expect users to explicitly set the NotShardedStrategy: it's simply a consequence of not having set any option; if the user sets only the number of shards but omits picking a specific strategy, we automatically assume he's going for the IdHashShardingStrategy. As soon as a different IndexShardingStrategy is chosen, then I think it's quite self-explanatory that setting NBR_OF_SHARDS is quite useful: the user will have coded an explicit IndexShardingStrategy and consequentially have a clear idea of how many shards he wants, at least for the static sharding so far.
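The implicit selection rules described in this thread can be written down as a short decision function. This is only a sketch of the behaviour as described in the mails, not the actual Hibernate Search code; LegacyShardingSelection and select are hypothetical names, the property keys are shown relative to the hibernate.search.[indexName]. prefix, and strategies are named by simple class name.

```java
import java.util.Properties;

public class LegacyShardingSelection {

    // Keys relative to "hibernate.search.[indexName]." (prefix omitted here).
    static final String STRATEGY = "sharding_strategy";
    static final String NBR_OF_SHARDS = "sharding_strategy.nbr_of_shards";

    /** Returns the name of the strategy implied by the given index properties. */
    public static String select(Properties indexProps) {
        String strategy = indexProps.getProperty(STRATEGY);
        String nbr = indexProps.getProperty(NBR_OF_SHARDS);
        if (strategy != null) {
            return strategy.trim(); // an explicit choice always wins
        }
        if (nbr == null) {
            return "NotShardedStrategy"; // nothing set: no sharding
        }
        if ("dynamic".equals(nbr.trim())) {
            return "DynamicShardingStrategy"; // the literal used by the prototype
        }
        // Only the number of shards is set: id-hash sharding is assumed.
        return Integer.parseInt(nbr.trim()) > 1
                ? "IdHashShardingStrategy"
                : "NotShardedStrategy";
    }
}
```

Seen this way, the strategy is derived from a property that is nominally a count, which is the inconsistency the original mail objects to.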

...

With dynamic sharding things get more complicated. Right now you configure dynamic sharding by setting 'nbr_of_shards' to the literal 'dynamic'. This selects under the hood the right IndexShardingStrategy (DynamicShardingStrategy). I find it misleading on multiple levels. First 'dynamic' is not a number and secondly I want to configure a strategy not the number of shards. It is also inconsistent with how we select/configure other pluggable components in Search. For that reason I suggest: - The type of sharding is configured via setting hibernate.search.[indexName].sharding_strategy. 'nbr_of_shards' is a parameter which gets passed to the strategy and which might get ignored depending on the sharding implementation. Implementations are free to check the property and e.g. print out a warning if the settings does not apply to them

Conceptually it sounds nice. I see two downsides:

- it pushes complexity to the IndexShardingStrategy implementor (the user) as he needs to parse it and somehow he needs to request those indexes from the SearchFactory to be built. Pushing both these responsibilities to the end user in exchange for a one-liner in the configuration file seems like an odd choice? I would agree if it was us to write the code, but I really expect most people to plug their own strategy as IdHashShardingStrategy isn't very useful in a real world app.
- today we pre-initialize the indexes (IndexManagers) before they are passed to the IndexShardingStrategy#initialize method. We would need to pass instead some lifecycle-controlling objects which allow the user to trigger index initialization. Again I essentially agree but that sounds much like dynamic sharding?

I don't think we can change these in the scope of 4.4 as it affects current API. Shall we take this inconsistency point as yet-another-reason to migrate to Dynamic Sharding? While the new feature matures, I suspect it could completely replace the static one. Let's see gradually?

...

- We introduce short names for the provided sharding strategies - 'none', 'id-hash', 'dynamic'. This will avoid the need to reference concrete implementation classes

-1 : as I reminded above, I don't expect id-hash to be of practical use, people want to plug their own strategies which implies we need the concrete implementation classes. I'd rather see the IdHashShardingStrategy as a concrete example we're providing (not just an example, I guess someone might find it useful in production, I just think it's a minority of the IndexShardingStrategy users).

...

- For dynamic sharding we have the additional sub-property 'shard_identity_provider' which specifies the ShardIdentifierProvider (new contract needed for dynamic sharding). This property is only relevant for dynamic sharding and will be handled in the same way as 'nbr_of_shards'

To recap today we have

hibernate.search.[indexName].sharding_strategy = [implementation of IndexShardingStrategy]

Would it not be nice if I could either specify an implementation of IndexShardingStrategy or a ShardIdentifierProvider?

hibernate.search.[indexName].sharding_strategy = [implementation of IndexShardingStrategy | ShardIdentifierProvider]

In case a ShardIdentifierProvider is passed, it's obviously dynamic. In this case specifying a property for .nbr_of_shards would be ignored with a warning.

The good thing of using this same property is that we can
- gradually migrate from the static sharding without changing the props
- being the same property keeps it clear that you can either specify one OR the other.

-- Sanne
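A rough sketch of how a single property could accept either contract: the configured class is inspected and the sharding mode follows from the interface it implements. The two interfaces below are empty stand-ins for the contracts named in this thread, and ShardingPropertyKind, Kind, classify, ByLanguage and ByIdHash are hypothetical names used only for this illustration.

```java
public class ShardingPropertyKind {

    // Empty stand-ins for the two contracts discussed in the thread.
    public interface IndexShardingStrategy {}
    public interface ShardIdentifierProvider {}

    public enum Kind { STATIC, DYNAMIC }

    /** Decides the sharding mode from the type of the configured class. */
    public static Kind classify(Class<?> configured) {
        if (ShardIdentifierProvider.class.isAssignableFrom(configured)) {
            return Kind.DYNAMIC; // nbr_of_shards would be ignored with a warning
        }
        if (IndexShardingStrategy.class.isAssignableFrom(configured)) {
            return Kind.STATIC;
        }
        throw new IllegalArgumentException(
                configured.getName() + " implements neither sharding contract");
    }

    // Example user classes, one per contract.
    public static class ByLanguage implements ShardIdentifierProvider {}
    public static class ByIdHash implements IndexShardingStrategy {}
}
```

The sketch leaves open what should happen when one class implements both interfaces; here the dynamic contract simply takes precedence.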

...

Thoughts? --Hardy [1] https://hibernate.atlassian.net/browse/HSEARCH-472 [2] https://github.com/hferentschik/hibernate-search/compare/HSEARCH-472

Hardy Ferentschik

7:58 a.m.

On 23 Jan 2013, at 1:55 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

> Or I set a custom sharding strategy which does not care about the number of shards?

I think that's far fetched. The NBR_OF_SHARDS option defines the size of the array of indexes passed to the IndexShardingStrategy so it's hard to ignore. Sure it's possible, we could throw a log.userIsAnIdiotException() but someone might not see the humor :-). Worst case it degenerates in a case similar to your example of "NotShardedStrategy and nbr_of_shards >1", or the user would notice with an ArrayOutOfBounds.. usercode problem.

I think I am thinking about a custom dynamic sharding strategy in this case.

...

> IMO the important factor is to set the right sharding strategy and nbr_of_shards should just be a (optional) parameter to the sharding strategy.

Note that so far we don't expect users to explicitly set the NotShardedStrategy: it's simply a consequence of not having set any option; if the user sets only the number of shards but omits picking a specific strategy, we automatically assume he's going for the IdHashShardingStrategy

I get that, but as I said, I see it as a bit of misguided "ease of use", since it can introduce problems (see above) and is not consistent.

...

As soon as a different IndexShardingStrategy is chosen, then I think it's quite self-explanatory that setting NBR_OF_SHARDS is quite useful: the user will have coded an explicit IndexShardingStrategy and consequentially have a clear idea of how many shards he wants, at least for the static sharding so far.

Right, for static sharding maybe, but we are going to make this dynamic now.

...

> With dynamic sharding things get more complicated. Right now you configure dynamic sharding by setting 'nbr_of_shards' to the literal 'dynamic'. This selects under the hood the right IndexShardingStrategy (DynamicShardingStrategy). I find it misleading on multiple levels. First 'dynamic' is not a number and secondly I want to configure a strategy not the number of shards. It is also inconsistent with how we select/configure other pluggable components in Search. For that reason I suggest:
>
> - The type of sharding is configured via setting hibernate.search.[indexName].sharding_strategy. 'nbr_of_shards' is a parameter which gets passed to the strategy and which might get ignored depending on the sharding implementation. Implementations are free to check the property and e.g. print out a warning if the settings does not apply to them

Conceptually it sounds nice. I see two downsides: - it pushes complexity to the IndexShardingStrategy implementor (the user) as he needs to parse it and somehow he needs to request those indexes from the SearchFactory to be built. Pushing both these responsibilities to the end user in exchange for a one-liner in the configuration file seems like an odd choice? I would agree if it was us to write the code, but I really expect most people to plug their own strategy as IdHashShardingStrategy isn't very useful in a real world app.

I think that if the API to create an index manager is simple it basically unifies static and dynamic sharding.

...

- today we pre-initialize the indexes (IndexManagers) before they are passed to the IndexShardingStrategy # initialize method. We would need to pass instead some lifecycle-controlling objects which allows the user to trigger index initialization. Again I essentially agree but that sounds much like dynamic sharding?

With the right API both would be possible in a simple to understand way.

...

I don't think we can change these in the scope of 4.4 as it affects current API. Shall we take this inconsistency point as yet-another-reason to migrate to Dynamic Sharding? While the new feature matures, I suspect it could completely replace the static one.

I think so as well, in which case we need to make sure that we get the API right. A new/updated initialise contract might be exactly what we need instead of yet another patch. See also my email regarding ShardIdentifierProvider.

...

> - We introduce short names for the provided sharding strategies - 'none', 'id-hash', 'dynamic'. This will avoid the need to reference concrete implementation classes

-1 : as I reminded above, I don't expect id-hash to be of practical use, people want to plug their own strategies which implies we need the concrete implementation classes. I'd rather see the IdHashShardingStrategy as a concrete example we're providing (not just an example, I guess someone might find it useful in production, I just think it's a minority of the IndexShardingStrategy users).

IdHashShardingStrategy is in use right now, at least if you enable sharding without any other specific IndexShardingStrategy implementation. Providing a short name for it is in line with configuration options like 'ram' or 'filesystem' for directory providers. We could use 'default' to hide which impl we are using. This way we could even replace the impl in case we find a better one.

...

> - For dynamic sharding we have the additional sub-property 'shard_identity_provider' which specifies the ShardIdentifierProvider (new contract needed for dynamic sharding). This property is only relevant for dynamic sharding and will be handled in the same way as 'nbr_of_shards'

To recap today we have

hibernate.search.[indexName].sharding_strategy = [implementation of IndexShardingStrategy]

Would it not be nice if I could either specify an implementation of IndexShardingStrategy or a ShardIdentifierProvider?

hibernate.search.[indexName].sharding_strategy = [implementation of IndexShardingStrategy | ShardIdentifierProvider]

hmm, I have not thought about it this way. So far I was more thinking along the lines of removing ShardIdentifierProvider. But you are proposing to keep it and maybe in the long run remove IndexShardingStrategy?

...

- being the same property keeps it clear that you can either specify one OR the other.

Except that we are talking about two different interfaces. Hardly good practice to offer this type of configuration. --Hardy

Sanne Grinovero

5:04 p.m.

...


Hardy Ferentschik

Tuesday, 24 September

3:51 a.m.

On 24 Jan 2013, at 12:04 AM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

Correct me if I'm wrong, but trying to synthesize this discussion I think that we're fundamentally agreeing that dynamic sharding is a "better replacement" for static sharding.

It has the potential for a replacement I think, but I don't think we have found the best solution yet. Also I want to highlight "replacement". If we go for a ShardIdentifierProvider interface of sorts we should definitely deprecate IndexShardingStrategy.

...

Still, let's keep in mind that this needs to be a backwards compatible patch, so we're not looking for something disruptive of the current static sharding feature.

We should for sure try to keep API's stable. On the other hand I don't see why we should not be able to change SPI contracts. With this super restrictive behaviour we are seriously limiting our ability to move the software forward.

...

If we end up agreeing that the better API needs disruption, we should still make an incremental change available in this version - as experimental - so that people can play with it, then we're free to reset the API as we wish in 5.0 but at least including the newly acquired experience.

I keep hearing 5.0 as some sort of catch-all release. I think this thinking is very dangerous. Changing more than necessary is not good for the user, and attempting to fix everything in one release will defer a potential 5.0 more and more, because there will be so many things we want to address. 5.0 should be about switching to Lucene 4. Once that is done we should release.

...

We then had some brainstorming on IRC which concluded that it would be probably more user friendly to have the __Strategy to: a) not return arrays but simple collections

I am proposing sets. They also have a semantic meaning.

...

b) avoid the confusion on the two too similar methods on the proposal for ShardIdentifierProvider (next paragraph)

...

c) not apply the sharding logic based on the Document (fields) but on the entity (the user type triggering the index event)

+1 I think there is a big potential in this idea.

...

#a Ok we can consider that but let's see how the code turns out. Iterable<String> perhaps?

Why not Set<String>?

...

#b we explored converging the two methods into the essential one: String getShardIdentifier(Class<?> entity, Serializable id, String idInString);

The single method could also look like this: Set<String> getShardIdentifiers(Class<?> entity, Serializable id, String idInString, Document document); where document could potentially be null.

...

but that seems very poor in terms of flexibility, it doesn't even allow access to the basic properties of the indexed entity. So that degenerated in the proposal #c,

Right, something like

Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document);

would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go

String getShardIdentifier( Object entity, Serializable id, String idInString, Document document);

If I understand correctly, the reason for returning a set of shard identifiers in the case of deletion of documents is that the user might not be able to determine the exact shard in which the document to delete is (since he does not have a Lucene document). If we pass the entity itself we should be fine for addition and deletion, provided of course that we have the entity at deletion as well. However, that might not be the case (see below).

...

but before moving to #c I'd venture that these methods aren't that bad, they just need good documentation.

Is this what we are aiming for these days? "Not bad"? Mediocre? I want to be able to say that we have good interfaces and we did what we could to find the best solution possible.

...

Also, even if we move the focus from the Document to the Entity, we still don't have the fully loaded entity during a delete operation, so we would still need the second method returning multiple indexes.

Do we? Why? Would that not depend on the delete operation?

...

#c drawbacks first: = to expose the entity directly has probably some risk (the user making changes to it), but that's no different to what the user can already do from a FieldBridge / ClassBridge.

I don't see a drawback here. This is exactly what we do with class bridges. Giving the user access to the entity gives him the most flexibility. Even using plain Hibernate / JPA APIs you have to comply with several rules in order to make things work.

...

= won't be able to consider the output of FieldBridge / ClassBridge instances as you won't have the Document

We could pass the Document as well. We could call ShardIdentifierProvider just after building the document when we still have the entity itself and the document (in some cases it might be null of course).

...

= for deletion you still don't have the entity

Again, does this not depend on type of deletion?

...

We could list some conceptually interesting advantages here, but I'd like to shut down this feature for the time being because it's significantly different than the goal of providing Dynamic Sharding as a feature.

Why is this "significantly different"? Dynamic sharding is a new feature one way or another.

...

I'm not saying that it doesn't have merit: its probably worth exploring for 5+ but it rather seems like an _additional_ level of sharding that we might want to add in future as an alternative to the one focusing on the Document approach.

Again, it could be combination of both.

...

My position is that it worked well so far on static sharding, and that the proposal is quite consistent with it so wouldn't be much of a pain for people to adapt the new model.

Again, are we looking for a mediocre or a good solution?

...

Let's try polishing the method names (and use your imagination for a well written javadoc):

String getShardIdFromDocument(Class<?> entity, Serializable id, String idInString, Document document);
Iterable<String> getShardIdsFromId(Class<?> entity, Serializable id, String idInString);

or maybe to highlight what's fundamentally different:

String getShardIdFromContext(Class<?> entity, Serializable id, String idInString, Document document);
Iterable<String> getShardIdsFromReducedContext(Class<?> entity, Serializable id, String idInString);

-1 to both. Here is another proposal. We remove

String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);

altogether. Here is my reasoning. AFAIU, the method is there for the deletion of documents. In this case we have neither the Lucene document nor the entity, and we need to know in which shard the document to delete is. The assumption behind this method is that, given the type and id, I am somehow able to provide this shard or a subset of the shards. I doubt, however, that this is ever possible in practice. In the end most implementations will have to just delegate to getAllShardIdentifiers() anyway. Take the language code example or any other case where I shard depending on a given property of the entity. In this case I will never be able to make any use of #getShardIdentifiers(Class<?>, Serializable, String).

In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case for that really? In which use case can the set of targeted shards be limited based on knowing the type of filters we apply?

So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start off with a much simpler interface. We can indeed mark it as experimental and if the need arises (based on a true use case) think about optimisations. The more I think about it, the more I like this more minimalistic approach.
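As an illustration of the minimalistic approach, here is a sketch built on the language code example: writes are routed by a property of the entity, and a delete without the entity falls back to all known shards. The reduced interface, the Article entity and every name in MinimalShardingSketch are hypothetical; the parameter list is deliberately simpler than the signatures discussed in this thread, and the sketch is not thread-safe.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

public class MinimalShardingSketch {

    /** Hypothetical reduced contract: one method for writes, one listing all shards. */
    public interface MinimalShardIdentifierProvider {
        String getShardIdentifier(Object entity);
        Set<String> getAllShardIdentifiers();
    }

    /** Example entity carrying the property used for sharding. */
    public static class Article {
        final String language;
        public Article(String language) { this.language = language; }
    }

    public static class LanguageShards implements MinimalShardIdentifierProvider {
        private final Set<String> known = new TreeSet<>();

        @Override
        public String getShardIdentifier(Object entity) {
            String shard = ((Article) entity).language;
            known.add(shard); // a shard comes into existence when first seen
            return shard;
        }

        @Override
        public Set<String> getAllShardIdentifiers() {
            return Collections.unmodifiableSet(known);
        }
    }

    /** Without the entity a delete cannot be narrowed down, so it targets every shard. */
    public static Set<String> shardsForDelete(MinimalShardIdentifierProvider provider) {
        return provider.getAllShardIdentifiers();
    }
}
```

The id alone says nothing about the language of the deleted article, which is why the delete path in this sketch cannot do better than returning all shards.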

...

Bonus dilemma: should we stay away from String and define some "IndexIdentifier" interface ?

Interesting idea, but that for sure is a disruptive change, since it will affect many APIs, unless you only introduce IndexIdentifier for the sharding case in which case I vote with a big -1 --Hardy

Sanne Grinovero

5:54 a.m.

On 24 September 2013 09:51, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

On 24 Jan 2013, at 12:04 AM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

> Correct me if I'm wrong, but trying to synthesize this discussion I think that we're fundamentally agreeing that dynamic sharding is a "better replacement" for static sharding.

It has the potential for a replacement I think, but I don't think we have found the best solution yet. Also I want to highlight "replacement". If we go for a ShardIdentifierProvider interface of sorts we should definitely deprecate IndexShardingStrategy.

> Still, let's keep in mind that this needs to be a backwards compatible patch, so we're not looking for something disruptive of the current static sharding feature.

We should for sure try to keep API's stable. On the other hand I don't see why we should not be able to change SPI contracts. With this super restrictive behaviour we are seriously limiting our ability to move the software forward.

It's a consequence of having decided that this version is going to be very conservative. It won't be always like that, but the more we waste time fighting this decision the more we delay 5+ where you can do these things.

...

> If we end up agreeing that the better API needs disruption, we should still make an incremental change available in this version - as experimental - so that people can play with it, then we're free to reset the API as we wish in 5.0 but at least including the newly acquired experience.

I keep hearing 5.0 as a some sort of catch all release. I think this thinking is very dangerous. Changing more than necessary is not good for the user and trying to attempt to fix all in one release will defer a potential 5.0 more and more, because there will be so many things we want to address. 5.0 should be about switching to Lucene 4. Once that is done we should release.

I hear you. Just saying that - on top of Lucene 4 - we will be able to re-discuss API/SPI. It's not mandatory we do that :-)

...

> We then had some brainstorming on IRC which concluded that it would be probably more user friendly to have the __Strategy to:
> a) not return arrays but simple collections

I am proposing sets. They also have a semantic meaning.

Looks good. Devil's advocate: we don't want to allow multiple additions on the same index, do we?

...

> b) avoid the confusion on the two too similar methods on the proposal for ShardIdentifierProvider (next paragraph)

+1

> c) not apply the sharding logic based on the Document (fields) but on the entity (the user type triggering the index event)

+1 I think there is a big potential in this idea.

> #a
> Ok we can consider that but let's see how the code turns out. Iterable<String> perhaps?

Why not Set<String>?

...

> #b > we explored converging the two methods into the essential one: > > String getShardIdentifier(Class<?> entity, Serializable id, String > idInString); The single method could also look like this: Set<String> getShardIdentifiers(Class<?> entity, Serializable id, String idInString, Document document); where document could potentially be null.

Having a parameter as "potentially null" is imo horrible. Especially since we expect a Set with a single element only when the Document is not null. Sorry I wish we had a better idea, but given the alternatives I'm much preferring the status quo version. I don't think it's that bad, especially as I think we all agree it's better to code two short methods than a single more complex one.

...

> but that seems very poor in terms of flexibility, it doesn't even > allow access to the basic properties of the indexed entity. So that > degenerated in the proposal #c, Right, something like Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document); would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go

Ah, interesting: you're thinking of a strategy which has access to entity && document. I initially had thought only of one class of sharding strategies working on document only, plus one working on entity only. Your proposal is likely the most flexible one user-wise, but we need to see if we can find an appropriate point in the transformation chain for this to be plugged in; my only concern would be that it limits our internal design options significantly in the longer term. +1 to consider it in future. Will you open an issue for this?

...

String getShardIdentifier( Object entity, Serializable id, String idInString, Document document); If I understand correctly, the reason for returning a set of shard identifiers for the case of deletion of documents is that in this case the user might not be able to determine the exact shard in which the document to delete is (since he does not have a Lucene document). If we pass the entity itself we should be fine for addition and deletion. Provided of course that we have the entity at deletion as well. However, that might not be the case (see below).

No we don't have the entity at deletion, you would have it only by forcing ORM to load entities which are deleted by id (including collections), which is not nice in terms of database roundtrips.

...

> but before moving to #c I'd venture > that these methods aren't that bad, they just need good documentation. Is this what we are aiming for these days? "Not bad"? Mediocre? I want to be able to say that we have good interfaces and we did what we could to find the best solution possible.

Since we don't have better options I indeed think it's the best solution possible. There are technical limitations for sure. [had not read mail until the bottom yet]

...

> Also, even if we move the focus from the Document to the Entity, we > still don't have the fully loaded entity during a delete operation, so > we would still need the second method returning multiple indexes. Do we? Why? Would that not depend on the delete operation?

You often don't have the fully loaded entity during a delete. It's potentially a proxy, having only the Id available. You could have a whole collection of these, and you might also be processing them in a phase where it's not allowed to trigger lazy loading... even disregarding performance concerns I'd rather not enter this arena, not in this version at least.

...

> #c > drawbacks first: > = to expose the entity directly has probably some risk (the user > making changes to it), but that's no different to what the user can > already do from a FieldBridge / ClassBridge. I don't see a drawback here. This is exactly what we do with class bridges. Giving the user access to the entity gives him the most flexibility. Even using plain Hibernate / JPA APIs you have to comply with several rules in order to make things work.

Ok, right. Being overzealous listing all drawbacks :)

...

> = won't be able to consider the output of FieldBridge / ClassBridge > instances as you won't have the Document We could pass the Document as well. We could call ShardIdentifierProvider just after building the document when we still have the entity itself and the document (in some cases it might be null of course).

Yes that sounds interesting, but : = how long will it take you to propose such a patch? = is it worth your time, considering that people have never complained on the limitations of the static sharding SPI (other than being static) ?

...

> = for deletion you still don't have the entity Again, does this not depend on type of deletion? > We could list some conceptually interesting advantages here, but I'd > like to shut down this feature for the time being because it's > significantly different than the goal of providing Dynamic Sharding as > a feature. Why is this "significantly different"? Dynamic sharding is a new feature one way or another.

You can try building it, but as above I'm concerned that this refactoring is more complex than you expect and not particularly useful for the end user. All we're asked is to make the static sharding service support "dynamic shards". An additional requirement is that static sharding is working very well today, so we're not going to break it.

...

> I'm not saying that it doesn't have merit: its probably > worth exploring for 5+ but it rather seems like an _additional_ level > of sharding that we might want to add in future as an alternative to > the one focusing on the Document approach. Again, it could be combination of both.

...

> My position is that it worked well so far on static sharding, and that > the proposal is quite consistent with it so wouldn't be much of a pain > for people to adapt the new model. Again, are we looking for a mediocre or a good solution? > Let's try polishing the method > names (and use your imagination for a well written javadoc): > > String getShardIdFromDocument(Class<?> entity, Serializable id, > String idInString, Document document); > > Iterable<String> getShardIdsFromId(Class<?> entity, Serializable > id, String idInString); > > or maybe to highlight what's fundamentally different: > > String getShardIdFromContext(Class<?> entity, Serializable id, > String idInString, Document document); > > Iterable<String> getShardIdsFromReducedContext(Class<?> entity, > Serializable id, String idInString); -1 to both. Here is another proposal. We remove String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString); all together. Here is my reasoning. AFAIU, the method is there for the deletion of documents. In this case we don't have the Lucene document nor the entity and we need to know in which shard the document to delete is. The assumptions behind this method is that somehow given the type and id I am able to provide this shard or a subset of the shards. I doubt, however, that this is practically ever possible. In the end most implementations will have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or any other case where I shard depending on a given property of the entity. In this case I will never be able to make any use of #getShardIdentifiers(Class<?> , Serializable , String)

Very good point. It's sad to admit that - considering sharding as a scalability tool - it doesn't scale on delete throughput. (I'm stressing "throughput" as technically people using sharding do it primarily to keep index size reasonable) Loving the idea, let's remove that method from the dynamic sharding options.
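The language-code case can be sketched roughly as follows. This is only an illustration of the argument: `LanguageShardProvider` is a hypothetical class, and a `Map<String, String>` stands in for the Lucene `Document` so that the example is self-contained. None of it is actual Hibernate Search code.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical provider that shards by a "language" field (not Hibernate Search code).
// A Map<String, String> stands in for the Lucene Document.
class LanguageShardProvider {
    private final Set<String> knownShards = ConcurrentHashMap.newKeySet();

    // Addition: the document carries the language, so exactly one shard can be named.
    String getShardIdentifier(Class<?> entity, Serializable id, String idInString,
                              Map<String, String> document) {
        String shard = Objects.requireNonNull(document.get("language"), "language field missing");
        knownShards.add(shard);
        return shard;
    }

    // Deletion: only type and id are available and the language is unknown,
    // so the only safe answer is every shard seen so far.
    Set<String> getShardIdentifiers(Class<?> entity, Serializable id, String idInString) {
        return getAllShardIdentifiers();
    }

    Set<String> getAllShardIdentifiers() {
        return Collections.unmodifiableSet(knownShards);
    }
}
```

With type and id as the only inputs, the deletion method has nothing to narrow on, which is why it ends up delegating to getAllShardIdentifiers().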

...

In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case for that really? In which use case can the set of targeted shards be limited based on knowing the type of filters we apply?

This is a very useful feature: http://docs.jboss.org/hibernate/search/4.4/reference/en-US/html_single/#q...
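One way a query-time filter could narrow the set of shards, sketched under assumptions: `EnabledFilter` is a hypothetical stand-in for whatever filter type the SPI passes in, and `CustomerShardProvider`, the filter name "customer" and the parameter "customerId" are made up for the illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an enabled full-text filter: a name plus its parameters.
final class EnabledFilter {
    final String name;
    final Map<String, String> parameters;

    EnabledFilter(String name, Map<String, String> parameters) {
        this.name = name;
        this.parameters = parameters;
    }
}

// Hypothetical provider with one shard per customer id.
class CustomerShardProvider {
    private final Set<String> allShards;

    CustomerShardProvider(Set<String> shards) {
        this.allShards = Collections.unmodifiableSet(new TreeSet<>(shards));
    }

    // If a "customer" filter is enabled, only that customer's shard needs searching;
    // otherwise fall back to all shards.
    Set<String> getShardIdentifiersForQuery(EnabledFilter[] filters) {
        for (EnabledFilter filter : filters) {
            if ("customer".equals(filter.name)) {
                String customerId = filter.parameters.get("customerId");
                if (customerId != null && allShards.contains(customerId)) {
                    return Collections.singleton(customerId);
                }
            }
        }
        return allShards;
    }
}
```

Unlike the deletion case, the query does carry the information needed to pick a shard, because the filter parameters are known when the query runs.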

...

So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start off with a much simpler interface? We can indeed mark it as experimental and, if the need arises (based on a true use case), think about optimisations.

+1/2 : let's remove the one for deletions, which I think is the main pain point, but keep the one for queries as it is actually important.

...

The more I think about it, the more I like this more minimalistic approach.

...

> Bonus dilemma: should we stay away from String and define some > "IndexIdentifier" interface ? Interesting idea, but that for sure is a disruptive change, since it will affect many APIs, unless you only introduce IndexIdentifier for the sharding case in which case I vote with a big -1

Agreed.

...

--Hardy _______________________________________________ hibernate-dev mailing list hibernate-dev(a)lists.jboss.org https://lists.jboss.org/mailman/listinfo/hibernate-dev

Hardy Ferentschik

7:12 a.m.

On 24 Jan 2013, at 12:54 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

> We should for sure try to keep API's stable. On the other hand I don't see why we should not be > able to change SPI contracts. With this super restrictive behaviour we are seriously limiting > our ability to move the software forward. It's a consequence of having decided that this version is going to be very conservative. It won't be always like that, but the more we waste time fighting this decision the more we delay 5+ where you can do these things.

I don't see why we should treat this release differently from others regarding policies on what we do and don't do between versions. AFAIU an SPI change between 4.3 and 4.4 is within our policies.

...

>> We then had some brainstorming on IRC which concluded that it would be >> probably more user friendly to have the __Strategy to: >> a) not return arrays but simple collections > > I am proposing sets. They also have a semantic meaning. Looks good. Devil's advocate: we don't want to allow multiple additions on the same index don't we?

What would that even mean? So instead of a set you are saying it would make sense to have a list (or array) in which a given shard appears more than once? I don't see a use for this?

...

>> but that seems very poor in terms of flexibility, it doesn't even >> allow access to the basic properties of the indexed entity. So that >> degenerated in the proposal #c, > > Right, something like > > Set<String> getShardIdentifiers( Object entity, Serializable id, String idInString, Document document); > > would be better (w/ or w/o Document!?). If we have access to the entity we maybe can even go ah interesting you're thinking of a strategy whish has access to entity && document . I initially had thought only of one class of sharding-strategies working on document only, + one working on entity only.

No, I thought indeed that we could make both available.

...

>> Also, even if we move the focus from the Document to the Entity, we >> still don't have the fully loaded entity during a delete operation, so >> we would still need the second method returning multiple indexes. > > Do we? Why? Would that not depend on the delete operation? You often don't have the fully loaded entity during a delete. It's potentially a proxy, having only the Id available. You could have a whole collection of these, and you might also be processing them in a phase where it's not allowed to trigger lazy loading... even disregarding performance concerns I'd rather not enter this arena, not in this version at least.

Ok, fair enough. That's of course a bit of a down turn. However, it also shows that no matter how we turn and twist it, for deletion we will have to target all shards (except maybe for some contrived border cases)

...

> Here is another proposal. We remove > > String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString); > > all together. Here is my reasoning. AFAIU, the method is there for the deletion of > documents. In this case we don't have the Lucene document nor the entity and we need > to know in which shard the document to delete is. The assumptions behind this method is > that somehow given the type and id I am able to provide this shard or a subset of the shards. > I doubt, however, that this is practically ever possible. In the end most implementations will > have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or > any other case where I shard depending on a given property of the entity. In this case I will > never be able to make any use of #getShardIdentifiers(Class<?> , Serializable , String) Very good point. It's sad to admit that - considering sharding as a scalability tool - it doesn't scale on delete throughput. (I'm stressing "throughput" as technically people using sharding do it primarily to keep index size reasonable) Loving the idea, let's remove that method from the dynamic sharding options.

:-) Now we are getting somewhere.

...

> In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case > for that really? In which use case can the set of targeted shards be limited based on knowing the > type of filers we apply? This is a very usefull feature: http://docs.jboss.org/hibernate/search/4.4/reference/en-US/html_single/#q...

...

> So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start of with a much > simpler interface. We can indeed mark it as experimental and if the need arises (based on a true use case) > think about optimisations. +1/2 : let's remove the one for deletions, which I think is the main pain point, but keep the one for queries as it is actually important.

ok, but it would return a Set<String> right? Also should we still rename getShardIdentifier to getShardIdentifierForAddition?

Let's sum up the things which are needed for finishing dynamic sharding:

1) deprecate IndexShardingStrategy
2) remove 'String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString)' from ShardIdentifierProvider
3) change return types for getShardIdentifiersForQuery and getAllShardIdentifiers to Set<String>
4) remove the sub-property 'shard_identity_provider'. ShardIdentifierProvider is configured using the existing 'hibernate.search.sharding_strategy'; we will need to make the appropriate type checks to instantiate the right things (is this really a good idea?)
5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition'
6) reflect all this in the docs

Agreed?

--Hardy
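Taken together, points 2), 3) and 5) would leave an interface roughly like the sketch below. It is not the real SPI: the name `ShardIdentifierProviderSketch` is hypothetical, a `Map<String, String>` stands in for the Lucene `Document`, and `Object[]` stands in for the query filter type, which the list above does not spell out.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Sketch of the provider contract after points 2), 3) and 5); not the real SPI.
// Map<String, String> stands in for the Lucene Document, Object[] for the query filters.
interface ShardIdentifierProviderSketch {
    // Point 5): the proposed name; exactly one shard per added document.
    String getShardIdentifierForAddition(Class<?> entity, Serializable id, String idInString,
                                         Map<String, String> document);

    // Point 3): returns Set<String> instead of an array.
    Set<String> getShardIdentifiersForQuery(Object[] filters);

    // Point 3): with point 2) applied, this is also where every deletion is routed.
    Set<String> getAllShardIdentifiers();
}

// Trivial single-shard implementation, only to show that the contract compiles and runs.
class SingleShardProvider implements ShardIdentifierProviderSketch {
    private static final Set<String> ONLY = Collections.singleton("0");

    @Override
    public String getShardIdentifierForAddition(Class<?> entity, Serializable id, String idInString,
                                                Map<String, String> document) {
        return "0";
    }

    @Override
    public Set<String> getShardIdentifiersForQuery(Object[] filters) {
        return ONLY;
    }

    @Override
    public Set<String> getAllShardIdentifiers() {
        return ONLY;
    }
}
```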

Sanne Grinovero

7:30 a.m.

On 24 September 2013 14:12, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

> Let's sum the things up which are needed for finishing dynamic sharding > 1) deprecate IndexShardingStrategy

+1 I'm not fully sure that the static one no longer deserves to exist, especially since we just decided that the dynamic sharding will not allow fine-level control on deletions, for example. But let's deprecate it and move on; worst case we can change our mind, or at least it might generate the feedback we need.

...

2) remove 'String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString)' from ShardIdentifierProvider

+1 we're automatically assuming a deletion needs to be routed to all identifiers.

...

3) change return types for getShardIdentifiersForQuery and getAllShardIdentifiers to Set<String>

+1 I have to admit I was having bad feelings about this Set<> because of this being a hotspot and I would love to get rid of object allocations, *especially* heavy objects such as Set. The good thing about the other method removal is that it basically removes the need for dynamic construction of such a Set's contents: this one can easily be cached by an implementor.
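One way an implementor might cache the set so that the hot path allocates nothing is a copy-on-write snapshot. This is a sketch only; `ShardRegistry` and `register` are hypothetical names, not part of the proposed SPI.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: keeps an immutable snapshot of the known shard identifiers.
class ShardRegistry {
    // Replaced as a whole on change; readers never see a half-built set.
    private volatile Set<String> snapshot = Collections.emptySet();

    // Rare path: a new shard appears, so build a new immutable snapshot.
    synchronized void register(String shardId) {
        if (snapshot.contains(shardId)) {
            return; // already known, keep the existing snapshot
        }
        Set<String> copy = new HashSet<>(snapshot);
        copy.add(shardId);
        snapshot = Collections.unmodifiableSet(copy);
    }

    // Hot path: hands out the cached snapshot without allocating.
    Set<String> getAllShardIdentifiers() {
        return snapshot;
    }
}
```

The trade-off is that registering a shard copies the whole set, which should be acceptable as long as new shards appear far less often than index operations.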

...

4) remove the sub-property 'shard_identity_provider'. ShardIdentiferProvider is configured using the existing 'hibernate.search.sharding_strategy' we will need to make the appropriate type checks to instantiate the right things (is this really a good idea?)

+1 Doesn't look hard, let's try.

...

5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition'

Is that needed? I thought that by removing the conflicting method there would be no further need to clarify the method. I'd propose to keep the method name as is, we still have the javadocs asset to clarify how this is all being used for the apprehensive user.

...

6) reflect all this in the docs

...

Agreed? --Hardy

Hardy Ferentschik

7:46 a.m.

Cool, we seem to agree on almost everything now :-) On 24 Jan 2013, at 2:30 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

> 5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition' Is that needed? I thought that by removing the conflicting method there would be no further need to clarify the method. I'd propose to keep the method name as is, we still have the javadocs asset to clarify how this is all being used for the apprehensive user.

Not necessarily. Just thought it is more symmetric with getShardIdentifierForQuery. The name implies to a certain degree where the method is used. I can go either way on this one. --Hardy

Sanne Grinovero

8:16 a.m.

On 24 September 2013 13:46, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

Cool, we seem to agree on almost everything now :-)

+1 it's hard to get convergence when the thread explodes exponentially on several different subjects but it seems it was worth it. thanks for the huge energy and ideas :-)

...

On 24 Jan 2013, at 2:30 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote: >> 5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition' > > Is that needed? I thought that by removing the conflicting method > there would be no further need to clarify the method. I'd propose to > keep the method name as is, we still have the javadocs asset to > clarify how this is all being used for the apprehensive user. Not necessarily. Just thought it is more symmetric with getShardIdentifierForQuery. The name implies to a certain degree where the method is used. I can go either way on this one.

Let's keep the current name. I'm pretty sure we won't use it just for additions (updates for one), as it really is *the* method defining the relationship function Document -> shard. The name of getShardIdentifierForQuery is exceptional because it really is geared for Query, providing filtering capabilities. Cheers, Sanne

...

--Hardy

Hardy Ferentschik

8:56 a.m.

On 24 Jan 2013, at 3:16 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

On 24 September 2013 13:46, Hardy Ferentschik <hardy(a)hibernate.org> wrote: > Cool, we seem to agree on almost everything now :-) +1 it's hard to get convergence when the thread explodes exponentially on several different subjects but it seems it was worth it. thanks for the huge energy and ideas :-)

I agree. It was quite some effort, but it was worth it.

...

>>> 5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition' >> >> Is that needed? I thought that by removing the conflicting method >> there would be no further need to clarify the method. I'd propose to >> keep the method name as is, we still have the javadocs asset to >> clarify how this is all being used for the apprehensive user. > > Not necessarily. Just thought it is more symmetric with getShardIdentifierForQuery. > The name implies to a certain degree where the method is used. I can go > either way on this one. Let's keep the current name. I'm pretty sure we won't use it just for additions (updates for one), as it really is *the* method defining the relationship function Document -> shard.

...

The name of getShardIdentifierForQuery is exceptional because it really is geared for Query, providing filtering capabilities.

hmm --Hardy

Sanne Grinovero

9:33 a.m.

On 24 September 2013 14:56, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

On 24 Jan 2013, at 3:16 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote: > On 24 September 2013 13:46, Hardy Ferentschik <hardy(a)hibernate.org> wrote: >> Cool, we seem to agree on almost everything now :-) > > +1 it's hard to get convergence when the thread explodes exponentially > on several different subjects but it seems it was worth it. > thanks for the huge energy and ideas :-) I agree. It was quite some effort, but it was worth it. >>>> 5) rename 'getShardIdentifier' to 'getShardIdentifierForAddition' >>> >>> Is that needed? I thought that by removing the conflicting method >>> there would be no further need to clarify the method. I'd propose to >>> keep the method name as is, we still have the javadocs asset to >>> clarify how this is all being used for the apprehensive user. >> >> Not necessarily. Just thought it is more symmetric with getShardIdentifierForQuery. >> The name implies to a certain degree where the method is used. I can go >> either way on this one. > > Let's keep the current name. I'm pretty sure we won't use it just for > additions (updates for one), > as it really is *the* method defining the relationship function > Document -> shard. ok > The name of getShardIdentifierForQuery is exceptional because it > really is geared for Query, > providing filtering capabilities. hmm

come on, the "agreement progress bar" got stuck at 99,97% ?? :-D

...

--Hardy

Emmanuel Bernard

Wednesday, 2 October Wed, 2 Oct

8:34 a.m.

On Tue 2013-09-24 14:30, Sanne Grinovero wrote:

...

On 24 September 2013 14:12, Hardy Ferentschik <hardy(a)hibernate.org> wrote: > 2) remove 'String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString)' from ShardIdentifierProvider +1 we're automatically assuming a deletion needs to be routed to all identifiers.

Bad idea as I explained in my previous email. Plus we could already make use of that if we reuse Hibernate ORM's tenantid value.

Sanne Grinovero

Monday, 7 October Mon, 7 Oct

10:03 a.m.

On 2 October 2013 14:34, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

On Tue 2013-09-24 14:30, Sanne Grinovero wrote: > On 24 September 2013 14:12, Hardy Ferentschik <hardy(a)hibernate.org> wrote: > > 2) remove 'String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString)' from ShardIdentifierProvider > > +1 we're automatically assuming a deletion needs to be routed to all > identifiers. Bad idea as I explained in my previous email. Plus we could already make use of that if we reuse Hibernate ORM's tenantid value.

I've tried hard to find an agreement on this, but it seems we're wasting time without making progress. I'm not happy in ignoring a strong recommendation from any of you, very hard choice :-( Hardy are you going to reconnect later? Could you reply to this email of Emmanuel? I'm inclined to add the method back, so that it's the users choice to pick his battle. As mentioned below, I don't think we should take options away from them. Of course our template implementation could provide a sensible default method, so all users looking for simplicity don't need to bother too much about the extra method.

@Emmanuel the last conversation we had on the subject is below:

--Sanne

[15:05] <sannegrinovero> hardy: on the dynamic shard id concerns rised by emmanuel, I thought you where going to propose a pair of method names that would suite you?
[15:05] <sannegrinovero> I still think we need to restore the missing method.
[15:05] <hardy> I don't think so
[15:06] <sannegrinovero> emmanuel had quite a strong opinion about it, don't think it's fair to ignore that.
[15:06] <hardy> I thought more about it and I think the use case is not even imlementable
[15:06] <sannegrinovero> you have a point there.
[15:06] <hardy> well, I think he is wrong
[15:06] <hardy> take his use case
[15:06] <hardy> he wants to use some sort of customer id or ORM shard identifier
[15:07] <hardy> sounds great, but there is no such context to get it from
[15:07] <hardy> so what can you do in this case?
[15:07] <hardy> a ThreadLocal
[15:07] <hardy> and that's exactly the problem
[15:07] <hardy> if the shards id were determined at the document built time (as we want to do it ), it might be possible
[15:07] <sebersole> amazing how often "context" is problematic :)
[15:08] <hardy> he he
[15:08] <sannegrinovero> no. as far as I understood,he was planning to get a reference to the Strategy, and then invoke setters on it to "program" the thing.
[15:08] <hardy> but now the shard identifiers are "generated" when the changes are getting applied to the index
[15:09] <hardy> that's happening on a different thread
[15:09] <sannegrinovero> hardy: we're going in circles with this debate on abstract hypothesis. Emmanuel said he has a use case for it, and implemented it. that should be good enough for us?
[15:09] <hardy> no
[15:09] <hardy> I honestly would like to see the code first
[15:09] <sannegrinovero> :-) let me try a proposal
[15:10] <hardy> and how does it work in a clustered environment
[15:10] <hardy> or JMS
[15:10] <hardy> I seriously doubt we can implement this in a decent way atm
[15:10] <sannegrinovero> WDYT of this plan: we re-introduce the method, and provide the abstract base class I've made; then the deletion method has a default implementation.
[15:11] <hardy> one beauty of the new interface is, is that it is simpler
[15:11] <hardy> and imo it removes something which was conceptually not working anyways
[15:11] <sannegrinovero> then in future we can deprecate this method.
[15:11] <hardy> I don't see a point of re-introducing it unless someone can actually provide a working example
[15:12] <sannegrinovero> he made one already, he just couldn't show it, but described it.
[15:12] <sannegrinovero> it's much easier to remove a method from an SPI interface.
[15:12] <hardy> no
[15:12] <hardy> so what was his example?
[15:13] <sannegrinovero> I told you, he takes the reference to the sharding Strategy, and sets the context explicitly.
[15:13] <hardy> again, how
[15:14] <hardy> explain me how this is going to work on e.g. a JMS master
[15:14] <sannegrinovero> hardy: org.hibernate.search.engine.spi.EntityIndexBinding.getShardIdentifierProvider() gives you access to it.
[15:16] <sannegrinovero> hardy: consider that this feature is power-users only. Some will definitely want to control the deletion. Some will do it wrong, well that's not my problem.
[15:17] <sannegrinovero> hardy: but it definitely is annoying to not provide *any* way for a user to hack his way into it.
[15:17] <hardy> sannegrinovero: yes, there is EntityIndexBinding.getShardIdentifierProvider()
[15:17] <hardy> but how would you use that on a JMS master
[15:17] <hardy> you still need to know what to set
[15:17] <sannegrinovero> hardy: I don't feel it's our right to consider our users dumb, some will definitely have smarter ideas.
[15:17] <hardy> how would you determine the customer id on the JMS master
[15:18] <hardy> I don't consider them as dumb at all
[15:18] <hardy> but write me a test or example setup
[15:18] <sannegrinovero> hardy: I can attach lots of custom attributes in a JMS message. I can do it from my custom backend, or even use routing options if I have something like Camel.
[15:19] <sannegrinovero> hardy: you actually made an excellent use case with JMS :)
[15:19] <sannegrinovero> hardy: but I'm not going to code a full JMS + Camel app to show you :-D
[15:23] <hardy> sannegrinovero: and how do you create a custom message?
[15:23] <hardy> how is this all wired up?
[15:24] <sannegrinovero> hardy: that's system setup. For example, the shard id could be selected by the originating machine: the routing process of the JMS message could add this as context.
[15:24] <hardy> but that's not how it works atm
[15:24] <sannegrinovero> hardy: say I have an EAP6 instance per shard running as client, and have a single master shared across them.
[15:25] <sannegrinovero> hardy: that works today. We're not controlling how the messages are sent around in a SOA environment.
[15:25] <hardy> whatever
[15:26] <sannegrinovero> WDYM ?
[15:26] <sannegrinovero> it's a powerful use case, I'm not feeling comfortable in denying the option.
[15:26] <hardy> apparently you want it back, but you also are not providing a working example
[15:26] <hardy> it is contrieved
[15:27] <hardy> and why would you return a set of shard ids
[15:27] <hardy> as in the original proposal
[15:27] <hardy> at least it should be a single id as well
[15:30] <sannegrinovero> hardy: I never disagreed with you about the method name not being ideal. But this JMS integration point got me quite excited now on the routing options it provides.
[15:30] <sannegrinovero> I mean, the API really feels lacking without the method for deletions.
[15:30] <hardy> not at all
[15:31] <hardy> but suite yourself. I got to go anyways
[15:31] <-- hardy (~hardy@redhat/jboss/hardy) has left this server (Quit: bye bye).
[15:37] <sannegrinovero> gmorling: I'm puzzled about the dyn shard SPI. Don't like to take an action with hardy being so fiercely against it. WDYT? You know if he'll be back soon?
[15:38] <gmorling> sannegrinovero: no, unfortunately i don't know when he'll be back; would be nice though to come to a commonly agreed upon solution
[15:41] <sannegrinovero> gmorling: right, which makes it even more important to make sure that we're not dropping existing use cases: the existing one is able to pick a single delete index.
[15:42] <gmorling> yes, if SIP is intended as replacement it shouldn't offer less functionality (given the previously functionality was sound)

Hardy Ferentschik

12:26 p.m.

On 7 Oct 2013, at 5:03 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

I've tried hard to find an agreement on this, but it seems we're wasting time without making progress. I'm not happy in ignoring a strong recommendation from any of you, very hard choice :-(

In the end it is your call. I tried to give arguments for my position, but we seem to have a general disagreement on how to develop/evolve an interface. If you want to have a specific method for deletion I recommend:

    String getShardIdentifierForAddition(Class<?> entityType, Serializable id, String idAsString, Document document);
    String getShardIdentifierForDeletion(Class<?> entityType, Serializable id, String idAsString);

So I would re-add the suffixes 'ForAddition' and 'ForDeletion'. Also I'd change the return type of getShardIdentifierForDeletion. A test would be nice as well. Maybe if we see an actual example coded out we would have a better ground for discussion.

Also, what are the concerns here? Performance, because I target all shards for deletion, or security, aka a deletion is sent to a shard which potentially belongs to a different customer? What is the actual performance gain between the two different scenarios? A factor of 2, 4, 10?

...

I'm inclined to add the method back, so that it's the users choice to pick his battle.

A poor choice imo.

...

Of course our template implementation could provide a sensible default method, so all users looking for simplicity don't need to bother too much about the extra method.

So now we use the abstract template class as an excuse to have a mediocre interface. If anything this is a reason for me to dislike the template implementation even more. --Hardy
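For readers following the thread, the "template implementation with a sensible default" that Sanne refers to could look roughly like the sketch below. It is only an illustration under stated assumptions: the type names are hypothetical, the Lucene Document parameter is left out so the code has no dependencies, and this is not the code from the actual pull requests.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the SPI discussed in this thread.
// The real interface also receives a Lucene Document; it is left out here.
interface ShardProviderSketch {
    String getShardIdentifier(Class<?> entityType, Object id, String idAsString);

    Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Object id, String idAsString);

    Set<String> getAllShardIdentifiers();
}

// Template class: tracks known shards and routes deletions to all of them by default.
abstract class ShardProviderTemplateSketch implements ShardProviderSketch {
    private final Set<String> knownShards = ConcurrentHashMap.newKeySet();

    protected void addShard(String shardId) {
        knownShards.add(shardId);
    }

    @Override
    public Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Object id, String idAsString) {
        // Default: the shard cannot be derived from type and id alone, so target every shard.
        return getAllShardIdentifiers();
    }

    @Override
    public Set<String> getAllShardIdentifiers() {
        return Collections.unmodifiableSet(new HashSet<>(knownShards));
    }
}

// User code looking for simplicity only implements the addition/update case.
class HashShardProviderSketch extends ShardProviderTemplateSketch {
    @Override
    public String getShardIdentifier(Class<?> entityType, Object id, String idAsString) {
        String shardId = "shard-" + Math.floorMod(idAsString.hashCode(), 2);
        addShard(shardId);
        return shardId;
    }
}
```

A subclass that does know where a document lives would override getShardIdentifiersForDeletion; everyone else inherits the "all shards" behaviour.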

Sanne Grinovero

3:06 p.m.

Hi Hardy, could you have a look at the following two commits, while I work on a test as you suggested? (Documentation will follow depending on which one you like best.)

In this case I just add the missing method, and I don't think it's bad; actually the name is fine and while I'm sure you might have some ideas on the javadoc, I think it's relatively clear what it all means: https://github.com/Sanne/hibernate-search/commit/9a1a542e551784565e6536c2...

In the following commit, which requires the previous one, I'm introducing a new interface AdvancedShardIdentifierProvider. For as much as I don't like having too many SPIs, I think we agree that this one is addressing power users only. No configuration changes are needed, the documented example stays fine as it is: the user just has the choice of optionally implementing the more advanced interface, we'll pick it up from there with a simple "instanceof". https://github.com/Sanne/hibernate-search/commit/0f22a594075ae7364d8daf72...

I'll work on a test and a docs update as soon as you can let me know which approach you prefer. Addressing some of your comments below:

On 7 October 2013 18:26, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

On 7 Oct 2013, at 5:03 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

> I've tried hard to find an agreement on this, but it seems we're wasting time without making progress. I'm not happy in ignoring a strong recommendation from any of you, very hard choice :-(

In the end it is your call. I tried to give arguments for my position, but we seem to have a general disagreement on how to develop/evolve an interface. If you want to have a specific method for deletion I recommend:

    String getShardIdentifierForAddition(Class<?> entityType, Serializable id, String idAsString, Document document);
    String getShardIdentifierForDeletion(Class<?> entityType, Serializable id, String idAsString);

So I would re-add the suffixes 'ForAddition' and 'ForDeletion'. Also I'd change the return type of getShardIdentifierForDeletion.

The return type will be _Set<String>_. 'ForAddition' is misleading as it's not used just for additions, I think we discussed these already.

...

A test would be nice as well. Maybe if we see an actual example coded out we would have a better ground for discussion.

I'll make one, but I hope you'll not be too severe: a fully fledged example would take much time, I hope to find one which is somehow providing the intuition, but not the full JMS routing example I mentioned earlier which is IMHO the strongest advocate.

...

Also, what are the concerns here? Performance, because I target all shards for deletion, or security, aka a deletion is sent to a shard which potentially belongs to a different customer?

Right, performance is a strong point, but when dealing with multitenancy you might also have legal requirements; DOSing a different tenant might be a violation of terms.

...

What is the actual performance gain between the two different scenarios? A factor of 2, 4, 10?

Depends, the benefit is obviously proportional to how many shards you have and how often you delete :-) Remember that with NRT we can do writes relatively quickly, but a delete will always require a disk sync. A disk sync is a very strong barrier of course, so we're more likely in the area of 3 to 4 orders of magnitude for a delete vs. a write. Of course we would still have a delete, but on fewer indexes.

You could think then that the cost is just a factor of how many shards, but consider also that the above cost is not actually paid for during the delete flush, but at query time: the query will trigger a pre-execution flush. So there are cases in which I might be sending deletes on index A, and running queries on index B, having queries on B run *much* faster because of the non-sync happening. On top of this you have to account for FieldCaches to be invalidated because the index is dirty, or FullTextFilters which need to be re-computed unnecessarily. It can all pile up, interacting with each other, making quite an ugly difference. Granted, I'd have to make a very unrealistic test to highlight it, so let's keep it to the theory, but I think that you can easily appreciate how it can make a significant difference in complex real world applications.

Cool idea: considering all this, I guess an interesting use case is to use a Strategy which always returns an empty set for deletions. Some people like to reindex at night to keep good performance over the day and disable our event listeners, to run the MassIndexer overnight. I guess a nice tradeoff would be to just skip delete work on the index: Hibernate Search won't return non-existing matches anyway, and the garbage would be cleaned up with the MassIndexer run overnight, but at least you would still have updates applied in real time. Might need some validation, but considering the IO cost of deletions (and indirect cost on filters and caches) I would seriously have considered such an approach as a user.

--Sanne
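The range of deletion targets described in this message, from "skip deletes until the nightly reindex" to "touch every shard", can be sketched as follows. The enum and helper names are hypothetical and exist only for illustration; the sketch shows which shards would receive delete work under each choice, not how Hibernate Search dispatches it, and the empty-set idea is, as the message says, still in need of validation.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical deletion policies, covering the range from the empty set to all shards.
enum DeletionPolicySketch {
    SKIP_UNTIL_REINDEX, // leave stale documents for the nightly MassIndexer run
    SINGLE_SHARD,       // the provider knows where the document lives
    ALL_SHARDS          // no knowledge: every shard must be touched
}

final class DeletionTargetsSketch {
    private DeletionTargetsSketch() {
    }

    // Returns the shards that would receive delete work under the given policy.
    static Set<String> targets(DeletionPolicySketch policy, String knownShard, Set<String> allShards) {
        switch (policy) {
            case SKIP_UNTIL_REINDEX:
                return Collections.emptySet();
            case SINGLE_SHARD:
                return Collections.singleton(knownShard);
            default:
                return Collections.unmodifiableSet(new TreeSet<>(allShards));
        }
    }
}
```

The size of the returned set is the number of indexes that get dirtied by one deletion, which is the quantity the performance argument above is about.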

Sanne Grinovero

5:56 p.m.

I've included an example which represents a good reason to provide the controversial method. Technically the test is crafted as a static sharding approach but is using the new API; you can easily figure the same case for a dynamic sharding case; also considering we're deprecating the older static sharding API, this should also be able to replace whatever was possible before.

Could you have a look, reject one and review/merge the other:

- https://github.com/hibernate/hibernate-search/pull/501
- https://github.com/hibernate/hibernate-search/pull/502

I personally have a mild preference for "proposal A", as I think the additional interface introduced in "proposal B" doesn't simplify but adds an additional mark on the list of things the user needs to learn about, but I'm happy with both solutions as long as we move on with one. We'll update documentation in a second PR, depending which way we go :-)

Cheers, Sanne
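As a rough illustration of the "additional interface" variant, where the engine picks up an optional advanced contract with an instanceof check, consider the sketch below. All type names and signatures are hypothetical and heavily reduced; the id format "<shard>:<local id>" is purely an assumption of this example and not something the thread proposes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, reduced versions of the two contracts; names and signatures are illustrative only.
interface BasicProviderSketch {
    Set<String> getAllShardIdentifiers();
}

// Optional extension for power users who can narrow down deletions.
interface AdvancedProviderSketch extends BasicProviderSketch {
    Set<String> getShardIdentifiersForDeletion(Class<?> entityType, String idAsString);
}

final class DeletionRoutingSketch {
    private DeletionRoutingSketch() {
    }

    // Engine-side decision: ask the provider only if it opted in to the advanced contract.
    static Set<String> shardsForDeletion(BasicProviderSketch provider, Class<?> entityType, String idAsString) {
        if (provider instanceof AdvancedProviderSketch) {
            return ((AdvancedProviderSketch) provider).getShardIdentifiersForDeletion(entityType, idAsString);
        }
        return provider.getAllShardIdentifiers();
    }
}

// Plain provider: knows its shards, nothing more.
class TwoShardProviderSketch implements BasicProviderSketch {
    @Override
    public Set<String> getAllShardIdentifiers() {
        return new TreeSet<>(Arrays.asList("customer-a", "customer-b"));
    }
}

// Advanced provider: assumes ids look like "<shard>:<local id>" (an assumption of this sketch).
class PrefixedIdProviderSketch extends TwoShardProviderSketch implements AdvancedProviderSketch {
    @Override
    public Set<String> getShardIdentifiersForDeletion(Class<?> entityType, String idAsString) {
        int separator = idAsString.indexOf(':');
        if (separator < 0) {
            // Unknown id format: fall back to every shard.
            return getAllShardIdentifiers();
        }
        return Collections.singleton(idAsString.substring(0, separator));
    }
}
```

With the single-interface variant the same default would live in a base class instead of in the engine-side instanceof branch.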

Gunnar Morling

Tuesday, 8 October Tue, 8 Oct

3:08 a.m.

Sanne,

As you say, adding yet another interface makes things even more difficult to grok, so I'd vote for adding the method for the deletion use case to SIP directly.

I'm not sure whether it has been considered before, but maybe we could unify the methods and work with a parameter object as a middle ground:

    public interface ShardIdentifierProvider {
        void initialize(Properties properties, BuildContext buildContext);
        Set<String> getShardIdentifiersForEntity(EntityShardIdentifierRetrievalContext retrievalContext);
        Set<String> getShardIdentifiersForQuery(QueryShardIdentifierRetrievalContext retrievalContext);
        Set<String> getAllShardIdentifiers();
    }

EntityShardIdentifierRetrievalContext would provide all parameters usable for shard determination, clearly stating that "document" is not available in cases such as deleting. Such a parameter object would allow for adding more options in the future in a compatible manner, and also the method names read quite nicely and symmetrically (I share Hardy's concerns about the asymmetry of getShardIdentifier() vs. getShardIdentifiersForDeletion()).

The disadvantage of this scheme is that a set needs to be returned also for the case of returning a single identifier during insert/update, which might render the approach not feasible. OTOH I'm wondering why a set needs to be returned for the delete case, your example also returns exactly one identifier?

--Gunnar

2013/10/8 Sanne Grinovero <sanne(a)hibernate.org>

...
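To make the parameter-object idea more concrete, here is a minimal sketch of what an implementor's code might look like. Everything in it is an assumption for illustration: the context class is a hypothetical reduction of the proposed EntityShardIdentifierRetrievalContext, a plain Map stands in for the Lucene Document, and the "language" field is invented. It also shows the branch on "document available or not" that an implementor would have to write.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical parameter object. A Map stands in for the Lucene Document so the sketch has no dependencies.
final class EntityRetrievalContextSketch {
    private final Class<?> entityType;
    private final String idAsString;
    private final Map<String, String> document; // null when deleting

    EntityRetrievalContextSketch(Class<?> entityType, String idAsString, Map<String, String> document) {
        this.entityType = entityType;
        this.idAsString = idAsString;
        this.document = document;
    }

    Class<?> getEntityType() {
        return entityType;
    }

    String getIdAsString() {
        return idAsString;
    }

    // The document is optional by contract: it is absent in cases such as deletion.
    Optional<Map<String, String>> getDocument() {
        return Optional.ofNullable(document);
    }
}

// Shards by a "language" field when the document is available, otherwise falls back to all shards.
// Not thread-safe; a real provider would need a concurrent collection.
class LanguageShardProviderSketch {
    private final Set<String> knownShards = new TreeSet<>();

    Set<String> getShardIdentifiersForEntity(EntityRetrievalContextSketch context) {
        Optional<Map<String, String>> document = context.getDocument();
        if (document.isPresent()) {
            String shardId = document.get().getOrDefault("language", "unknown");
            knownShards.add(shardId);
            return Collections.singleton(shardId);
        }
        // Deletion case: the sharding property is unknown, so every shard is a candidate.
        return getAllShardIdentifiers();
    }

    Set<String> getAllShardIdentifiers() {
        return Collections.unmodifiableSet(new TreeSet<>(knownShards));
    }
}
```

New inputs (for example the entity itself) could later be added to the context class without changing the provider's method signatures, which is the compatibility argument made for this variant.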

Hardy Ferentschik

4:03 a.m.

On 8 Oct 2013, at 10:08 AM, Gunnar Morling <gunnar(a)hibernate.org> wrote:

...

I'm not sure whether it has been considered before, but maybe we could unify the methods and work with a parameter object as a middle ground:

It has been. I suggested before to combine the methods. I think it is a good approach, but Sanne thought it is bad that the user has to deal with null values.

...

    public interface ShardIdentifierProvider {
        void initialize(Properties properties, BuildContext buildContext);
        Set<String> getShardIdentifiersForEntity(EntityShardIdentifierRetrievalContext retrievalContext);
        Set<String> getShardIdentifiersForQuery(QueryShardIdentifierRetrievalContext retrievalContext);
        Set<String> getAllShardIdentifiers();
    }

It is a reasonable approach. Better than what we have. I am torn between this one and the additional interface. This approach has the advantage that, as you say, we can modify the internals of the context instances.

...

EntityShardIdentifierRetrievalContext would provide all parameters usable for shard determination, clearly stating that "document" is not available in cases such as deleting. Such a parameter object would allow for adding more options in the future in a compatible manner, and also the method names read quite nicely and symmetrically (I share Hardy's concerns about the asymmetry of getShardIdentifier() vs. getShardIdentifiersForDeletion()).

All correct imo.

...

The disadvantage of this scheme is that a set needs to be returned also for the case of returning a single identifier during insert/update, which might render the approach not feasible. OTOH I'm wondering why a set needs to be returned for the delete case, your example also returns exactly one identifier?

See the comments I made on the pull requests. I also don't think the deletion case should return a set. Either one knows the shard id and returns it, or one does not know and has to return null. In the latter case it is up to us to apply the default strategy. IMO it is wrong to say to the user "if you don't know just return all shard ids". It takes away the option from us to distinguish between these two cases. --Hardy
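The single-identifier alternative argued for here could be resolved on the engine side roughly as in the sketch below. The class and method names are hypothetical; the point is only that a null return keeps the "provider did not know" case distinguishable from "provider chose all shards", leaving the default to the framework.

```java
import java.util.Collections;
import java.util.Set;

final class SingleShardDeletionSketch {
    private SingleShardDeletionSketch() {
    }

    // shardIdOrNull is what a hypothetical single-valued deletion method returned.
    static Set<String> resolveTargets(String shardIdOrNull, Set<String> allShardIdentifiers) {
        if (shardIdOrNull == null) {
            // The provider could not tell: the engine, not the user, picks the default.
            return Collections.unmodifiableSet(allShardIdentifiers);
        }
        return Collections.singleton(shardIdOrNull);
    }

    // The distinction a Set-returning contract would hide from the engine.
    static boolean providerKnewShard(String shardIdOrNull) {
        return shardIdOrNull != null;
    }
}
```

Note that this shape cannot express the "empty set" case raised elsewhere in the thread, which is one of the points of disagreement.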

Sanne Grinovero

4:56 a.m.

We'll need to time cap this discussion as we're way too late, of course this will need to be tagged @experimental. Having said that, let's try to find the best proposal possible by lunch time, as one of the approaches needs to be merged: it's very clear that there is big practical value for the user to narrow down deletions. The parameter object approach got my interest but I'll need an example, including the user code implementing the methods, as I suspect a terrible result. I'm pretty sure I would implement this API with small and short methods. As an implementor of such an interface, the first thing I would do is to figure if I'm in the delete case or not - since it's special - and branch off between two methods, after writing the condition check. So I already have 3 methods, and a condition which I could copy paste from the book. End result, I still have to implement the two methods I'm proposing (in addition!), but first I need to understand that "some parameters might be missing" at runtime. That's an absolutely bad idea: using null as a canary token in a method is awful, especially if it's our code calling into user code. The method has to be a clear contract even without reading the javadoc, providing such an explanation in English prose is by far not as clear; defining a contract of two different contexts - even maybe just overloading the same method - is a much clearer set of instructions to the implementor. On 8 October 2013 10:03, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

In the mail I've sent yesterday I've described an interesting use case to return an empty set for nightly index rebuilds. That should highlight that depending on the scenario there is a full range of options, from the empty set to the universe set. We're clearly talking about sets. BTW It was your proposal initially to use Set, that was a very good one. I don't see the need for us to "distinguish between these two cases", for one because there are more than two cases, but especially as we have no use of making this distinction. Sanne

Hardy Ferentschik

5:30 a.m.

On 8 Oct 2013, at 11:56 AM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

Having said that, let's try to find the best proposal possible by lunch time, as one of the approaches needs to be merged:

I think we very well could go w/ the current state right now and evolve in future versions. I would like to explore Gunnar's idea for example. And of course the whole idea of determining the shard id when the document is built. Given that the old approach is just deprecated nothing is lost for the user yet. We would just be more conservative on what we offer right now.

...

The parameter object approach got my interest but I'll need an example, including the user code implementing the methods, as I suspect a terrible result.

What would be terrible about it? In fact if we do the shard id determination earlier you might in some cases even have an entity in the deletion case.

...

I'm pretty sure I would implement this API with small and short methods. As an implementor of such an interface, the first thing I would do is to figure if I'm in the delete case or not - since it's special - and branch off between two methods, after writing the condition check. So I already have 3 methods, and a condition which I could copy paste from the book. End result, I still have to implement the two methods I'm proposing (in addition!), but first I need to understand that "some parameters might be missing" at runtime. That's an absolutely bad idea: using null as a canary token in a method is awful, especially if it's our code calling into user code.

We just have to disagree on this one. I find the API as proposed confusing. Simple string returns via sets and confusing names. I would rather have a context object with clear semantics.

...

The method has to be a clear contract even without reading the javadoc, providing such an explanation in English prose is by far not as clear;

What is the difference to what you are proposing?

    String getShardIdentifier(Class<?> entityType, Serializable id, String idAsString, Document document);
    Set<String> getShardIdentifier(Class<?> entityType, Serializable id, String idAsString);

So you are saying one can determine the use of each of the methods just by its contract?

...

defining a contract of two different contexts - even maybe just overloading the same method - is a much clearer set of instructions to the implementor.

Not always. --Hardy

Sanne Grinovero

6:38 a.m.

Guys, let's put this into perspective. The arguments I'm hearing against adding a method to a power-user oriented SPI are far outweighed by the harm they do to the project in terms of release delays and our very own time; there are definitely more interesting issues to dedicate our time to. I appreciate the tech discussions, but ultimately here we're talking about an experimental interface which most users won't care about. Some other users will have very specific high end requirements, and those are our target: I don't appreciate how we spend more than 30 minutes arguing how these smart guys might get confused by a method name.

We're not changing the Session contract or anything big like that. We're providing a damn useful feature, and really the method name or signature is not so relevant, but it's important that we address the right problem:

- sane (no null parameters)
- fulfill the requirements of flexibility that we expect from a user extension point (be able to return a Set)
- make sure it's not a performance bottleneck (implementable without too many object allocations)

Given this, I'd prefer you to merge my PR from branch HSEARCH-1429 as it fulfills all the requirements (that's pull https://github.com/hibernate/hibernate-search/pull/502 ) and move on, unless you have some really good argument against it, putting the time & features into perspective.

Alternatively, for the sake of moving forward, I'll craft a pull which just adds the @Experimental and some docs warnings, but I think we're failing to deliver a good feature which is ready to be delivered today -> very sad.

Sanne

Hardy Ferentschik

7:06 a.m.

On 8 Oct 2013, at 1:38 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

The discussion is about more than adding a single method. We have a general disagreement on how to evolve an API.

...

I appreciate the tech discussions, but ultimately here we're talking about an experimental interface which most users won't care about.

I dislike the argument that just because I mark an interface as experimental I can add whatever I want to it, abandoning other sound design principles.

...

- sane (no null parameters)

I'd rather deal with a potential null parameter in a context object in exchange for a simple and easy to grasp interface.

...

- fulfill the requirements of flexibility that we expect from a user extension point (be able to return a Set)

IMO returning a set is pure guesswork. Either you know your shard id or not. In which case would you return multiple?

...

Alternatively for the sake of moving forward, I'll craft a pull which just adds the @Experimental and some docs warnings, but I think we're failing to deliver a good feature which is ready to be delivered today -> very sad.

Still my preferred choice for now in order to get the release out of the door. --Hardy

Gunnar Morling

6:25 a.m.

2013/10/8 Sanne Grinovero <sanne(a)hibernate.org>

...

Yes, the implementation would likely look as you describe. Personally I think users would be able to deal with an optional property of a parameter object and act accordingly; I can see your concern, though. When setting out with getShardIdentifier()/getShardIdentifierForDeletion(), there might be a proliferation of methods when supporting new types of input such as the actual entity; we'd then have to add getShardIdentifierByEntity() and getShardIdentifierByEntityForDeletion() and so on. I guess both approaches would work, in the end it's not that huge a difference.

On 8 October 2013 10:03, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

...

> > See the comments I made on the pull requests. I also don't think the deletion case should return a set. Either one knows the shard it and returns it or one does not know and you have to return null. In the latter case it is up to us to apply the default strategy. IMO it is wrong to say to the user "if you don't know just return all shard ids". It takes away the option from us to distinguish between these two cases.
>
> In the mail I've sent yesterday I've described an interesting use case to return an empty set for nightly index rebuilds. That should highlight that depending on the scenario there is a full range of options, from the empty set to the universe set. We're clearly talking about sets. BTW It was your proposal initially to use Set, that was a very good one. I don't see the need for us to "distinguish between these two cases", for one because there are more than two cases, but especially as we have no use of making this distinction.

Here I'm still missing a piece I guess. So far we spoke about returning null, the *empty* set or a set with *all* ids. Is there an example where one would return *some* ids (the logging example returns exactly one id)? In other words, I can't imagine a case where one would conclude *some* shard ids from a given entity which IMO would really justify using a set. If it is only about returning the info "XYZ is the exact shard id" vs. "can't determine a shard id, so consider all", a defined constant may be returned in the latter case, signaling that getAllShardIds() is to be called after that. Regarding your example of returning the empty set to skip index updates on deletions, how would that work in a non-sharded environment? Is using the shard id provider really the right approach for this problem? --Gunnar Sanne

...

_______________________________________________ hibernate-dev mailing list hibernate-dev(a)lists.jboss.org https://lists.jboss.org/mailman/listinfo/hibernate-dev

Emmanuel Bernard

8:59 a.m.

I don't have much stake in the specialized method vs context object debate as indeed the interface is very specialized and prone to changes. But as Sanne mentioned, there are memory pressure consequences if this call is in the hot path.

It is correct that the current use of ForDeletion requires a non-remote, non-async backend at the moment. That's something I discussed with Sanne back when I implemented it. It's not hard to imagine how we could transport such information in a later version but that would require additional contracts.

The use case I designed dynamic sharding for is to:

- create one index per user (think login)
- query only by a specific index
- apply mutation and deletion on a single index
- support 100s of users (i.e. shards) per VM instance

I also had Bloom filters in mind when I designed the original sharding strategy. For these scenarios, a smart ForDeletion is necessary as you don't want to open / query hundreds of indexes unnecessarily. That Hardy thinks the use case is wrong is beyond me but if that's the general feeling, that's fine, I'll fork Hibernate Search and make it useful for me.

For the record, I handed over a working solution 6 months short of 4 days... I am sure it was not perfect, but certainly not 6 months away from it. I know you guys wanted support to inject a Session to resolve shards which has put some significant constraints on the life cycle. But still.

Conclusion

Draw your own, I'm out of it.

Emmanuel

On Mon 2013-10-07 16:03, Sanne Grinovero wrote:

...


Emmanuel Bernard

Wednesday, 2 October Wed, 2 Oct

8:29 a.m.

On Tue 2013-09-24 10:51, Hardy Ferentschik wrote:

...

String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString); all together. Here is my reasoning.

AFAIU, the method is there for the deletion of documents. In this case we don't have the Lucene document nor the entity and we need to know in which shard the document to delete is. The assumption behind this method is that somehow, given the type and id, I am able to provide this shard or a subset of the shards. I doubt, however, that this is practically ever possible. In the end most implementations will have to just delegate to getAllShardIdentifiers() anyways. Take the language code example or any other case where I shard depending on a given property of the entity. In this case I will never be able to make any use of #getShardIdentifiers(Class<?>, Serializable, String).

In fact the same arguments probably apply to getShardIdentifiersForQuery. What is the use case for that really? In which use case can the set of targeted shards be limited based on knowing the type of filters we apply?

So why not remove #getShardIdentifiers and #getShardIdentifiersForQuery and start off with a much simpler interface? We can indeed mark it as experimental and if the need arises (based on a true use case) think about optimisations. The more I think about it, the more I like this more minimalistic approach.

When ISStrategy was introduced, the idea was that the implementor could somehow receive information from the runtime identifying the right (set of) shard(s). For example, in a multi-tenant app I *know* which shard the currently logged-in user is allowed to tamper with, so I will always filter by that shard, even for deletion.
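The two positions above can be contrasted in a small sketch. It assumes a simplified, hypothetical SimpleShardIdProvider interface that only mirrors the shape of the proposed ShardIdentifierProvider: a Map stands in for the Lucene Document, ids are plain Strings, and the thread-local tenant holder is invented for illustration. A provider that shards on an entity property (language) cannot narrow a deletion, while one that shards on runtime context (tenant) can.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for the proposed ShardIdentifierProvider.
// A Map replaces the Lucene Document so the sketch compiles on its own.
interface SimpleShardIdProvider {
    String getShardIdentifier(Class<?> entity, String id, Map<String, String> document);

    String[] getShardIdentifiers(Class<?> entity, String id);

    String[] getAllShardIdentifiers();
}

// Shards on an entity property. On deletion only type and id are known,
// so the provider can do no better than return every known shard.
class LanguageShardIdProvider implements SimpleShardIdProvider {
    private final Set<String> knownShards = new TreeSet<>();

    @Override
    public synchronized String getShardIdentifier(Class<?> entity, String id, Map<String, String> document) {
        String shard = document.getOrDefault("language", "unknown");
        knownShards.add(shard);
        return shard;
    }

    @Override
    public String[] getShardIdentifiers(Class<?> entity, String id) {
        return getAllShardIdentifiers();
    }

    @Override
    public synchronized String[] getAllShardIdentifiers() {
        return knownShards.toArray(new String[0]);
    }
}

// Shards on runtime context (the current tenant), which is still available
// at deletion time, so a delete can be narrowed to a single shard.
class TenantShardIdProvider implements SimpleShardIdProvider {
    // Assumed to be set by application code before any indexing work.
    static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();
    private final Set<String> knownShards = new TreeSet<>();

    private static String currentTenant() {
        String tenant = CURRENT_TENANT.get();
        if (tenant == null) {
            throw new IllegalStateException("no tenant bound to the current thread");
        }
        return tenant;
    }

    @Override
    public synchronized String getShardIdentifier(Class<?> entity, String id, Map<String, String> document) {
        String shard = currentTenant();
        knownShards.add(shard);
        return shard;
    }

    @Override
    public String[] getShardIdentifiers(Class<?> entity, String id) {
        return new String[] { currentTenant() };
    }

    @Override
    public synchronized String[] getAllShardIdentifiers() {
        return knownShards.toArray(new String[0]);
    }
}
```

Whether the context-based case is common enough to justify keeping the deletion method in the SPI is exactly the point under discussion; the sketch only shows that both behaviours fit the same method shape.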

Gunnar Morling

Tuesday, 24 September Tue, 24 Sep

4:25 a.m.

2013/9/24 Sanne Grinovero <sanne(a)hibernate.org>

...

Correct me if I'm wrong, but trying to synthesize this discussion I think that we're fundamentally agreeing that dynamic sharding is a "better replacement" for static sharding.

Yes, from what I understand I think that's right.

To me, the question really is: what actually is the difference between "dynamic" and "static" sharding? Aren't both just two variants of the same concept? In both cases we examine a document (or entity) and determine the right shard at runtime. For one implementation of this mechanism we just happen to have an additional configuration option, which is the number of shards for hash-based sharding. So in one case we know the number of shards up-front, while in other cases we don't.

But is this really a conceptual difference which should be reflected in different SPI interfaces? Which part of the "new" dynamic sharding (i.e. where we don't know the number of shards up-front) can't be addressed by implementing the existing IndexShardingStrategy interface?

If the existing getIndexManagerXy() methods are fine and it is only about additional information to be passed at initialization time, we could also do the following:

* inject this information into marked constructors or fields at instantiation time (this would be a new paradigm in Search, though)
* provide an additional contract only for this initialization and support IndexShardingStrategy implementations which also implement this initialization contract:

    public interface ShardingStrategyInitializable {
        void initialize(WhatEverNeeded initializationContext);
    }

    public class MyDynamicShardStrategy implements IndexShardingStrategy, ShardingStrategyInitializable {
        ...
    }

In both cases a "dynamic" sharding strategy implementation would be essentially the same as any other strategy; it would only retrieve additional information during initialization. This seems more consistent and easier to grok to me than handling "dynamic" sharding in a different manner.

Such an approach would allow us to add support for the feature in 4.4 without breaking stuff and give us the chance to create a new contract in 5 if required, e.g. (optionally) based on the entity, with fewer methods etc.

Btw. I think this case is a good example showing that it is a good idea to provide abstract base classes for SPI contracts, allowing new methods to be added in later versions without breaking existing implementations (at least until we can use default methods from Java 8).

--Gunnar
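A minimal sketch of the optional initialization contract suggested above, under stated assumptions: SimpleShardingStrategy is a hypothetical one-method stand-in for IndexShardingStrategy, java.util.Properties stands in for the unspecified WhatEverNeeded context, and StrategyBootstrap is an invented helper. The bootstrap code calls initialize only on strategies that opt in, so existing implementations keep working unchanged.

```java
import java.util.Properties;

// Hypothetical one-method stand-in for IndexShardingStrategy.
interface SimpleShardingStrategy {
    String shardFor(String entityId);
}

// Optional contract; Properties replaces the unspecified "WhatEverNeeded".
interface ShardingStrategyInitializable {
    void initialize(Properties initializationContext);
}

// "Static" strategy: needs the shard count, so it opts into initialization.
class HashStrategy implements SimpleShardingStrategy, ShardingStrategyInitializable {
    private int nbrOfShards = 1;

    @Override
    public void initialize(Properties initializationContext) {
        nbrOfShards = Integer.parseInt(initializationContext.getProperty("nbr_of_shards", "1"));
    }

    @Override
    public String shardFor(String entityId) {
        // floorMod keeps the result non-negative for negative hash codes.
        return String.valueOf(Math.floorMod(entityId.hashCode(), nbrOfShards));
    }
}

// "Dynamic" strategy: the shard name is derived from the data itself,
// so no shard count and no initialization are needed.
class PrefixStrategy implements SimpleShardingStrategy {
    @Override
    public String shardFor(String entityId) {
        return entityId.isEmpty() ? "_" : entityId.substring(0, 1);
    }
}

final class StrategyBootstrap {
    private StrategyBootstrap() {
    }

    // Instantiates the strategy and initializes it only if it asks for it.
    static SimpleShardingStrategy create(Class<? extends SimpleShardingStrategy> type, Properties props) {
        try {
            SimpleShardingStrategy strategy = type.getDeclaredConstructor().newInstance();
            if (strategy instanceof ShardingStrategyInitializable) {
                ((ShardingStrategyInitializable) strategy).initialize(props);
            }
            return strategy;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }
}
```

With this shape both kinds of strategy go through the same creation path; only the opt-in interface distinguishes a strategy that needs configuration from one that does not.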

Emmanuel Bernard

Wednesday, 2 October Wed, 2 Oct

8:21 a.m.

What does Iterable<String> give you over String[]?

On Mon 2013-09-23 23:04, Sanne Grinovero wrote:

...

Correct me if I'm wrong, but trying to synthesize this discussion I think that we're fundamentally agreeing that dynamic sharding is a "better replacement" for static sharding. Still, let's keep in mind that this needs to be a backwards compatible patch, so we're not looking for something disruptive of the current static sharding feature. If we end up agreeing that the better API needs disruption, we should still make an incremental change available in this version - as experimental - so that people can play with it, then we're free to reset the API as we wish in 5.0 but at least including the newly acquired experience.

We then had some brainstorming on IRC which concluded that it would be probably more user friendly to have the __Strategy to:

a) not return arrays but simple collections
b) avoid the confusion on the two too similar methods on the proposal for ShardIdentifierProvider (next paragraph)
c) not apply the sharding logic based on the Document (fields) but on the entity (the user type triggering the index event)

As a reminder, the proposal for ShardIdentifierProvider is - omitting initialize - :

    /**
     * Returns the shard identifier upon addition.
     */
    String getShardIdentifier(Class<?> entity, Serializable id, String idInString, Document document);

    /**
     * Returns the set of shard identifiers upon deletion.
     */
    String[] getShardIdentifiers(Class<?> entity, Serializable id, String idInString);

    /**
     * Returns the set of shard identifiers for a query.
     */
    String[] getShardIdentifiersForQuery(FullTextFilterImplementor[] fullTextFilters);

    /**
     * Returns the list of all known shard identifiers.
     * The list can vary between calls.
     */
    String[] getAllShardIdentifiers();

#a Ok we can consider that but let's see how the code turns out. Iterable<String> perhaps?
#b we explored converging the two methods into the essential one:

    String getShardIdentifier(Class<?> entity, Serializable id, String idInString);

but that seems very poor in terms of flexibility: it doesn't even allow access to the basic properties of the indexed entity. So that degenerated into proposal #c, but before moving to #c I'd venture that these methods aren't that bad, they just need good documentation. Also, even if we move the focus from the Document to the Entity, we still don't have the fully loaded entity during a delete operation, so we would still need the second method returning multiple indexes.

#c drawbacks first:
= to expose the entity directly has probably some risk (the user making changes to it), but that's no different to what the user can already do from a FieldBridge / ClassBridge.
= won't be able to consider the output of FieldBridge / ClassBridge instances as you won't have the Document
= for deletion you still don't have the entity

We could list some conceptually interesting advantages here, but I'd like to shut down this feature for the time being because it's significantly different than the goal of providing Dynamic Sharding as a feature. I'm not saying that it doesn't have merit: it's probably worth exploring for 5+ but it rather seems like an _additional_ level of sharding that we might want to add in future as an alternative to the one focusing on the Document approach.

So assuming that sounds reasonable, and that we're on the same page for #a, let's focus on #b: the duality of the methods for add/remove during sharding. My position is that it worked well so far on static sharding, and that the proposal is quite consistent with it so it wouldn't be much of a pain for people to adapt to the new model.
Let's try polishing the method names (and use your imagination for a well written javadoc):

    String getShardIdFromDocument(Class<?> entity, Serializable id, String idInString, Document document);
    Iterable<String> getShardIdsFromId(Class<?> entity, Serializable id, String idInString);

or maybe to highlight what's fundamentally different:

    String getShardIdFromContext(Class<?> entity, Serializable id, String idInString, Document document);
    Iterable<String> getShardIdsFromReducedContext(Class<?> entity, Serializable id, String idInString);

I especially like the javadoc! well done. WDYT?

Bonus dilemma: should we stay away from String and define some "IndexIdentifier" interface? Note that while design wise it might look good, this is on a very hot path so while I'd like proposing such an SPI it would be great if we could find a way in which this would not require to allocate objects at runtime just to replace a mere string with a safer type.

Sanne

On 23 September 2013 13:58, Hardy Ferentschik <hardy(a)hibernate.org> wrote:
>
> On 23 Jan 2013, at 1:55 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:
>
>>> Or I set a custom sharding strategy which does not care about the number of shards?
>>
>> I think that's far fetched. The NBR_OF_SHARDS option defines the size
>> of the array of indexes passed to the IndexShardingStrategy so it's
>> hard to ignore. Sure it's possible, we could throw a
>> log.userIsAnIdiotException() but someone might not see the humor :-).
>> Worst case it degenerates in a case similar to your example of
>> "NotShardedStrategy and nbr_of_shards >1", or the user would notice
>> with an ArrayOutOfBounds.. usercode problem.
>
> I think I am thinking about a custom dynamic sharding strategy in this case.
>
>>> IMO the important factor is to set the right sharding strategy and nbr_of_shards should just be a (optional) parameter to the sharding strategy.
>>
>> Note that so far we don't expect users to explicitly set the
>> NotShardedStrategy: it's simply a consequence of not having set any
>> option; if the user sets only the number of shards but omits picking a
>> specific strategy, we automatically assume he's going for the
>> IdHashShardingStrategy
>
> I get that, but as I said, I see it as a bit of misguided "ease of use", since it can introduce
> problems (see above) and is not consistent.
>
>> As soon as a different IndexShardingStrategy is chosen, then I think
>> it's quite self-explanatory that setting NBR_OF_SHARDS is quite
>> useful: the user will have coded an explicit IndexShardingStrategy and
>> consequently have a clear idea of how many shards he wants, at
>> least for the static sharding so far.
>
> Right, for static sharding maybe, but we are going to make this dynamic now.
>
>>> With dynamic sharding things get more complicated. Right now you configure dynamic sharding by setting 'nbr_of_shards' to the literal 'dynamic'. This selects under the hood the
>>> right IndexShardingStrategy (DynamicShardingStrategy). I find it misleading on multiple levels. First 'dynamic' is not a number and secondly I want to configure a strategy
>>> not the number of shards. It is also inconsistent with how we select/configure other pluggable components in Search. For that reason I suggest:
>>>
>>> - The type of sharding is configured via setting hibernate.search.[indexName].sharding_strategy. 'nbr_of_shards' is a parameter which gets passed to the strategy and which
>>> might get ignored depending on the sharding implementation. Implementations are free to check the property and e.g. print out a warning if the setting does not apply to them
>>
>> Conceptually it sounds nice.
>> I see two downsides:
>> - it pushes complexity to the IndexShardingStrategy implementor (the
>> user) as he needs to parse it and somehow he needs to request those
>> indexes from the SearchFactory to be built.
>> Pushing both these
>> responsibilities to the end user in exchange for a one-liner in the
>> configuration file seems like an odd choice? I would agree if it was
>> us to write the code, but I really expect most people to plug their
>> own strategy as IdHashShardingStrategy isn't very useful in a real
>> world app.
>
> I think that if the API to create an index manager is simple it basically unifies static
> and dynamic sharding.
>
>> - today we pre-initialize the indexes (IndexManagers) before they are
>> passed to the IndexShardingStrategy # initialize method. We would need
>> to pass instead some lifecycle-controlling objects which allows the
>> user to trigger index initialization. Again I essentially agree but
>> that sounds much like dynamic sharding?
>
> With the right API both would be possible in a simple to understand way.
>
>> I don't think we can change these in the scope of 4.4 as it affects
>> current API. Shall we take this inconsistency point as
>> yet-another-reason to migrate to Dynamic Sharding? While the new
>> feature matures, I suspect it could completely replace the static one.
>
> I think so as well, in which case we need to make sure that we get the API right. A
> new/updated initialise contract might be exactly what we need instead of yet another patch.
> See also my email regarding ShardIdentifierProvider.
>
>>> - We introduce short names for the provided sharding strategies - 'none', 'id-hash', 'dynamic'. This will avoid the need to reference concrete implementation classes
>>
>> -1 : as I reminded above, I don't expect id-hash to be of practical
>> use, people want to plug their own strategies which implies we need
>> the concrete implementation classes. I'd rather see the
>> IdHashShardingStrategy as a concrete example we're providing (not just
>> an example, I guess someone might find it useful in production, I just
>> think it's a minority of the IndexShardingStrategy users).
>
> IdHashShardingStrategy is in use right now, at least if you enable sharding without any other specific
> IndexShardingStrategy implementation. Providing a short name for it is in line with configuration options
> like 'ram' or 'filesystem' for directory providers. We could use 'default' to hide the fact which impl
> we are using. This way we could even replace the impl in case we find a better one.
>
>>> - For dynamic sharding we have the additional sub-property 'shard_identity_provider' which specifies the ShardIdentifierProvider (new contract needed for dynamic sharding).
>>> This property is only relevant for dynamic sharding and will be handled in the same way as 'nbr_of_shards'
>>
>> To recap today we have
>> hibernate.search.[indexName].sharding_strategy = [implementation
>> of IndexShardingStrategy]
>>
>> Would it not be nice if I could either specify an implementation of
>> IndexShardingStrategy or a ShardIdentifierProvider ?
>> hibernate.search.[indexName].sharding_strategy = [implementation
>> of IndexShardingStrategy | ShardIdentifierProvider]
>
> hmm, I have not thought about it this way. So far I was more thinking along the
> lines of removing ShardIdentifierProvider. But you are proposing to keep it
> and maybe in the long run remove IndexShardingStrategy?
>
>> - being the same property keeps it clear that you can either specify
>> one OR the other.
>
> Except that we are talking about two different interfaces. Hardly good practice to offer this type of
> configuration.
>
> --Hardy

_______________________________________________
hibernate-dev mailing list
hibernate-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev
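The configuration scheme proposed in the quoted mail (strategy selected through 'sharding_strategy', short names for the built-in strategies, 'nbr_of_shards' as an optional parameter) could be resolved roughly as sketched here. This is only an illustration: ShardingConfig and its methods are hypothetical, the short names map to the simple class names mentioned in the thread rather than to real fully qualified names, and defaulting to 'none' when nothing is set is an assumption.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper; not part of Hibernate Search.
final class ShardingConfig {
    // Short names proposed in the thread, mapped to the simple class names
    // it mentions (placeholders, not fully qualified class names).
    private static final Map<String, String> SHORT_NAMES = Map.of(
            "none", "NotShardedStrategy",
            "id-hash", "IdHashShardingStrategy",
            "dynamic", "DynamicShardingStrategy");

    private ShardingConfig() {
    }

    private static String strategyKey(String indexName) {
        return "hibernate.search." + indexName + ".sharding_strategy";
    }

    // Returns the strategy for an index: a known short name is translated,
    // anything else is treated as a user-supplied class name.
    static String resolveStrategy(Properties props, String indexName) {
        String value = props.getProperty(strategyKey(indexName), "none").trim();
        return SHORT_NAMES.getOrDefault(value, value);
    }

    // 'nbr_of_shards' is only a parameter of the strategy; a strategy that
    // does not need it simply never asks for it.
    static int nbrOfShards(Properties props, String indexName, int defaultValue) {
        String key = strategyKey(indexName) + ".nbr_of_shards";
        String value = props.getProperty(key);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(key + " must be a number but was '" + value + "'", e);
        }
    }
}
```

In this reading a value such as 'dynamic' is only ever valid for 'sharding_strategy', which addresses the objection that 'dynamic' is not a number.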

