[hibernate-dev] [Search] Dynamic sharding configuration

Mon Oct 7 16:06:48 EDT 2013

Hi Hardy,
could you have a look at the following two commits, while I work on a
test as you suggested.
(documentation will follow depending on which one you like best).

In this first commit I just add the missing method. I don't think it's
bad: the name is fine, and while I'm sure you might have some
ideas on the javadoc, I think it's relatively clear what it all means:

https://github.com/Sanne/hibernate-search/commit/9a1a542e551784565e6536c2ea56f1a9cb29e535

In the following commit, which requires the previous one, I'm
introducing a new interface AdvancedShardIdentifierProvider.
As much as I dislike having too many SPIs, I think we agree
that this one is addressing power users only.
No configuration changes are needed, the documented example stays fine
as it is: the user just has the choice of optionally implementing the
more advanced interface, we'll pick it up from there with a simple
"instanceof".

https://github.com/Sanne/hibernate-search/commit/0f22a594075ae7364d8daf7218a4bb8656ad6aca
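
To illustrate the "instanceof" pick-up, here is a rough sketch. The
interface shapes are simplified assumptions for illustration (the real
signatures are in the commits linked above and may differ), and
DeletionRouter, BasicProvider and PrefixRoutedProvider are hypothetical
names that exist only in this example.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-ins for the two SPIs discussed here.
// The real signatures live in the linked commits and may differ.
interface ShardIdentifierProvider {
    Set<String> getAllShardIdentifiers();
}

// Optional extension for power users (assumed shape, not the committed one).
interface AdvancedShardIdentifierProvider extends ShardIdentifierProvider {
    Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Serializable id, String idAsString);
}

final class DeletionRouter {
    // Uses the narrow deletion target when the user opted in, otherwise
    // falls back to sending the delete to every known shard.
    static Set<String> shardsForDeletion(ShardIdentifierProvider provider,
            Class<?> entityType, Serializable id, String idAsString) {
        if (provider instanceof AdvancedShardIdentifierProvider) {
            return ((AdvancedShardIdentifierProvider) provider)
                    .getShardIdentifiersForDeletion(entityType, id, idAsString);
        }
        return provider.getAllShardIdentifiers();
    }
}

// Basic provider: knows its shards, nothing more.
class BasicProvider implements ShardIdentifierProvider {
    public Set<String> getAllShardIdentifiers() {
        return new TreeSet<String>(Arrays.asList("shard-a", "shard-b", "shard-c"));
    }
}

// Advanced provider: pins a deletion to one shard from the id alone, using a
// made-up "shard-name:" prefix in the string form of the id.
final class PrefixRoutedProvider extends BasicProvider implements AdvancedShardIdentifierProvider {
    public Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Serializable id, String idAsString) {
        int sep = idAsString.indexOf(':');
        return sep < 0 ? getAllShardIdentifiers() : Collections.singleton(idAsString.substring(0, sep));
    }
}
```

A user who only implements the basic interface keeps today's behaviour;
one who implements the advanced one gets deletions routed to fewer
shards, with no configuration change in either case.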

I'll work on a test, and docs update as soon as you can let me know
which approach you prefer.

Addressing some of your comments below:

On 7 October 2013 18:26, Hardy Ferentschik <hardy at hibernate.org> wrote:
>
> On 7 Jan 2013, at 5:03 PM, Sanne Grinovero <sanne at hibernate.org> wrote:
>
>> I've tried hard to find an agreement on this, but it seems we're
>> wasting time without making progress.
>> I'm not happy in ignoring a strong recommendation from any of you,
>> very hard choice :-(
>
> In the end it is your call. I tried to give arguments for my position, but we seem
> to have general disagreement on how to develop/evolve an interface.
>
> If you want to have a specific method for deletion I recommend:
>
> String getShardIdentifierForAddition(Class<?> entityType, Serializable id, String idAsString, Document document);
> String getShardIdentifierDeletion(Class<?> entityType, Serializable id, String idAsString);
>
> So I would re-add the suffixes 'ForAddition' and 'ForDeletion'. Also I'd change the return type of
> getShardIdentifierDeletion.

The return type will be _Set<String>_.
'ForAddition' is misleading as it's not used just for additions, I
think we discussed these already.
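
A small sketch of why a set is the natural return type on the deletion
path. Everything here is an assumption made for illustration: the class
name, the method names, the "language" field, and the use of a plain Map
in place of the Lucene Document (to keep the example free of
dependencies). It is also not thread-safe.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical provider sharding by entity type and a "language" field.
// A Map stands in for the Lucene Document in this sketch.
final class LanguageShardProvider {
    private final Set<String> knownShards = new TreeSet<String>();

    // Path with a document available: exactly one shard can be picked.
    String getShardIdentifier(Class<?> entityType, Serializable id, String idAsString,
            Map<String, String> document) {
        String shard = entityType.getSimpleName() + "_" + document.get("language");
        knownShards.add(shard);
        return shard;
    }

    // Deletion path: no document, so the language is unknown. Return every
    // shard that could hold this entity type, and nothing else.
    Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Serializable id, String idAsString) {
        String prefix = entityType.getSimpleName() + "_";
        Set<String> candidates = new TreeSet<String>();
        for (String shard : knownShards) {
            if (shard.startsWith(prefix)) {
                candidates.add(shard);
            }
        }
        return candidates;
    }
}
```

With a single String return value this provider would have no way to
say "one of these two shards, but not the others".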

>
> A test would be nice as well. Maybe if we see an actual example coded out we would have a better ground
> for discussion.

I'll make one, but I hope you won't be too severe: a fully fledged
example would take a lot of time. I hope to find one that conveys
the intuition, though not the full JMS routing example I
mentioned earlier, which is IMHO the strongest argument.

> Also, what are the concerns here? Performance, because I target all shards for deletion, or security, aka
> a deletion is sent to a shard which potentially belongs to a different customer.

Right, performance is a strong point, but when dealing with
multitenancy you might also have legal requirements; DOSing a
different tenant might be a violation of terms.

>
> What is the actual performance gain between the two different scenarios? A factor of 2, 4, 10?

Depends, the benefit is obviously proportional to how many shards you
have and how often you delete :-)
Remember that with NRT we can do writes relatively quickly, but a
delete will always require a disk sync. A disk sync is a very strong
barrier of course, so we're more likely in the area of 3 to 4 orders
of magnitude for a delete vs. a write.
Of course we would still have a delete, but on fewer indexes. You could
think then that the cost is just a factor of how many shards, but
consider also that the above cost is not actually paid for during the
delete flush, but at query time: the query will trigger a
pre-execution flush.
So there are cases in which I might be sending deletes on index A, and
running queries on index B, with the queries on B being *much* faster
because no sync is happening there.
On top of this you have to account for FieldCaches to be invalidated
because the index is dirty, or FullTextFilters which need to be
re-computed unnecessarily.

It can all pile up, interacting with each other, making quite an ugly
difference. Granted, I'd have to make a very unrealistic test to
highlight it, so let's keep it to the theory, but I think that you can
easily appreciate how it can make a significant difference in complex
real world applications.

Cool idea:
Considering all this, I guess an interesting use case is to use a
Strategy which always returns an empty set for deletions. Some people
like to reindex at night to keep good performance over the day and
disable our event listeners, to run the MassIndexer overnight. I guess
a nice tradeoff would be to just skip delete work on the index:
Hibernate Search won't return non-existing matches anyway, and the
garbage would be cleaned up with the MassIndexer run overnight, but at
least you would still have updates applied in real time. Might need
some validation, but considering the IO cost of deletions (and
indirect cost on filters and caches) I would seriously have considered
such an approach as a user.
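
As a sketch, such a strategy is almost trivial. DeletionShardSelector is
a hypothetical one-method stand-in for the SPI (the real interface has
more to it), and whether the engine copes well with an empty set is
exactly the part that still needs validation.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Set;

// Hypothetical minimal SPI, reduced to the one method relevant here.
interface DeletionShardSelector {
    Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Serializable id, String idAsString);
}

// Targets no shard at all for deletions, so no delete work reaches any index.
// Stale entries stay in the index until the next full rebuild, for example a
// nightly MassIndexer run.
final class SkipDeletionsSelector implements DeletionShardSelector {
    @Override
    public Set<String> getShardIdentifiersForDeletion(Class<?> entityType, Serializable id, String idAsString) {
        return Collections.emptySet();
    }
}
```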

 --Sanne