So. After a bit of investigation, the problem was actually that we performed multiple purges, one after another, which targeted the same index (since they targeted classes in the same inheritance tree), but not the same types. Several separate things combined make this fail. I numbered each fact for easier reference below.
1. Even with refreshAfterWrite enabled, we don't refresh after each individual write, but only after each changeset.
2. When someone calls fullTextSession.purge, the fullTextSession will create one work per concrete type, not just one work for the given type...
3. ... and we do the same in the Elasticsearch integration when a purge is requested. As a result, calling fullTextSession.purge on classes A, B and C (with C extends B extends A) will execute the following works: purge(A, B, C), purge(B, C), purge(C), instead of simply executing purge(A, B, C), or executing purge(A), purge(B) and purge(C). We effectively purge each type multiple times.
4. DeleteByQuery (used for purges on Elasticsearch) is a two-step operation: it first performs a search to retrieve the set of documents to delete, then performs the actual deletion, failing if the (internal) version number retrieved when searching doesn't match the version number found when deleting. As a result, using DeleteByQuery on an outdated (not refreshed) index may fail, even if the document was simply deleted in the meantime...
5. When multiple purges are requested in a transaction, their relative order may change randomly, because works are first stored in a HashMap (whose iteration order may change upon inserts) and are transferred from this map to a list before we execute them (see org.hibernate.search.engine.impl.WorkPlan.getPlannedLuceneWork()).
6. Since https://github.com/hibernate/hibernate-search/commit/0ad3bab4450da36005dc801f4a046bb27fc1d57a, when one work in a changeset (list of works) fails, we don't execute the remaining works.
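The duplicated works can be illustrated with a small, self-contained sketch (purely illustrative: the class names and the expansion map are made up, mimicking the concrete-type resolution described above; this is not Hibernate Search code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the purge expansion: purging each class in a
// hierarchy, when every purge is itself expanded to all concrete subtypes,
// targets the same types several times.
public class PurgeExpansion {

    // Maps each class to its concrete subtypes (including itself),
    // mimicking what both the ORM and Elasticsearch layers do.
    static final Map<String, List<String>> CONCRETE_SUBTYPES = Map.of(
            "A", List.of("A", "B", "C"), // C extends B extends A
            "B", List.of("B", "C"),
            "C", List.of("C"));

    static List<String> plannedPurgeWorks(List<String> purgedClasses) {
        List<String> works = new ArrayList<>();
        for (String clazz : purgedClasses) {
            works.addAll(CONCRETE_SUBTYPES.get(clazz));
        }
        return works;
    }

    public static void main(String[] args) {
        // Purging A, B and C yields 6 works instead of 3:
        // C is purged three times, B twice.
        System.out.println(plannedPurgeWorks(List.of("A", "B", "C")));
        // prints [A, B, C, B, C, C]
    }
}
```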
All of this combined will make the test fail:
- Because of 1, 2, 3 and 4, executing multiple purges in the same changeset on the same class hierarchy may fail (semi-randomly, because when we're lucky, Elasticsearch may happen to execute an automatic refresh between two purges)
- Because of 5 and 6, the test failure is random (if purge(A, B, C) happens to be executed first, then everything will be purged, and even though the background thread will log a failure, the test will pass)
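The order dependence can be sketched as a tiny simulation (not actual Hibernate Search code; the conflict rule is a deliberate simplification of DeleteByQuery's version check):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal model: the index is a set of document types; a purge work behaves
// like DeleteByQuery on an unrefreshed index, failing as soon as one of its
// targets was already deleted earlier in the same changeset; and, per fact 6,
// the first failure stops the remaining works.
public class ChangesetSimulation {

    static Set<String> execute(List<Set<String>> works, Set<String> index) {
        for (Set<String> targets : works) {
            // Stale delete-by-query: a version conflict occurs if a target
            // was already removed without a refresh in between.
            if (!index.containsAll(targets)) {
                break; // first failure stops the whole changeset
            }
            index.removeAll(targets);
        }
        return index; // whatever survived the changeset
    }

    public static void main(String[] args) {
        // Lucky order: purge(A, B, C) runs first, the index ends up empty,
        // and the test passes even though a failure is logged.
        System.out.println(execute(
                List.of(Set.of("A", "B", "C"), Set.of("B", "C"), Set.of("C")),
                new HashSet<>(Set.of("A", "B", "C")))); // prints []

        // Unlucky order: purge(C) runs first, purge(B, C) then conflicts,
        // the changeset stops, and A and B survive: the test fails.
        System.out.println(execute(
                List.of(Set.of("C"), Set.of("B", "C"), Set.of("A", "B", "C")),
                new HashSet<>(Set.of("A", "B", "C"))));
    }
}
```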
I believe there are multiple problems here. Fixing just one may be enough to make the test pass again, but I'd like to fix as many as possible... To be precise:
- Issue 1 should at least be mitigated. I see two (non-exclusive) approaches:
  - When refreshAfterWrite is enabled, I think we should just refresh after each write operation. Most of the time this won't change anything, because most of the time we execute a single bulked work. When executing multiple bulked works, the impact on performance should be small (one refresh every 250 works, and we could make this a configuration property).
  - When refreshAfterWrite is not enabled, I think we should force a refresh just before each DeleteByQuery, so that a purge between two additions to the index will work fine. DeleteByQuery is not something we execute often, so the cost of the extra refresh should be negligible.
- Issue 3 (concrete types are resolved both in the ORM integration and Elasticsearch integration) should be fixed. We should not resolve concrete types in the Elasticsearch integration.
- Issue 5 (the order of works within a transaction is not deterministic) should, in my opinion, be fixed. This kind of behavior is simply hell to debug. Using LinkedHashMaps instead of HashMaps will solve the issue easily, and I doubt the impact on performance will be dramatic. At the very least we should do it at the class level (we rarely index hundreds of different classes in a single transaction, so this should really not change much performance-wise).
- Issue 6 could be fixed too, even though it's not absolutely necessary. If we consider that the order of works doesn't matter (see issue 5), then stopping the execution of a changeset at the first failure doesn't make much sense: the subsequent works in the list should not be considered as depending on the failing one (it may even be the reverse: the failing one may have depended on works later in the list).
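The LinkedHashMap fix for issue 5 can be sketched as follows (illustrative only: WorkOrdering and iterationOrder are made-up names, not Hibernate Search API; the real change would be inside WorkPlan):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demonstrates why swapping HashMap for LinkedHashMap makes the order of
// works deterministic: LinkedHashMap iterates in insertion order, whereas
// HashMap's iteration order is unspecified and may change as entries are
// added.
public class WorkOrdering {

    static List<String> iterationOrder(Map<String, String> worksByClass) {
        return new ArrayList<>(worksByClass.keySet());
    }

    public static void main(String[] args) {
        List<String> insertionOrder = List.of("EntityC", "EntityA", "EntityB");

        Map<String, String> linked = new LinkedHashMap<>();
        for (String key : insertionOrder) {
            linked.put(key, "purge work for " + key);
        }
        // LinkedHashMap guarantees insertion order:
        System.out.println(iterationOrder(linked)); // prints [EntityC, EntityA, EntityB]

        Map<String, String> hashed = new HashMap<>();
        for (String key : insertionOrder) {
            hashed.put(key, "purge work for " + key);
        }
        // HashMap iteration order is unspecified and need not match the
        // insertion order above.
        System.out.println(iterationOrder(hashed));
    }
}
```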
Re-qualifying as a bug (the problem was already there in previous releases, even though it may simply have resulted in errors being logged) and changing the title to better reflect the issue.