I ended up implementing a solution where the first chunk of data for any given partition is always indexed with UpdateLuceneWorks, because we don't know whether we're restarting or not: update works overwrite any existing document, so re-processing a chunk after a restart can't create duplicates, whereas add works would. So for instance with checkpointInterval=200 and rowsPerPartition=1000, for any given partition the first 200 elements will be indexed with UpdateLuceneWorks, while the remaining 800 will use AddLuceneWorks. That's a bit more efficient than just using UpdateLuceneWorks everywhere, and it gets more efficient as the checkpointInterval/rowsPerPartition ratio decreases. This is obviously not optimal, but for now it's the only way we can make failure recovery work reliably.
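To make the chunking logic concrete, here is a minimal sketch of the idea under stated assumptions: the PartitionIndexer class and its indexAsUpdate/indexAsAdd methods are hypothetical stand-ins, not the actual Hibernate Search batch integration code. It only shows how the first chunk after any (re)start of a partition goes through the idempotent update path, while every later chunk uses the cheaper add path.

import java.util.List;

/**
 * Illustrative sketch only: hypothetical names, not the real
 * Hibernate Search JSR-352 integration code.
 */
public class PartitionIndexer {

    private final int checkpointInterval; // e.g. 200
    private boolean firstChunk = true;    // reset whenever the partition (re)starts

    public PartitionIndexer(int checkpointInterval) {
        this.checkpointInterval = checkpointInterval;
    }

    /** Called once per chunk of at most checkpointInterval rows. */
    public void writeChunk(List<Object> entities) {
        if (firstChunk) {
            // We may be restarting after a failure: this chunk may already
            // have been indexed, so use idempotent update works
            // (delete-then-add) to avoid duplicate documents.
            for (Object entity : entities) {
                indexAsUpdate(entity);
            }
            firstChunk = false;
        } else {
            // Past the first checkpoint we know these rows were never
            // indexed in this run, so plain add works are safe and cheaper.
            for (Object entity : entities) {
                indexAsAdd(entity);
            }
        }
    }

    private void indexAsUpdate(Object entity) {
        // Hypothetical: would produce an UpdateLuceneWork for the entity.
        System.out.println("UpdateLuceneWork for " + entity);
    }

    private void indexAsAdd(Object entity) {
        // Hypothetical: would produce an AddLuceneWork for the entity.
        System.out.println("AddLuceneWork for " + entity);
    }
}

With checkpointInterval=200 and rowsPerPartition=1000, the runtime would call writeChunk five times per partition on a clean run: the first call (200 rows) produces update works, the remaining four (800 rows) produce add works, matching the example above.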
HSEARCH-2728 may allow us to implement a more efficient solution, but that will require more work, and in particular some changes in the engine.