To sum up, the problem here is that when a partition fails after having already written some documents to the index, the re-execution of this partition may add those documents a second time. This happens because we use AddLuceneWork (which is fine in itself, since it's more efficient) instead of UpdateLuceneWork.

In the original ticket on GitHub, Sanne Grinovero proposed several options (Sanne Grinovero, please correct me if I'm wrong):

1. Control the flush operation: since we control the merging thread, there should be ways to strictly prevent flushing in the background. However, this could lead to OutOfMemory problems with the Lucene buffers. Flushing also seems to me to be a potential source of errors, so this wouldn't cover the case where the flush itself fails.
2. Use the Lucene checkpoint API to create a safe checkpoint to return to. This would be Lucene-specific, and probably not doable with Elasticsearch. Not an option in my opinion.
3. Avoid passing lists of IDs between the job steps; pass the individual tasks instead. That one I must admit I don't understand... Even if we passed the tasks between the steps (read entities => process entities => write documents), we'd still face the same issue of guessing whether we should add or update a given document.

I wonder, though... Couldn't we just detect, when executing a partition, whether this partition has already been attempted, and if so, switch to a less efficient but safer "update" mode? In StepContext, we have a way to register persistent data: StepContext.setPersistentUserData. I had a look at how this is persisted, and it seems both JBeret and JBatch persist this data even when the partition fails (though the spec doesn't seem to provide any explicit guarantee).

We could try the following:
- In EntityReader.open (which is, crucially, the very first thing we run in a partition), before doing anything else, check the StepContext transient data. If it's null, initialize the data:
  - create the transient data;
  - if the StepContext persistent data is null, create it with a status flag set to "FIRST_RUN";
  - if the StepContext persistent data is not null, update its status to "RERUN".
- In LuceneDocumentProducer, whenever we create a work, check the StepContext persistent data:
  - if the status flag is set to "FIRST_RUN", create an AddLuceneWork;
  - if the status flag is set to "RERUN", create an UpdateLuceneWork.
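The idea above could be sketched roughly as follows. This is only an illustration, not the actual Hibernate Search code: `StepContext` here is a hypothetical stand-in reduced to the two methods we'd rely on (the real one is `javax.batch.runtime.context.StepContext`), and `PartitionStatus` and `PartitionAwareWorkPlanner` are made-up names for the persistent flag and the component that would pick the work type.

```java
import java.io.Serializable;

// Hypothetical stand-in for javax.batch.runtime.context.StepContext,
// reduced to the two methods this sketch needs.
interface StepContext {
    Serializable getPersistentUserData();
    void setPersistentUserData(Serializable data);
}

// Hypothetical persistent status flag stored per partition.
enum PartitionStatus implements Serializable { FIRST_RUN, RERUN }

class PartitionAwareWorkPlanner {
    private final StepContext stepContext;

    PartitionAwareWorkPlanner(StepContext stepContext) {
        this.stepContext = stepContext;
    }

    // Would be called from EntityReader.open(), before anything else:
    // sets or flips the persistent flag so a re-executed partition
    // can be detected on the next attempt.
    void markPartitionStart() {
        if (stepContext.getPersistentUserData() == null) {
            stepContext.setPersistentUserData(PartitionStatus.FIRST_RUN);
        } else {
            stepContext.setPersistentUserData(PartitionStatus.RERUN);
        }
    }

    // Would be called from LuceneDocumentProducer for each entity:
    // "add" is safe only on the first run; a rerun must "update" so that
    // documents written before the failure are not indexed twice.
    String workTypeFor(Object entityId) {
        return stepContext.getPersistentUserData() == PartitionStatus.FIRST_RUN
                ? "AddLuceneWork"
                : "UpdateLuceneWork";
    }
}
```

On a first execution the flag ends up as FIRST_RUN and we keep the efficient add path; on a restart the persisted flag is still there, so we switch every work for that partition to the safer update path.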
I think it's worth a shot. But first, obviously, we'll need to create a failing test case demonstrating the issue.