After a bit of discussion, it's clear that the solution I proposed won't always work. We cannot rely on the persisted batch state to determine what has been written into the index, simply because a failure can occur between the moment we write into the index and the moment we persist the batch state. So, Sanne Grinovero's solution of having a hidden field identifying the "work unit" that added a given document to the index seems to be the only workable one. Here is the todo list:
- We have to decide how we build the "work unit identifier".
- We have multiple checkpoints within a partition, so using the partition ID as the discriminator is wrong; it should be something like a "checkpoint ID". I don't know if such a thing exists, but in the worst case I guess we could combine the partition ID with some checkpoint ID (by storing the number of successful checkpoints in the partition's persistent data).
- We must avoid at all costs having our delete queries delete out-of-scope documents. This means the discriminator must also include something that identifies our job instance uniquely, while still being the same after restarting the job. Fortunately, such an ID is available through JobContext.getInstanceId(), which is guaranteed by the spec to be "a globally unique value within the job repository".
- We have to implement this hidden "work unit identifier" field: offer ways to define its value (when building an AddLuceneWork/UpdateLuceneWork? or when building a document?), and handle its addition to indexed documents (in Lucene and ES). Note: I'd rather do this in a separate ticket, and maybe even merge it to master instead of the jsr352 branch, because it's an important change and shouldn't be drowned in the many commits related to JSR-352 integration.
- We should implement these delete queries; they should run before we write anything, at each checkpoint.
- Maybe we should provide a way for users to opt out of this hidden field, in which case we'd either not support restarting a job (a dubious choice in my opinion, since recovery is the main strength of JSR-352) or always use UpdateLuceneWork (which would probably be very slow).
- I don't know anything about the performance of delete queries. I guess the number of items between two checkpoints must be large in order to make delete queries efficient; let's hope it doesn't need to be so large that checkpoints won't make sense anymore (e.g. 10 000 is already a lot, IMHO). This obviously will require some testing.
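To make the idea above concrete, here is a minimal sketch of how the work unit identifier could be built and how the delete-before-write step would behave on restart. Everything here is hypothetical: `buildWorkUnitId`, the `/`-separated format, and the in-memory map standing in for the index are illustrations, not an existing Hibernate Search API. In the real implementation the delete step would be a Lucene/Elasticsearch delete-by-term query on the hidden field.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class WorkUnitRecoverySketch {

    /**
     * Unique per (job instance, partition, checkpoint), and stable across
     * restarts of the same job instance.
     *
     * @param jobInstanceId   from JobContext.getInstanceId(); globally unique
     *                        within the job repository
     * @param partitionId     the partition writing the documents
     * @param checkpointCount number of successful checkpoints so far in this
     *                        partition (stored in its persistent data)
     */
    static String buildWorkUnitId(long jobInstanceId, int partitionId,
            int checkpointCount) {
        return jobInstanceId + "/" + partitionId + "/" + checkpointCount;
    }

    // Stand-in for the index: entity ID -> value of the hidden field.
    static final Map<String, String> index = new HashMap<>();

    /**
     * On (re)starting a chunk: first delete anything a failed previous
     * attempt of the same work unit may have written, then write.
     */
    static void processChunk(String workUnitId, List<String> entityIds) {
        // "Delete query" on the hidden field: removes only documents whose
        // hidden field matches this exact work unit, so documents written by
        // other job instances, partitions, or checkpoints are out of scope.
        index.values().removeIf(workUnitId::equals);
        for (String id : entityIds) {
            index.put(id, workUnitId);
        }
    }

    public static void main(String[] args) {
        String wu = buildWorkUnitId(42L, 3, 7);
        // First attempt writes two documents, then fails before the batch
        // state is persisted; the retry of the same checkpoint re-runs the
        // whole chunk without leaving leftovers from the failed attempt.
        processChunk(wu, List.of("book:1", "book:2"));
        processChunk(wu, List.of("book:1", "book:2", "book:3"));
        System.out.println(index.size()); // 3
    }
}
```

Note the design constraint this illustrates: the identifier must change between checkpoints (otherwise a delete would wipe previously committed chunks) but must not change across restarts of the same checkpoint (otherwise the leftovers of the failed attempt would survive).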