... if we passed an offset and a limit to partitions, instead of passing a first and last ID, we would solve lots of practical issues we're having... (partitioning for HQL/Criteria, support for embedded IDs, ...). But maybe there was a particular reason to do it that way?
Using "an offset + a limit" for partitions cannot guarantee correct coverage of the indexing. If the database changes after the job execution starts, e.g. through an insertion or a deletion, the partitions will target the wrong ranges and lead to missing or duplicate data. Here's an example: we want to index 3,000 entities with IDs from 1 to 3000 using 3 partitions (1,000 rows per partition). The offsets are respectively 0, 1000, and 2000, and the limit of each partition is 1000 rows. Here are some of the cases:
- If everything goes well, each partition will index its rows correctly:
  - the 1st partition will index the range [1, 1000]
  - the 2nd partition will index the range [1001, 2000]
  - the 3rd partition will index the range [2001, 3000]
- If the row with ID=500 is deleted after the 2nd partition has run its query but before the 1st partition runs its own (each partition queries the database independently, so they may observe it in different states):
  - the 1st partition will index the range [1, 1001] (skipping the deleted ID=500), because its 1000-row limit now reaches ID=1001
  - the 2nd partition will still index the range [1001, 2000], so ID=1001 is indexed twice: one duplicate
  - the 3rd partition will index the range [2001, 3000]
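The deletion case above can be simulated with a minimal sketch (not Hibernate Search code). The `page` helper below is a hypothetical stand-in for a `SELECT ... ORDER BY id LIMIT :limit OFFSET :offset` query; the point is only that the two partitions observe the table at different moments:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class OffsetDriftDelete {

    // Hypothetical stand-in for "SELECT id ORDER BY id LIMIT :limit OFFSET :offset".
    static List<Integer> page(List<Integer> table, int offset, int limit) {
        return table.subList(Math.min(offset, table.size()),
                             Math.min(offset + limit, table.size()));
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>();
        IntStream.rangeClosed(1, 3000).forEach(table::add);

        // The 2nd partition reads first, before any change: IDs [1001, 2000].
        List<Integer> p2 = new ArrayList<>(page(table, 1000, 1000));

        // Row ID=500 is deleted, then the 1st partition reads:
        // its 1000-row window now stretches to ID=1001.
        table.remove(Integer.valueOf(500));
        List<Integer> p1 = new ArrayList<>(page(table, 0, 1000));

        // ID 1001 appears in both partitions: one duplicate.
        System.out.println("p1 last ID = " + p1.get(p1.size() - 1));
        System.out.println("p2 first ID = " + p2.get(0));
    }
}
```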
- If a row with ID=500A is inserted (suppose that's possible, e.g. with string IDs where 500A sorts between 500 and 501) after the partition plan was computed:
  - the 1st partition will index the range [1, 999] plus 500A, because 500A takes one place and the 1000-row limit now ends at ID=999
  - the 2nd partition will index the range [1000, 1999]
  - the 3rd partition will index the range [2000, 2999]; the entity with ID=3000 is never indexed: missing data
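The insertion case can be sketched the same way (again, not Hibernate Search code). Here the IDs are zero-padded strings so that lexicographic order matches numeric order, which lets the hypothetical "0500A" sort between "0500" and "0501"; the `page` helper again stands in for a `LIMIT`/`OFFSET` query:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class OffsetDriftInsert {

    // Hypothetical stand-in for "SELECT id ORDER BY id LIMIT :limit OFFSET :offset".
    static List<String> page(List<String> table, int offset, int limit) {
        return table.subList(Math.min(offset, table.size()),
                             Math.min(offset + limit, table.size()));
    }

    public static void main(String[] args) {
        List<String> table = new ArrayList<>();
        IntStream.rangeClosed(1, 3000)
                 .forEach(i -> table.add(String.format("%04d", i)));

        // "0500A" is inserted after the 3-partition plan
        // (offsets 0/1000/2000, limit 1000) was computed.
        table.add(table.indexOf("0500") + 1, "0500A");

        List<String> p1 = page(table, 0, 1000);     // ends at "0999"
        List<String> p2 = page(table, 1000, 1000);  // "1000".."1999"
        List<String> p3 = page(table, 2000, 1000);  // "2000".."2999"

        // 3001 rows exist, but the plan only covers 3000:
        // ID "3000" is in no partition.
        boolean missing = !p1.contains("3000") && !p2.contains("3000")
                && !p3.contains("3000");
        System.out.println("ID 3000 missing = " + missing);
        System.out.println("p3 last ID = " + p3.get(p3.size() - 1));
    }
}
```

This is why partitioning by a first and last ID is safer: the boundaries are fixed values, so a concurrent insertion or deletion cannot shift rows from one partition's window into another's.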