Mircea and I concluded that it is worth keeping the current pull state
approach and bolting proper tx log draining onto it, unless this
solution becomes more complex than the original push state approach,
which already serialized state sending and tx log draining.
To summarize, during leave rehash we need state senders to drain the tx
log (InvertedLeaveTask#processAndDrainTxLog) *after* all state receivers
have transferred the state. As things stand right now
(InvertedLeaveTask#performRehash), tx log draining is interleaved with
state transfer, leading to the problems described in the above-mentioned
JIRA.
The solution I have in mind is to introduce a
Map<Integer,CountDownLatch> in DistributedManagerImpl. The keys in this
map will be the view ids of the leave rehash, while each CountDownLatch
will be initialized to the number of state receivers. As state receivers
pick up state, we countDown on the latch. The state provider awaits on
the latch for a given view id, with a timeout. When await returns, it
drains the tx log.
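Roughly, the latch bookkeeping could look like the sketch below. The
class and method names (LeaveRehashLatches, rehashStarted,
stateReceived, awaitReceivers) are made up for illustration and are not
existing code; in the proposal the map would sit in
DistributedManagerImpl. Treating a missing latch as "nothing to wait
for" is also just an assumption of the sketch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the proposal would keep this map in DistributedManagerImpl.
class LeaveRehashLatches {

   // View id of the leave rehash -> latch counting receivers that still have to pick up state.
   private final ConcurrentMap<Integer, CountDownLatch> latches =
         new ConcurrentHashMap<Integer, CountDownLatch>();

   // State provider: register a latch when the leave rehash for this view starts.
   void rehashStarted(int viewId, int numStateReceivers) {
      latches.putIfAbsent(viewId, new CountDownLatch(numStateReceivers));
   }

   // Invoked once per state receiver after it has picked up its state.
   void stateReceived(int viewId) {
      CountDownLatch latch = latches.get(viewId);
      if (latch != null) {
         latch.countDown();
      }
   }

   // State provider: block until all receivers are done or the timeout expires.
   // Returns true if every receiver finished, false on timeout or interruption.
   // The latch is removed either way, so late countDowns become no-ops.
   boolean awaitReceivers(int viewId, long timeout, TimeUnit unit) {
      CountDownLatch latch = latches.get(viewId);
      try {
         // Assumption: no registered latch means there is nothing to wait for.
         return latch == null || latch.await(timeout, unit);
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt(); // preserve the interrupt for the caller
         return false;
      } finally {
         latches.remove(viewId);
      }
   }
}
```

On the provider side, the rehash task would call awaitReceivers for the
view id and only then drain the tx log. What to do when the await times
out (drain anyway, or abort the rehash) is left open here.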
Let me know what you think.
Regards,
Vladimir