[
https://issues.jboss.org/browse/JGRP-1742?page=com.atlassian.jira.plugin....
]
Bela Ban edited comment on JGRP-1742 at 12/4/13 9:26 AM:
---------------------------------------------------------
OK, so the following things might solve this puzzle:
h5. Coordinator blocked during fetching of digest
* Since BARRIER is closed during a state transfer, the coordinator will not only drop
messages from other members (except from members for which holes were punched into
BARRIER), but also *from itself*
* Currently, the only message that causes problems when dropped is a VIEW change
multicast
* SOLUTION: when multicasting a view V, *have the coord install V locally before
multicasting it*. When the coord later receives its own multicast of V, V is dropped because it is already installed
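A minimal sketch of the install-before-multicast idea. The class and method names ({{ViewCoordinatorSketch}}, {{installAndMulticast}}, {{handleView}}) are hypothetical and not taken from the JGroups code base; views are reduced to a numeric id and the multicast is replaced by a log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a coordinator; NOT the actual JGroups GMS code.
class ViewCoordinatorSketch {
    private long installedViewId = 0;                   // id of the view installed locally
    final List<Long> multicastLog = new ArrayList<>();  // placeholder for real multicasts

    /** Installs the view locally first, then multicasts it. */
    void installAndMulticast(long viewId) {
        handleView(viewId);        // does not depend on our own multicast getting through BARRIER
        multicastLog.add(viewId);  // stands in for sending the view to the cluster
    }

    /** Returns true if the view was installed, false if dropped as already installed (or older). */
    boolean handleView(long viewId) {
        if (viewId <= installedViewId)
            return false;
        installedViewId = viewId;
        return true;
    }

    long installedViewId() {
        return installedViewId;
    }
}
```

With this ordering, it no longer matters whether the coord's own copy of the multicast is dropped by a closed BARRIER or arrives later: in both cases V is already installed.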
h5. BARRIER skips threads in BLOCKED or WAITING state
* Don't skip these, as a blocked thread might simply block on a lock, before changing
state (e.g. Infinispan)
* If message P:10 was blocked, and we skipped it when fetching the digest, we'd
include P:10 in the digest, but not in the state. This would mean that the state requester
will never get P:10, neither as part of the state, nor as a retransmission from P
h5. Flushing of threads in BARRIER should time out
* We cannot wait forever for the threads to complete
* The timeout passed to {{getState(timeout)}} should be used to bound the max duration for
flushing the threads. If we run into a timeout, either at the state requester or the
provider, the state transfer (the {{getState()}} call) will fail
* A timeout of 0 means wait forever
* Closing the channel should terminate the flush
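The timeout rules above could be sketched as follows. This is only an illustration under assumptions: {{BarrierFlushSketch}} and its methods are hypothetical names, and the real BARRIER tracks in-flight threads differently.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a bounded flush; NOT the actual BARRIER implementation.
class BarrierFlushSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private int inFlight;    // threads currently delivering messages to the application
    private boolean closed;  // set when the channel is closed

    void threadEntered() {
        lock.lock();
        try { inFlight++; } finally { lock.unlock(); }
    }

    void threadLeft() {
        lock.lock();
        try { inFlight--; changed.signalAll(); } finally { lock.unlock(); }
    }

    void channelClosed() {
        lock.lock();
        try { closed = true; changed.signalAll(); } finally { lock.unlock(); }
    }

    /**
     * Waits until no thread is in flight. A timeout of 0 means wait forever.
     * Returns false on timeout, on interruption, or if the channel was closed
     * while threads were still in flight.
     */
    boolean flush(long timeoutMs) {
        lock.lock();
        try {
            long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (inFlight > 0 && !closed) {
                if (timeoutMs == 0) {
                    changed.await();
                } else {
                    if (remainingNanos <= 0)
                        return false;
                    remainingNanos = changed.awaitNanos(remainingNanos);
                }
            }
            return inFlight == 0;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

A caller would treat a {{false}} result as a failed state transfer, which matches the behavior described for {{getState(timeout)}}.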
h5. Things not tackled
* While flushing of threads might succeed in BARRIER, if the application has its own
thread pool (e.g. using the _Asynchronous Invocation API_) to handle requests, then
flushing will return quickly
* However, this is not a guarantee that all incoming threads have completed their changes
to the application state
* A possible solution might be to call the {{block()}} and {{unblock()}} callbacks in the
application. The former would have to wait until all current threads are done modifying
the application state. The latter would be called when the digest has been fetched and the
application pool can continue making modifications.
** Not very nice, but state transfer should not be used for very large states (taking a
long time) anyway
** This will not be addressed by this JIRA. Perhaps it will be tackled in a later
release.
BARRIER: minimize closing time
------------------------------
Key: JGRP-1742
URL: https://issues.jboss.org/browse/JGRP-1742
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.5
During a state transfer, BARRIER.up() waits until all incoming threads (delivering
messages to the application) are done, and blocks further incoming messages. This is done
to get the digest and the state.
However, during the block, the following messages are not sent up:
* Views!
* STABLE messages, triggering retransmissions
This is bad, so we should try to minimize the time BARRIER is closed. This can be done
with JGRP-1352.
However, we could also do the following:
* A state request is received
* Close BARRIER and flush all pending threads. This ensures that any message which
updated the *digest* also updated the *application state*
* Get the digest D
* *Open* BARRIER. Messages will now be delivered and thus applied to the state
* Get the application state S
* When done, return D and S to the state requester
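The provider-side steps above can be sketched as a toy model. All names here ({{StateProviderSketch}}, {{deliver}}, {{closeBarrier}} and so on) are hypothetical, the application state is reduced to the highest seqno applied per sender, and flushing of pending threads is omitted because the model is driven from a single thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the state provider; NOT the JGroups implementation.
class StateProviderSketch {
    private final Map<String, Long> digest = new HashMap<>();    // sender -> highest delivered seqno
    private final Map<String, Long> appState = new HashMap<>();  // sender -> highest applied seqno
    private boolean barrierOpen = true;

    /** Delivers a message; returns false if it was dropped because BARRIER is closed. */
    synchronized boolean deliver(String sender, long seqno) {
        if (!barrierOpen)
            return false;
        digest.merge(sender, seqno, Long::max);    // the digest is updated ...
        appState.merge(sender, seqno, Long::max);  // ... together with the application state
        return true;
    }

    synchronized void closeBarrier() { barrierOpen = false; }

    synchronized void openBarrier() { barrierOpen = true; }

    /** Copy of the digest D; meant to be called while BARRIER is closed. */
    synchronized Map<String, Long> digest() { return new HashMap<>(digest); }

    /** Copy of the application state S; may be called after BARRIER was reopened. */
    synchronized Map<String, Long> state() { return new HashMap<>(appState); }
}
```

The order of calls is what matters: {{closeBarrier()}}, {{digest()}}, {{openBarrier()}}, then {{state()}}. Anything delivered after the barrier reopens can only make S newer than D, never the other way round.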
The difference from JGRP-1352 is that we don't queue messages during state transfer.
How does this work? It is critical to ensure that all messages which updated the digest D
also updated the state S, or else messages present in D but not in S would not be
retransmitted. However, if there are more messages in S than in D, this is not an issue as
they will be retransmitted again.
Example:
* BARRIER is closed and pending threads are flushed
* Digest D is (only for a given member P) 5, state S is 5 as well
* Now we open BARRIER
* P sends a few more messages (6, 7 and 8)
* The digest is now 8, but the copy we have is still 5
* State S is 8
* We return D=5 and S=8
* The state requester closes BARRIER and sets its digest to 5 and its state to 8
* Since the digest is only 5 for P, the state requester asks P for retransmission of
messages 6, 7 and 8
* Messages 6, 7 and 8 from P are received and applied to the state
* The assumption here is that if messages 6, 7 and 8 are applied twice, the state
doesn't change (idempotency). This should be the case with Infinispan.
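The requester side of this example can be sketched as below, assuming the application state is a key/value map whose updates are plain puts (and therefore idempotent). {{StateRequesterSketch}} and its methods are hypothetical names, only member P is modelled, and messages are assumed to arrive in seqno order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the state requester; NOT the JGroups implementation.
class StateRequesterSketch {
    private long digestForP;                              // highest seqno from P seen as delivered
    private final Map<String, String> state = new HashMap<>();
    private int applied;                                  // messages applied after state installation

    /** Installs digest D (for P only) and state S as received from the provider. */
    void installState(long digest, Map<String, String> snapshot) {
        digestForP = digest;
        state.clear();
        state.putAll(snapshot);
    }

    /** Handles a (re)transmitted message from P; a put with the same value is idempotent. */
    void onMessage(long seqno, String key, String value) {
        if (seqno <= digestForP)
            return;               // already covered by the digest, will not be applied again
        state.put(key, value);    // applying it a second time leaves the state unchanged
        digestForP = seqno;
        applied++;
    }

    Map<String, String> state() { return new HashMap<>(state); }

    long digestForP() { return digestForP; }

    int applied() { return applied; }
}
```

If the installed state already contains the effects of messages 6, 7 and 8 and the digest is 5, replaying 6, 7 and 8 applies them again but leaves the state identical. An update such as incrementing a counter would not have this property, which is why the idempotency assumption matters.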
The advantage of this approach over JGRP-1352 is that we don't need to queue messages
for a long time if the state is large.