[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on JGRP-1449 at 6/6/12 7:35 AM:
--------------------------------------------------------
David,
I hope this last fix now finally puts JGRP-1449 to rest! :-)
As you noticed, I put quite a lot of time into improving SEQUENCER, although it is
currently not really used by any of our customers. The main reason is that we're
looking into using total ordering in our Infinispan project, to disseminate changes and
perform state transfer in a *collision-free* way. That was a good reason to brush the dust
off of SEQUENCER and improve it.
Your testing of SEQUENCER was certainly also important for finding all of those edge cases!
Assuming that you use or are going to use SEQUENCER in production, I think it would be
great if you could provide unit tests for the edge cases / issues you run into. This would
help us, and it would certainly also help you by getting a more robust and correct
SEQUENCER implementation, catching regressions early on.
I recently started using Byteman (byteman.org) to write unit tests for scenarios
that would otherwise require code changes to the protocols, such as SEQUENCER. With
Byteman, I don't have to change the code; instead, I use BM's ability to change code
on the fly, using an agent.
For example, for this issue, I created
SequencerFailoverTest.testResendingVersusNewMessages(), which uses the BM script
testResendingVersusNewmessages.btm.
The script intercepts the view change and sends new messages, and the test ensures that
those new messages are delivered *after* the messages in the forward-table.
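The ordering property that the test checks can be illustrated with a small, single-threaded model. This is only a sketch under stated assumptions: the class ForwardTableSketch and its methods (send, ack, startResend, finishResend) are hypothetical names invented for illustration, not JGroups API, and holding back new messages during the resend is one possible way to get the ordering, not necessarily how SEQUENCER does it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.TreeMap;

/** Illustrative model only; class and method names are hypothetical, not JGroups API. */
final class ForwardTableSketch {
    // Messages forwarded to the coordinator but not yet acknowledged, keyed by seqno.
    private final TreeMap<Long, String> forwardTable = new TreeMap<>();
    // New messages submitted while a resend is in progress.
    private final Queue<String> held = new ArrayDeque<>();
    // Everything put on the wire, in order (stands in for the network).
    private final List<String> wire = new ArrayList<>();
    private long nextSeqno = 1;
    private boolean resending;

    /** Sends a message, or holds it back while the forward table is being resent. */
    synchronized void send(String msg) {
        if (resending) {
            held.add(msg);
            return;
        }
        forwardTable.put(nextSeqno++, msg);
        wire.add(msg);
    }

    /** The coordinator has broadcast the message with this seqno; forget it. */
    synchronized void ack(long seqno) {
        forwardTable.remove(seqno);
    }

    /** View change with a new coordinator: resend unacknowledged messages in seqno order. */
    synchronized void startResend() {
        resending = true;
        wire.addAll(forwardTable.values());
    }

    /** Resend finished: release held messages, which therefore follow the resent ones. */
    synchronized void finishResend() {
        resending = false;
        String msg;
        while ((msg = held.poll()) != null) {
            send(msg);
        }
    }

    /** Returns a copy of everything sent so far, in send order. */
    synchronized List<String> wire() {
        return new ArrayList<>(wire);
    }

    public static void main(String[] args) {
        ForwardTableSketch s = new ForwardTableSketch();
        s.send("m1");
        s.send("m2");
        s.ack(1);          // m1 was delivered before the coordinator changed
        s.startResend();   // view change: m2 is still unacknowledged and is resent
        s.send("m3");      // new message arrives mid-resend and is held back
        s.finishResend();  // m3 goes out only now, after the resent m2
        System.out.println(s.wire());
    }
}
```

In this model the new message m3 always reaches the wire after the resent m2, which is the same ordering the Byteman-driven test asserts against the real protocol.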
I think the scenario you describe in your comment dated June 5 12:16 would be an excellent
candidate for such a BM-based unit test.
SEQUENCER race leads to lost messages
-------------------------------------
Key: JGRP-1449
URL: https://issues.jboss.org/browse/JGRP-1449
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.1
I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:
2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
It looks as though there's some sort of race where:
- a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER
hasn't yet seen the view change
- meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view
change, and so SEQUENCER there forwards messages to the new coordinator
- but because the new coordinator doesn't yet know that it is coordinator, we hit
the problem above.
The messages don't ever get retransmitted; so they're simply lost.
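The race can be reproduced in a few lines with a toy model of the receiving member. This is a sketch under assumptions, not the SEQUENCER source: CoordinatorRaceSketch, onForwardRequest and onViewChange are hypothetical names, "A" and "B" stand in for the real member names, and parking early requests is shown only as one conceivable mitigation (retransmission by the sender is another), not as the fix applied for this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; names are hypothetical and this is not the SEQUENCER source. */
final class CoordinatorRaceSketch {
    private final String self;
    private final boolean parkEarlyRequests;
    private String coordinator;
    private final List<String> parked = new ArrayList<>();
    private final List<String> broadcast = new ArrayList<>();
    private int dropped;

    CoordinatorRaceSketch(String self, String initialCoordinator, boolean parkEarlyRequests) {
        this.self = self;
        this.coordinator = initialCoordinator;
        this.parkEarlyRequests = parkEarlyRequests;
    }

    /** A FORWARD request arrives from another member. */
    void onForwardRequest(String msg) {
        if (self.equals(coordinator)) {
            broadcast.add(msg);   // we know we are coordinator: sequence and broadcast
        } else if (parkEarlyRequests) {
            parked.add(msg);      // possible mitigation: keep it until the view arrives
        } else {
            dropped++;            // behaviour from the report: "non-coord; dropping FORWARD request"
        }
    }

    /** The view change finally reaches this member. */
    void onViewChange(String newCoordinator) {
        coordinator = newCoordinator;
        if (self.equals(coordinator)) {
            broadcast.addAll(parked);
            parked.clear();
        }
    }

    List<String> broadcast() { return new ArrayList<>(broadcast); }

    int dropped() { return dropped; }

    public static void main(String[] args) {
        for (boolean park : new boolean[] {false, true}) {
            // "B" is about to become coordinator but still believes "A" is.
            CoordinatorRaceSketch b = new CoordinatorRaceSketch("B", "A", park);
            b.onForwardRequest("ClusterMgmtMsg"); // sender already saw the new view
            b.onViewChange("B");                  // view arrives a few milliseconds later
            System.out.println("park=" + park + " broadcast=" + b.broadcast()
                    + " dropped=" + b.dropped());
        }
    }
}
```

With parking disabled the request is counted as dropped and never broadcast, matching the log line above; with parking enabled the same interleaving ends with the message broadcast once the view is installed.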
Here's some trace from the member who drops the message, with the line from my
application showing that he does indeed become coordinator a few milliseconds later:
2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
And here's trace from the other end, showing that message being broadcast in the new
view.
2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO
c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg
I can try to get lower level trace from JGroups if that would help.
I'm using the same stack as in JGRP-1443.