[jboss-jira] [JBoss JIRA] (JGRP-1449) SEQUENCER race leads to lost messages

Wed Jun 6 07:37:17 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699303#comment-12699303 ] 

Bela Ban edited comment on JGRP-1449 at 6/6/12 7:35 AM:
--------------------------------------------------------

David,

I hope this last fix finally puts JGRP-1449 to rest! :-)

As you noticed, I put quite a lot of time into improving SEQUENCER, although none of our customers really use it at the moment. The main reason is that we're looking into using total ordering in our Infinispan project, to disseminate changes and perform state transfer in a *collision-free* way. That was a good reason to brush the dust off SEQUENCER and improve it.

Your testing of SEQUENCER was certainly also important in uncovering all of those edge cases!

Assuming that you use or are going to use SEQUENCER in production, I think it would be great if you could provide unit tests for the edge cases / issues you run into. This would help us, and it would certainly also help you: you'd get a more robust and correct SEQUENCER implementation, and regressions would be caught early on.

I recently started using Byteman (byteman.org) to write unit tests for scenarios that would otherwise require code changes to the protocols, such as SEQUENCER. With Byteman, I don't have to change the code; instead I use BM's ability to change code on the fly, using an agent.

For example, for this issue, I created SequencerFailoverTest.testResendingVersusNewMessages(), which uses the BM script testResendingVersusNewmessages.btm.
The script intercepts the view change and sends new messages, and the test ensures that those new messages are delivered *after* the messages in the forward-table.
I think the scenario you describe in your comment dated June 5 12:16 would be an excellent candidate for such a BM-based unit test.
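
To make the ordering property concrete, here is a minimal toy model in plain Java. It is not the JGroups code and not the actual test or BM script: the class ForwardTableSketch, its methods, and the Runnable hook (standing in for the point where a script could inject new messages during the view change) are all hypothetical names chosen for illustration. The sketch assumes that messages sent while the forward-table is being resent are queued and only go out once the resend has finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: none of these names are JGroups APIs.
final class ForwardTableSketch {
    // Messages forwarded to the coordinator but not yet seen back, keyed by a local seqno.
    private final TreeMap<Long, String> forwardTable = new TreeMap<>();
    // New messages that arrive while the forward-table is being resent.
    private final List<String> pending = new ArrayList<>();
    // Everything that was put "on the wire", in order.
    private final List<String> wire = new ArrayList<>();
    private long nextSeqno = 1;
    private boolean resending = false;

    void send(String payload) {
        if (resending) {            // hold new messages back during a resend
            pending.add(payload);
            return;
        }
        forwardTable.put(nextSeqno++, payload);
        wire.add(payload);
    }

    // The message with this seqno came back from the coordinator.
    void delivered(long seqno) {
        forwardTable.remove(seqno);
    }

    // hook runs in the middle of the view change, like an injected rule would.
    void coordinatorChanged(Runnable hook) {
        resending = true;
        hook.run();
        for (String payload : forwardTable.values()) {
            wire.add(payload);      // resend unacknowledged messages in seqno order
        }
        resending = false;
        List<String> queued = new ArrayList<>(pending);
        pending.clear();
        for (String payload : queued) {
            send(payload);          // only now do the new messages go out
        }
    }

    List<String> wire() {
        return new ArrayList<>(wire);
    }

    // Returns what goes on the wire from the coordinator change onwards.
    static List<String> demo() {
        ForwardTableSketch s = new ForwardTableSketch();
        s.send("m1");
        s.send("m2");
        s.send("m3");
        s.delivered(1);             // m1 is acknowledged, m2 and m3 are not
        int before = s.wire().size();
        s.coordinatorChanged(() -> { s.send("new1"); s.send("new2"); });
        List<String> all = s.wire();
        return new ArrayList<>(all.subList(before, all.size()));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model, demo() yields m2, m3, new1, new2: the two unacknowledged messages are resent first, and the messages injected during the view change follow them.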

> SEQUENCER race leads to lost messages
> -------------------------------------
>
>                 Key: JGRP-1449
>                 URL: https://issues.jboss.org/browse/JGRP-1449
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.9
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.1
>
>
> I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:
> 2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03
> It looks as though there's some sort of race where:
> -  a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER hasn't yet seen the view change
> -  meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view change, and so SEQUENCER there forwards messages to the new coordinator
> -  but because the new coordinator doesn't yet know that it is coordinator, we hit the problem above.
> The messages don't ever get retransmitted; so they're simply lost.
> Here's some trace from the member who drops the message, with the line from my application showing that he does indeed become coordinator a few milliseconds later:
> 2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03
> 2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO  c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
> And here's trace from the other end, showing that message being broadcast in the new view.
> 2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO  c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
> 2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO  c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg
> I can try to get lower level trace from JGroups if that would help.  
> I'm using the same stack as in JGRP-1443.
