[jboss-jira] [JBoss JIRA] (JGRP-1449) SEQUENCER race leads to lost messages

Wed Jun 6 10:38:18 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699436#comment-12699436 ] 

Bela Ban commented on JGRP-1449:
--------------------------------

{quote}
I agree that unit tests would be a good thing for the bugs that we're finding. I honestly don't know whether I'll find the time anytime soon, but Byteman is new to me and I do like finding new toys to play with; so never say never. Certainly if I hit any further problems I'll give this some serious thought.
{quote}

Yes, please do so. A failing unit test describes a bug much better than a textual description with some (unreadable) trace-level logs attached ... :-)

{quote}
Contrariwise... I've been finding all these bugs (not just in SEQUENCER) through a form of testing that I guess isn't part of your regular suite. Per the attachment to JGRP-1451 I'm setting up fully fledged (albeit simplistic) applications and doing nasty things to them in a random way, for many many hours. I hope you'll agree that this has been quite a productive line of attack. Maybe you'd find it worthwhile to do something similar yourself?
{quote}

I'm afraid I lack the time to do this; to a certain degree our QA department does this (with SmartFrog etc.), but ultimately I get bug reports from people like you when someone catches an edge case. Then I try to capture the scenario in a unit test to prevent future regressions, and write a fix for it.

As you can tell, SEQUENCER has never been used by a lot of people, which is why you ran into some bugs. In the near to medium term, the quality of SEQUENCER will certainly increase, especially if it becomes part of an official configuration for a product such as JDG or EAP.

> SEQUENCER race leads to lost messages
> -------------------------------------
>
>                 Key: JGRP-1449
>                 URL: https://issues.jboss.org/browse/JGRP-1449
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.9
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.1
>
>
> I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:
> 2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03
> It looks as though there's some sort of race where:
> -  a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER hasn't yet seen the view change
> -  meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view change, and so SEQUENCER there forwards messages to the new coordinator
> -  but because the new coordinator doesn't yet know that it is coordinator, we hit the problem above.
> The messages don't ever get retransmitted; so they're simply lost.
> Here's some trace from the member who drops the message, with the line from my application showing that he does indeed become coordinator a few milliseconds later:
> 2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03
> 2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO  c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
> And here's trace from the other end, showing that message being broadcast in the new view.
> 2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO  c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
> 2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO  c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg
> I can try to get lower level trace from JGroups if that would help.  
> I'm using the same stack as in JGRP-1443.
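
The race described in the issue above can be illustrated with a small toy model. This is only a sketch of the ordering problem and not the JGroups SEQUENCER code: the class SequencerRaceSketch, the nested ToyMember type and all of its methods are hypothetical names made up for this example. The member names and the message label are taken from the logs in the report. The model assumes that a member accepts a forwarded message only if the last view it installed names it as coordinator, and that nothing retransmits a dropped message.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the reported race. NOT the JGroups SEQUENCER implementation;
// every type and method here is a hypothetical stand-in for illustration.
public class SequencerRaceSketch {

    /** One cluster member with its own (possibly stale) idea of who coordinates. */
    public static final class ToyMember {
        public final String name;
        public String coordinator; // coordinator according to the last view this member installed
        public final List<String> delivered = new ArrayList<>();
        public final List<String> dropped = new ArrayList<>();

        public ToyMember(String name, String coordinator) {
            this.name = name;
            this.coordinator = coordinator;
        }

        /** Install a new view; in this model a view is just the coordinator's name. */
        public void installView(String newCoordinator) {
            this.coordinator = newCoordinator;
        }

        public boolean isCoord() {
            return name.equals(coordinator);
        }

        /** Receive a forwarded message: accept it only if we believe we are coordinator. */
        public void onForward(String msg) {
            if (isCoord()) {
                delivered.add(msg);
            } else {
                dropped.add(msg); // "non-coord; dropping FORWARD request"
            }
        }

        /** Forward a message to the target if this member believes the target is coordinator. */
        public void forwardTo(ToyMember target, String msg) {
            if (target.name.equals(coordinator)) {
                target.onForward(msg);
            }
        }
    }

    /** Replays the ordering from the logs and returns the member that becomes coordinator. */
    public static ToyMember simulate() {
        // Before the merge, each member still follows its subgroup's coordinator.
        ToyMember newCoord = new ToyMember("CFS-B-pisces-cfs02", "CFS-B-pisces-cfs03");
        ToyMember sender = new ToyMember("CFS-A-pisces-cfs03", "CFS-A-pisces-cfs03");

        // 1. The sender installs the merged view first.
        sender.installView(newCoord.name);
        // 2. The sender forwards a message to the coordinator named in that view.
        sender.forwardTo(newCoord, "ClusterMgmtMsg");
        // 3. Only now does the new coordinator install the same view.
        newCoord.installView(newCoord.name);
        return newCoord;
    }

    public static void main(String[] args) {
        ToyMember m = simulate();
        System.out.println("isCoord=" + m.isCoord()
                + " delivered=" + m.delivered
                + " dropped=" + m.dropped);
    }
}
```

Running main shows the message in the dropped list even though the receiving member ends up as coordinator. Swapping steps 2 and 3 in simulate() makes the same message land in the delivered list, which is the sense in which the outcome depends purely on the order of view installation.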
