[
https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin....
]
Bela Ban commented on JGRP-1449:
--------------------------------
{quote}
I agree that unit tests would be a good thing for the bugs that we're finding. I
don't honestly know that I'm likely to find the time anytime soon - but byteman is
new to me and I do like finding new toys to play with; so never say never. Certainly if I
hit any further problems I'll give this some serious thought.
{quote}
Yes, please do so. A failing unit test definitely describes a bug much better than a
textual description with some (unreadable) trace-level logs attached ... :-)
{quote}
Contrariwise... I've been finding all these bugs (not just in SEQUENCER) through a
form of testing that I guess isn't part of your regular suite. Per the attachment to
JGRP-1451 I'm setting up fully fledged (albeit simplistic) applications and doing
nasty things to them in a random way, for many many hours. I hope you'll agree that
this has been quite a productive line of attack. Maybe you'd find it worthwhile to do
something similar yourself?
{quote}
I'm afraid I lack the time to do this. To a certain degree our QA department does this
(with SmartFrog etc.), but ultimately I get bug reports from people like you when someone
catches an edge case. I then try to capture the scenario in a unit test, to prevent future
regressions, and write a fix for it.
As you can tell, SEQUENCER has never been used by a lot of people, which is why you ran
into some bugs. In the near to medium term, the quality of SEQUENCER will certainly
increase, especially if it becomes part of an official configuration for a product such as
JDG or EAP.
SEQUENCER race leads to lost messages
-------------------------------------
Key: JGRP-1449
URL: https://issues.jboss.org/browse/JGRP-1449
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.1
I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:
2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
It looks as though there's some sort of race where:
- a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER
hasn't yet seen the view change
- meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view
change, and so SEQUENCER there forwards messages to the new coordinator
- but because the new coordinator doesn't yet know that it is coordinator, we hit
the problem above.
The messages don't ever get retransmitted; so they're simply lost.
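To make the suspected interleaving concrete, here is a minimal toy model of it. This is not JGroups code: the Member class and its installView, forward and receiveForward methods are hypothetical names invented for illustration, and the real SEQUENCER may differ in detail. The sketch only assumes what is described above: a receiver that does not yet believe it is coordinator drops a forwarded message, and nothing retransmits it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the suspected race. NOT JGroups code: all names here are
// hypothetical and exist only to illustrate the ordering of events.
class SequencerRaceSketch {

    static class Member {
        final String name;
        String coordinator; // coordinator according to the last view this member has seen
        final List<String> delivered = new ArrayList<>();
        final List<String> dropped = new ArrayList<>();

        Member(String name, String initialCoordinator) {
            this.name = name;
            this.coordinator = initialCoordinator;
        }

        // The member learns of a view change naming a (possibly new) coordinator.
        void installView(String newCoordinator) {
            this.coordinator = newCoordinator;
        }

        boolean isCoordinator() {
            return name.equals(coordinator);
        }

        // Send a message to the member that this sender believes is coordinator.
        void forward(String msg, Member believedCoordinator) {
            believedCoordinator.receiveForward(msg);
        }

        // Assumption: a non-coordinator drops the request and nothing retransmits it.
        void receiveForward(String msg) {
            if (isCoordinator()) {
                delivered.add(msg);
            } else {
                dropped.add(msg);
            }
        }
    }

    // Replays the interleaving from the report and returns the new coordinator.
    static Member simulate() {
        Member newCoord = new Member("CFS-B-pisces-cfs02", "CFS-B-pisces-cfs03");
        Member sender = new Member("CFS-A-pisces-cfs03", "CFS-A-pisces-cfs03");

        sender.installView("CFS-B-pisces-cfs02");   // sender sees the view change first
        sender.forward("msg-1", newCoord);          // arrives before newCoord sees the view
        newCoord.installView("CFS-B-pisces-cfs02"); // newCoord catches up a little later
        sender.forward("msg-2", newCoord);          // now accepted
        return newCoord;
    }

    public static void main(String[] args) {
        Member newCoord = simulate();
        System.out.println("dropped=" + newCoord.dropped + " delivered=" + newCoord.delivered);
    }
}
```

In this model msg-1 is lost purely because of the order in which the two members see the view change; msg-2, sent after both have seen it, gets through.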
Here's some trace from the member who drops the message, with the line from my
application showing that he does indeed become coordinator a few milliseconds later:
2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
And here's trace from the other end, showing that message being broadcast in the new
view.
2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO
c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg
I can try to get lower level trace from JGroups if that would help.
I'm using the same stack as in JGRP-1443.