Bela Ban commented on JGRP-1449:
--------------------------------
OK, so the above scenario should be prevented by resending the messages one by one,
sending the next message in the forward-table only after the previous one has been removed
(a message is removed once the coordinator has broadcast it and we have delivered it).
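As an illustration only, the one-by-one resend described above could be sketched roughly
like this. This is not the code on the branch: the class OneByOneResender, the Forwarder
interface and all method names are hypothetical stand-ins, and the forward-table is
modelled as a plain sorted map keyed by a local sequence number.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of resending forwarded messages one at a time. Not JGroups code. */
class OneByOneResender {
    /** Stand-in for "send this message to the current coordinator". */
    interface Forwarder { void forward(long seqno, String payload); }

    // Messages forwarded but not yet delivered back to us, oldest first.
    private final TreeMap<Long, String> forwardTable = new TreeMap<>();
    private final Forwarder forwarder;
    private Long inFlight; // seqno we are currently waiting on, or null if none

    OneByOneResender(Forwarder forwarder) { this.forwarder = forwarder; }

    synchronized void add(long seqno, String payload) { forwardTable.put(seqno, payload); }

    /** Assumed to be called after a view change: restart from the oldest pending message. */
    synchronized void resendAll() {
        inFlight = null;
        sendNext();
    }

    /** Assumed to be called when the coordinator's broadcast of our message is delivered. */
    synchronized void onDelivered(long seqno) {
        forwardTable.remove(seqno);
        if (inFlight != null && inFlight.longValue() == seqno) {
            inFlight = null;
            sendNext(); // only now is the next message allowed out
        }
    }

    synchronized int pending() { return forwardTable.size(); }

    private void sendNext() {
        if (inFlight != null || forwardTable.isEmpty()) return;
        Map.Entry<Long, String> oldest = forwardTable.firstEntry();
        inFlight = oldest.getKey();
        forwarder.forward(oldest.getKey(), oldest.getValue());
    }

    public static void main(String[] args) {
        OneByOneResender r = new OneByOneResender(
            (seqno, payload) -> System.out.println("forwarding #" + seqno + ": " + payload));
        r.add(1, "a");
        r.add(2, "b");
        r.resendAll();    // forwards only #1
        r.onDelivered(1); // #1 is removed, so #2 is forwarded now
        r.onDelivered(2); // table is empty again
    }
}
```

The point of the sketch is only the ordering constraint: at most one message from the
table is outstanding at any time, so a resend cannot overtake an earlier message.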
Take a look at branch JGRP-1449.reopen (method resendMessagesInForwardTable()) and let me
know what you think...
To be honest, I don't like the code :-) It is too complex, so I'll try to think
about a simpler and more elegant solution.
SEQUENCER race leads to lost messages
-------------------------------------
Key: JGRP-1449
URL: https://issues.jboss.org/browse/JGRP-1449
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.1
I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:
2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
It looks as though there's some sort of race where:
- a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER
hasn't yet seen the view change
- meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view
change, and so SEQUENCER there forwards messages to the new coordinator
- but because the new coordinator doesn't yet know that it is coordinator, we hit
the problem above.
The messages are never retransmitted, so they're simply lost.
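To make the suspected ordering concrete, here is a toy model of it. It is a sketch under
my own assumptions, not JGroups code: the Member class, its isCoord flag and the
handleForward/installView methods are invented for illustration, and it only models
"drop a FORWARD request if we do not yet believe we are coordinator".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the race; names are illustrative, not the JGroups API. */
class SequencerRaceDemo {
    static class Member {
        final String name;
        boolean isCoord; // changes only when this member processes the new view
        final List<String> broadcast = new ArrayList<>();
        final List<String> dropped = new ArrayList<>();

        Member(String name) { this.name = name; }

        /** The member learns who the coordinator is in the new view. */
        void installView(String coordName) { isCoord = name.equals(coordName); }

        /** Returns true if the forwarded message was accepted for broadcast. */
        boolean handleForward(String msg) {
            if (!isCoord) {          // "non-coord; dropping FORWARD request"
                dropped.add(msg);
                return false;
            }
            broadcast.add(msg);
            return true;
        }
    }

    public static void main(String[] args) {
        Member b = new Member("CFS-B-pisces-cfs02");

        // The sender has already seen the new view and forwards to the new coordinator,
        // but the new coordinator has not processed that view yet.
        boolean early = b.handleForward("ClusterMgmtMsg");

        // A few milliseconds later the view arrives; the same request would now succeed.
        b.installView("CFS-B-pisces-cfs02");
        boolean late = b.handleForward("ClusterMgmtMsg");

        System.out.println("accepted before view: " + early + ", after view: " + late);
        System.out.println("dropped: " + b.dropped);
    }
}
```

In the model, as in the logs, nothing retries the first request once it has been dropped.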
Here's some trace from the member who drops the message, with the line from my
application showing that he does indeed become coordinator a few milliseconds later:
2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
And here's trace from the other end, showing that message being broadcast in the new
view.
2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO
c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg
I can try to get lower level trace from JGroups if that would help.
I'm using the same stack as in JGRP-1443.