[jboss-jira] [JBoss JIRA] (JGRP-1449) SEQUENCER race leads to lost messages

Thursday, 31 May 2012



    [ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1449:
--------------------------------

I think we need to prevent another error scenario as well:
- The view is {A,B,C}, and A is the coordinator
- C forwards messages 1-5 to A, which broadcasts 1-3 and then crashes
- A new view {B,C} is installed, with B as the new coordinator
- C now sends another 5 messages (6-10), in parallel to resending messages 4-5
- If B receives 6-10 first, it'll set the expected seqno to 7 on
reception of 6. When 4-5 arrive afterwards, it'll drop them. Only when 4-5 are received
*before* 6-10 will 4-5 get delivered
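
The scenario above can be replayed with a small model. This is only a sketch under
stated assumptions: ForwardOrderDemo and its "expected seqno" bookkeeping are
hypothetical simplifications of what the new coordinator does per sender, not the
actual SEQUENCER code. It shows that 4-5 survive only if they arrive before 6-10.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the new coordinator B tracking one sender (C). */
final class ForwardOrderDemo {
    private long expected = -1;                       // -1: nothing received from C yet
    private final List<Long> delivered = new ArrayList<>();

    /** Accepts a forwarded message unless its seqno is below the expected one. */
    void receive(long seqno) {
        if (expected != -1 && seqno < expected)
            return;                                   // looks like an old message: dropped
        delivered.add(seqno);
        expected = seqno + 1;                         // e.g. 7 after receiving 6
    }

    /** Feeds the seqnos to a fresh receiver in the given arrival order. */
    static List<Long> run(long[] arrivalOrder) {
        ForwardOrderDemo b = new ForwardOrderDemo();
        for (long seqno : arrivalOrder)
            b.receive(seqno);
        return b.delivered;
    }

    public static void main(String[] args) {
        // New messages overtake the resent ones: 4 and 5 are lost
        System.out.println(run(new long[]{6, 7, 8, 9, 10, 4, 5}));
        // Resent messages arrive first: everything is delivered
        System.out.println(run(new long[]{4, 5, 6, 7, 8, 9, 10}));
    }
}
```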

SOLUTION:
- We have to make sure the messages in the forward-queue are sent (*and* received, in
order to be removed from the forward-queue) *before* any new messages are sent
- To that end, we could block the sending of messages after a view change until the
forward-queue is empty
- The forwarding of the messages in the forward-queue, and the check that the queue
has been emptied, could be done in a loop, with sleeps in between
- This eliminates the need for a timer task
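
A minimal sketch of that idea follows, assuming a sender-side class that owns the
forward-queue. Every name in it (ForwardQueueFlushSketch, Transport, onDelivered,
onViewChange, flushForwardQueue) is hypothetical and chosen for illustration; it is
not the SEQUENCER implementation or the eventual fix. send() blocks while a flush is
in progress, and the flush loop re-forwards the queued messages in seqno order,
sleeping between rounds until everything has been confirmed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch only: all names are hypothetical, not the SEQUENCER implementation. */
final class ForwardQueueFlushSketch {
    /** Hypothetical hook that forwards a message to the current coordinator. */
    interface Transport { void forward(long seqno, String payload); }

    // Forwarded-but-unconfirmed messages, ordered by seqno
    private final ConcurrentSkipListMap<Long, String> forwardQueue = new ConcurrentSkipListMap<>();
    private final Transport transport;
    private boolean flushing;   // guarded by "this"
    private long nextSeqno = 1; // guarded by "this"

    ForwardQueueFlushSketch(Transport transport) { this.transport = transport; }

    /** New messages wait here while a flush is in progress. */
    synchronized void send(String payload) throws InterruptedException {
        while (flushing) wait();
        long seqno = nextSeqno++;
        forwardQueue.put(seqno, payload);
        transport.forward(seqno, payload);
    }

    /** Called when one of our own messages comes back from the coordinator. */
    void onDelivered(long seqno) { forwardQueue.remove(seqno); }

    /** Called when a new view is installed: from now on, send() blocks. */
    synchronized void onViewChange() { flushing = true; }

    /** Re-forwards queued messages in seqno order until all are confirmed, then unblocks send(). */
    void flushForwardQueue(long sleepMillis) throws InterruptedException {
        try {
            while (!forwardQueue.isEmpty()) {
                for (Map.Entry<Long, String> e : forwardQueue.entrySet())
                    transport.forward(e.getKey(), e.getValue());
                if (!forwardQueue.isEmpty()) Thread.sleep(sleepMillis);
            }
        } finally {
            synchronized (this) { flushing = false; notifyAll(); }
        }
    }

    /** Replays the {A,B,C} scenario; returns the seqnos in the order B receives them. */
    static List<Long> demo() {
        List<Long> receivedByB = Collections.synchronizedList(new ArrayList<>());
        AtomicBoolean aAlive = new AtomicBoolean(true);
        ForwardQueueFlushSketch[] self = new ForwardQueueFlushSketch[1];
        ForwardQueueFlushSketch c = new ForwardQueueFlushSketch((seqno, payload) -> {
            if (aAlive.get()) {
                if (seqno <= 3) self[0].onDelivered(seqno); // A broadcasts 1-3 only
            } else {
                receivedByB.add(seqno);                     // B is the coordinator now
                self[0].onDelivered(seqno);
            }
        });
        self[0] = c;
        try {
            for (int i = 1; i <= 5; i++) c.send("m" + i);   // 4-5 stay in the forward-queue
            aAlive.set(false);                              // A crashes
            c.onViewChange();                               // view {B,C} is installed
            Thread sender = new Thread(() -> {
                try { for (int i = 6; i <= 10; i++) c.send("m" + i); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            sender.start();                                 // blocks in send() until the flush is done
            c.flushForwardQueue(10);
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(receivedByB);
    }
}
```

In this sketch demo() returns the seqnos in the order 4, 5, 6, ..., 10, because the
sender thread cannot get past send() until the flush loop has emptied the queue.
Holding the monitor while forwarding keeps the example short; a real protocol would
probably want finer-grained locking.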
                
...
 SEQUENCER race leads to lost messages
 -------------------------------------

                 Key: JGRP-1449
                 URL: https://issues.jboss.org/browse/JGRP-1449
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.0.9
            Reporter: David Hotham
            Assignee: Bela Ban
             Fix For: 3.1


 I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:
 2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
 It looks as though there's some sort of race where:
 -  a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER
hasn't yet seen the view change
 -  meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view
change, and so SEQUENCER there forwards messages to the new coordinator
 -  but because the new coordinator doesn't yet know that it is coordinator, we hit
the problem above.
 The messages don't ever get retransmitted; so they're simply lost.
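
 The race described above can be illustrated with a toy model. This is a sketch under
 assumptions: StaleViewDropDemo and Member are hypothetical names, and the logic only
 mirrors the "non-coord; dropping FORWARD request" behaviour seen in the log, not the
 real SEQUENCER code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of a member that drops FORWARD requests while it is not coordinator. */
final class StaleViewDropDemo {
    static final class Member {
        final String name;
        String coord;                                   // who this member believes is coordinator
        final List<String> broadcast = new ArrayList<>();
        final List<String> dropped = new ArrayList<>();

        Member(String name, String coord) { this.name = name; this.coord = coord; }

        void installView(String newCoord) { coord = newCoord; }

        /** Handles a FORWARD request from another member. */
        void onForward(String msg) {
            if (name.equals(coord)) broadcast.add(msg);
            else dropped.add(msg);                      // "non-coord; dropping FORWARD request"
        }
    }

    /** The sender has already seen the new view; the new coordinator has not. */
    static Member run() {
        Member newCoord = new Member("CFS-B-pisces-cfs02", "some-other-member");
        newCoord.onForward("m1");                       // arrives before the view change: lost
        newCoord.installView("CFS-B-pisces-cfs02");     // the view change is finally seen
        newCoord.onForward("m2");                       // arrives afterwards: broadcast
        return newCoord;
    }
}
```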
 Here's some trace from the member who drops the message, with the line from my
application showing that he does indeed become coordinator a few milliseconds later:
 2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR
org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request
from CFS-A-pisces-cfs03
 2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO 
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
 And here's trace from the other end, showing that message being broadcast in the new
view.
 2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO 
c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86]
[CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02],
[CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
 2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO 
c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg
 I can try to get lower level trace from JGroups if that would help.  
 I'm using the same stack as in JGRP-1443. 
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
