[JBoss JIRA] (JGRP-1449) SEQUENCER race leads to lost messages

Bela Ban (JIRA)

Friday, 13 April

7:50 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1449:
---------------------------
Fix Version/s: 3.1

SEQUENCER race leads to lost messages
-------------------------------------

Key: JGRP-1449
URL: https://issues.jboss.org/browse/JGRP-1449
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.1

I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:

2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03

It looks as though there's some sort of race where:

- a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER hasn't yet seen the view change
- meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view change, and so SEQUENCER there forwards messages to the new coordinator
- but because the new coordinator doesn't yet know that it is coordinator, we hit the problem above.

The messages don't ever get retransmitted; so they're simply lost.

Here's some trace from the member who drops the message, with the line from my application showing that he does indeed become coordinator a few milliseconds later:

2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03
2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]

And here's trace from the other end, showing that message being broadcast in the new view.

2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg

I can try to get lower level trace from JGroups if that would help. I'm using the same stack as in JGRP-1443.
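The interleaving reported above can be replayed in a small, deterministic model. This is only a sketch under stated assumptions: the Member class, its fields, and the single-threaded replay are hypothetical and are not taken from the JGroups sources. The short names "A", "B", and "old-coord" stand in for the sender, the new coordinator, and the previous coordinator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the reported race; not the JGroups SEQUENCER code.
class ForwardRaceSketch {

    /** Minimal stand-in for a member running a sequencer-like protocol. */
    static class Member {
        final String name;
        String coordinator;                               // coordinator per the last view this member installed
        final List<String> broadcast = new ArrayList<>(); // messages this member re-broadcast as coordinator
        final List<String> dropped = new ArrayList<>();   // FORWARD requests this member discarded

        Member(String name, String coordinator) {
            this.name = name;
            this.coordinator = coordinator;
        }

        void installView(String newCoordinator) { coordinator = newCoordinator; }

        boolean isCoordinator() { return name.equals(coordinator); }

        /** Only a member that already believes it is coordinator broadcasts a forwarded message. */
        void onForward(String payload) {
            if (isCoordinator()) broadcast.add(payload);
            else dropped.add(payload);                    // mirrors "non-coord; dropping FORWARD request"
        }
    }

    /** Replays the order of events from the logs and returns the new coordinator. */
    static Member replay() {
        Member a = new Member("A", "old-coord");
        Member b = new Member("B", "old-coord");
        a.installView("B");                               // 10:05:21.359: A installs the merge view
        if (a.coordinator.equals(b.name))
            b.onForward("ClusterMgmtMsg");                // 10:05:21.363: B receives the FORWARD too early
        b.installView("B");                               // 10:05:21.393: B installs the same view
        return b;
    }

    public static void main(String[] args) {
        Member b = replay();
        System.out.println("broadcast=" + b.broadcast + " dropped=" + b.dropped);
    }
}
```

In this model nothing retries the dropped request, so the message stays lost even after B becomes coordinator, which matches the behaviour described in the report.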

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira

David Hotham (JIRA)

10:45 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

David Hotham commented on JGRP-1449:
------------------------------------

Thanks. In this case the view change at the sender occurs before the message is broadcast at all - so I don't think there'll be even one re-submission, if I understand correctly.

It sounds as though resubmits on a timer may be the way to go.


Bela Ban (JIRA)

Friday, 1 June

6:57 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1449:
--------------------------------

Created a byteman-based test, SequencerFailoverTest, which injects new messages 3-10 into the system just before the view change (which triggers the resending of 1-2). The test currently fails, as expected, and shows messages 3-10, but not 1-2. The latter two are rejected because 3-10 are delivered first, so the next expected seqno is 11.
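The failure mode in that test can be illustrated with a toy receiver. The delivery rule below (deliver any seqno that is not below the next expected one, then advance) is an assumption made for illustration; it is not copied from SEQUENCER's canDeliver(), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a per-sender delivery check; not the actual SEQUENCER implementation.
class SeqnoDeliverySketch {
    private long nextExpected = 1;
    final List<Long> delivered = new ArrayList<>();
    final List<Long> rejected = new ArrayList<>();

    /** Assumed rule: deliver a seqno only if it is not below the next expected one. */
    void receive(long seqno) {
        if (seqno >= nextExpected) {
            delivered.add(seqno);
            nextExpected = seqno + 1;
        } else {
            rejected.add(seqno);          // arrives too late and is discarded
        }
    }

    long nextExpected() { return nextExpected; }

    /** New messages 3-10 overtake the resent messages 1-2. */
    static SeqnoDeliverySketch replay() {
        SeqnoDeliverySketch receiver = new SeqnoDeliverySketch();
        for (long seqno = 3; seqno <= 10; seqno++) receiver.receive(seqno);
        receiver.receive(1);
        receiver.receive(2);
        return receiver;
    }

    public static void main(String[] args) {
        SeqnoDeliverySketch r = replay();
        System.out.println("delivered=" + r.delivered + " rejected=" + r.rejected
                + " nextExpected=" + r.nextExpected());
    }
}
```

Under this assumed rule, once 3-10 have been delivered the receiver expects 11, so the resent 1 and 2 can no longer be accepted.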


Bela Ban (JIRA)

9:20 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1449:
--------------------------------

Please let me know if this works for you and possibly also review the code changes.


Bela Ban (JIRA)

9:52 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1449:
--------------------------------

Sorry, I didn't see that my push had been rejected before... Now the push has succeeded.


Bela Ban (JIRA)

10:35 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1449:
--------------------------------

arrgh, you're right. Re-opening.

The solution is to simply resend the messages in the forward-table until the forward-table is empty... or there is another view change. I actually mentioned this before, but forgot to implement it...
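The "resend until the forward-table is empty or the view changes" idea could be sketched roughly as follows. This is not the fix that went into JGroups: the class, the coordinatorAccepts predicate (a stand-in for "forward the message and see it broadcast"), and the viewChanged check are all hypothetical, and a real implementation would wait between passes rather than loop immediately.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;

// Hypothetical sketch of a resend loop; not the actual SEQUENCER fix.
class ForwardTableResendSketch {
    // Messages forwarded to the coordinator but not yet seen broadcast, ordered by seqno.
    final TreeMap<Long, String> forwardTable = new TreeMap<>();

    /**
     * Resends pending messages in seqno order until the table is empty or the view changes.
     * Returns the number of passes made over the table.
     */
    int resendUntilEmpty(Predicate<String> coordinatorAccepts, BooleanSupplier viewChanged) {
        int passes = 0;
        while (!forwardTable.isEmpty() && !viewChanged.getAsBoolean()) {
            passes++;
            Map.Entry<Long, String> first;
            // Always retry the lowest seqno first so later messages never overtake earlier ones.
            while ((first = forwardTable.firstEntry()) != null
                    && coordinatorAccepts.test(first.getValue())) {
                forwardTable.remove(first.getKey());
            }
            // A real implementation would sleep or wait on a timer here before the next pass.
        }
        return passes;
    }

    public static void main(String[] args) {
        ForwardTableResendSketch sketch = new ForwardTableResendSketch();
        sketch.forwardTable.put(1L, "m1");
        sketch.forwardTable.put(2L, "m2");
        int[] attempts = {0};
        // The coordinator rejects the first two attempts, as if it had not installed the view yet.
        int passes = sketch.resendUntilEmpty(m -> ++attempts[0] > 2, () -> false);
        System.out.println("passes=" + passes + " remaining=" + sketch.forwardTable);
    }
}
```

Stopping at the first rejected message in each pass is a deliberate choice in this sketch: it keeps the sender's messages in seqno order while the coordinator is still catching up.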


David Hotham (JIRA)

11:16 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

David Hotham commented on JGRP-1449:
------------------------------------

I'm also concerned by the change you've made to canDeliver() (per your comments in JGRP-1461, I think). Suppose:

- application sends messages 3 and 4 (from the same thread; so from its point of view, strictly in that order)
- scenario per my last comment, B drops message 3
- then B gets the new view before message 4 arrives
- B accepts and re-broadcasts message 4
- now everyone receives and delivers message 4 even though they haven't yet received message 3
- (message 3 will be re-broadcast in due course, once we get this issue fixed up. But it can only ever arrive after message 4)

So SEQUENCER now guarantees that everyone receives messages in the same order, but not that this is the order that they were sent by the application!

What do you think?
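The scenario in the list above can be replayed step by step. The sketch below is purely illustrative: the variable names and the single-threaded replay are assumptions, and the later resend of message 3 stands in for whatever retransmission mechanism ends up being used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay of the ordering scenario; names and structure are illustrative only.
class SenderOrderSketch {

    /** Returns the order in which every member delivers the sender's messages 3 and 4. */
    static List<Integer> replay() {
        List<Integer> deliveryOrder = new ArrayList<>();  // identical at all members (total order)
        List<Integer> forwardTable = new ArrayList<>();   // sender's messages not yet seen broadcast
        boolean bIsCoordinator = false;                   // B has not installed the new view yet

        // The application sends 3, then 4, from one thread.
        forwardTable.add(3);
        if (bIsCoordinator) deliveryOrder.add(3);         // not taken: B drops message 3

        bIsCoordinator = true;                            // B installs the new view

        forwardTable.add(4);
        if (bIsCoordinator) {                             // B accepts and re-broadcasts message 4
            deliveryOrder.add(4);
            forwardTable.remove(Integer.valueOf(4));
        }

        // Later, the sender resends whatever is still in its forward table (message 3).
        for (Integer seqno : new ArrayList<>(forwardTable)) {
            deliveryOrder.add(seqno);
            forwardTable.remove(seqno);
        }
        return deliveryOrder;
    }

    public static void main(String[] args) {
        System.out.println(replay());                     // prints [4, 3]
    }
}
```

Every member sees the same order, so total order holds, but the order is 4 before 3, which is not the order in which the application sent them.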


David Hotham (JIRA)

4:18 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

David Hotham commented on JGRP-1449:
------------------------------------

Aha, very clever! I had not thought of that. Yes, this looks as though it should work. I'll set some tests running.


Bela Ban (JIRA)

6:35 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban resolved JGRP-1449.
----------------------------
Resolution: Done


Bela Ban (JIRA)

6:37 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban edited comment on JGRP-1449 at 6/6/12 7:35 AM:
--------------------------------------------------------

David, I hope this last fix now finally puts JGRP-1449 to rest! :-)

As you noticed, I put quite a lot of time into improving SEQUENCER, although it is currently not really used by any of our customers. The main reason is that we're looking into using total ordering in our Infinispan project, to disseminate changes and perform state transfer in a *collision-free* way. That was a good reason to brush the dust off SEQUENCER and improve it. Your testing of SEQUENCER was certainly also important, to find all of those edge cases!

Assuming that you use or are going to use SEQUENCER in production, I think it would be great if you could provide unit tests for the edge cases / issues you run into. This would help us, and it would certainly also help you by getting a more robust and correct SEQUENCER implementation, catching regressions early on.

I recently started using byteman (byteman.org) to write unit tests for scenarios that would otherwise require code changes to the protocols, such as SEQUENCER. Using byteman, I don't have to change code, but I make use of BM's ability to change code on the fly, using an agent. For example, for this issue, I created SequencerFailoverTest.testResendingVersusNewMessages(), which uses the BM script testResendingVersusNewmessages.btm. The script intercepts the view change and sends new messages, and the test ensures that those new messages are delivered *after* the messages in the forward-table.

I think the scenario you describe in your comment dated June 5 12:16 would be an excellent candidate for such a BM-based unit test.

...

SEQUENCER race leads to lost messages
-------------------------------------

Key: JGRP-1449
URL: https://issues.jboss.org/browse/JGRP-1449
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.1

I'm seeing an issue where SEQUENCER is dropping messages, with logs like this:

2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03

It looks as though there's some sort of race where:

- a member ("CFS-B-pisces-cfs02") is becoming coordinator, but SEQUENCER hasn't yet seen the view change
- meanwhile some other member ("CFS-A-pisces-cfs03") has been told of the view change, and so SEQUENCER there forwards messages to the new coordinator
- but because the new coordinator doesn't yet know that it is coordinator, we hit the problem above.

The messages don't ever get retransmitted; so they're simply lost.

Here's some trace from the member who drops the message, with the line from my application showing that he does indeed become coordinator a few milliseconds later:

2012-04-13 10:05:21.363 [Incoming-1,pisces,CFS-B-pisces-cfs02] ERROR org.jgroups.protocols.SEQUENCER - CFS-B-pisces-cfs02: non-coord; dropping FORWARD request from CFS-A-pisces-cfs03
2012-04-13 10:05:21.393 [Incoming-2,pisces,CFS-B-pisces-cfs02] INFO c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]

And here's trace from the other end, showing that message being broadcast in the new view.

2012-04-13 10:05:21.359 [Incoming-1,pisces,CFS-A-pisces-cfs03] INFO c.m.c.CommunicatorComponent$Communicator - New view: MergeView::[CFS-B-pisces-cfs02|86] [CFS-B-pisces-cfs02, CFS-B-pisces-cfs03, CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], subgroups=[CFS-A-pisces-cfs03|84] [CFS-A-pisces-cfs03, CFS-A-pisces-cfs02], [CFS-B-pisces-cfs03|85] [CFS-B-pisces-cfs03, CFS-A-pisces-cfs02, CFS-B-pisces-cfs02]
2012-04-13 10:05:21.361 [ForkJoinPool-1-worker-0] INFO c.m.c.CommunicatorComponent$Communicator - Broadcasting ClusterMgmtMsg

I can try to get lower level trace from JGroups if that would help. I'm using the same stack as in JGRP-1443.
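The race in the description above can be reproduced in a few lines of plain Java. This is a hedged sketch, not JGroups code: SequencerRaceSketch, Member, installView and onForward are hypothetical names, and the model only shows why a FORWARD request is dropped when it reaches the new coordinator before that member has installed the new view. It does not show how the issue was fixed in 3.1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the reported race; names are illustrative, not JGroups API.
final class SequencerRaceSketch {
    static final class Member {
        final String name;
        String coordinator; // coordinator according to the last view this member installed
        final List<String> broadcast = new ArrayList<>();
        final List<String> dropped = new ArrayList<>();

        Member(String name, String coordinator) {
            this.name = name;
            this.coordinator = coordinator;
        }

        void installView(String newCoordinator) {
            coordinator = newCoordinator;
        }

        // A FORWARD request is only honoured by a member that believes it is coordinator.
        void onForward(String from, String payload) {
            if (!name.equals(coordinator)) {
                // Corresponds to "non-coord; dropping FORWARD request from <from>"
                dropped.add(from + ": " + payload);
                return;
            }
            broadcast.add(payload);
        }
    }

    public static void main(String[] args) {
        // cfs02 still holds the old view of its subgroup, coordinated by CFS-B-pisces-cfs03.
        Member cfs02 = new Member("CFS-B-pisces-cfs02", "CFS-B-pisces-cfs03");
        // The sender already saw the merge view and forwards to the new coordinator.
        cfs02.onForward("CFS-A-pisces-cfs03", "ClusterMgmtMsg");
        // Only afterwards does cfs02 learn that it is the coordinator.
        cfs02.installView("CFS-B-pisces-cfs02");
        System.out.println("dropped=" + cfs02.dropped + " broadcast=" + cfs02.broadcast);
    }
}
```

Because nothing in this model retransmits the dropped request, the message stays lost even after the view is installed, which matches the behaviour reported above.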

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira


Bela Ban (JIRA)

9:38 a.m.

[ https://issues.jboss.org/browse/JGRP-1449?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1449:
--------------------------------

{quote}
I agree that unit tests would be a good thing for the bugs that we're finding. I don't honestly know that I'm likely to find the time anytime soon - but byteman is new to me and I do like finding new toys to play with; so never say never. Certainly if I hit any further problems I'll give this some serious thought.
{quote}

Yes, please do so. A failing unit test definitely describes a bug much better than a textual description, with some (unreadable) trace level logs attached ... :-)

{quote}
Contrariwise... I've been finding all these bugs (not just in SEQUENCER) through a form of testing that I guess isn't part of your regular suite. Per the attachment to JGRP-1451 I'm setting up fully fledged (albeit simplistic) applications and doing nasty things to them in a random way, for many many hours. I hope you'll agree that this has been quite a productive line of attack. Maybe you'd find it worthwhile to do something similar yourself?
{quote}

I'm afraid I lack the time to do this; to a certain degree our QA department does this (with smartfrog etc), but ultimately I get bug reports when someone catches an edge case, from people like you. Then I try to capture the scenario in a unit test, to prevent future regressions and write a fix for it.

As you can tell, SEQUENCER has never been used by a lot of people, that's why you ran into some bugs. In the near to medium term, the quality of SEQUENCER will certainly increase, especially if it becomes part of an official configuration for a product such as JDG or EAP.

