[jboss-jira] [JBoss JIRA] (JGRP-1674) STOP_FLUSH race condition
Nikolay Martynov (JIRA)
jira-events at lists.jboss.org
Wed Aug 21 08:22:26 EDT 2013
[ https://issues.jboss.org/browse/JGRP-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798204#comment-12798204 ]
Nikolay Martynov commented on JGRP-1674:
----------------------------------------
I can see this issue with 3.4.Alpha2 from master. The following is the log on a second node joining a cluster that has one node. You can see that it ignores a flush end it doesn't recognize and then times out on its own flush:
2013-08-21 15:57:49,402 [main] DEBUG org.jgroups.protocols.pbcast.GMS - martinon1vm1-53554: installing [martinon1vm1-24496|1] [martinon1vm1-24496, martinon1vm1-53554]
2013-08-21 15:57:49,403 [main] DEBUG org.jgroups.protocols.FD_SOCK - VIEW_CHANGE received: [martinon1vm1-24496, martinon1vm1-53554]
2013-08-21 15:57:49,404 [main] DEBUG org.infinispan.remoting.transport.jgroups.JGroupsTransport - New view accepted: [martinon1vm1-24496|1] [martinon1vm1-24496, martinon1vm1-53554]
2013-08-21 15:57:49,405 [main] INFO org.infinispan.remoting.transport.jgroups.JGroupsTransport - ISPN000094: Received new cluster view: [martinon1vm1-24496|1] [martinon1vm1-24496, martinon1vm1-53554]
2013-08-21 15:57:49,406 [main] INFO com.hp.openmediation.proto.Application - /// Event EventImpl{type=VIEW_CHANGED, newMembers=[martinon1vm1-24496, martinon1vm1-53554], oldMembers=null, localAddress=martinon1vm1-53554, viewId=1, subgroupsMerged=null, mergeView=false} (type=VIEW_CHANGED)
2013-08-21 15:57:49,406 [main] DEBUG org.jgroups.protocols.pbcast.FLUSH - martinon1vm1-53554: installing view [martinon1vm1-24496|1] [martinon1vm1-24496, martinon1vm1-53554]
2013-08-21 15:57:49,408 [FD_SOCK pinger,ISPN,martinon1vm1-53554] DEBUG org.jgroups.protocols.FD_SOCK - ping_dest is martinon1vm1-24496, pingable_mbrs=[martinon1vm1-24496, martinon1vm1-53554]
2013-08-21 15:57:49,409 [Incoming-2,ISPN,martinon1vm1-53554] DEBUG org.jgroups.protocols.pbcast.FLUSH - Received STOP_FLUSH at martinon1vm1-53554 but I am not flush participant, not responding
2013-08-21 15:57:51,408 [main] WARN org.jgroups.protocols.pbcast.FLUSH - martinon1vm1-53554: waiting for UNBLOCK timed out after 2000 ms
With many nodes joining the cluster, this results in lots of blocks and leads to missing replicas in Infinispan.
> STOP_FLUSH race condition
> -------------------------
>
> Key: JGRP-1674
> URL: https://issues.jboss.org/browse/JGRP-1674
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6.21
> Reporter: Dennis Reed
> Assignee: Bela Ban
> Fix For: 3.4
>
>
> There is a race condition in STOP_FLUSH when a node joins the cluster.
> JOINER sends JOIN_REQ to MASTER
> MASTER does a flush on the existing members (does NOT include JOINER)
> MASTER sends JOIN_RSP
> MASTER sends STOP_FLUSH
> JOINER receives JOIN_RSP
> JOINER fetches state, sends START_FLUSH
> JOINER receives STOP_FLUSH from MASTER (does not apply, since JOINER was not part of the original FLUSH)
> onStopFlush never verifies that the current node was part of the FLUSH, and therefore that the STOP_FLUSH is valid for the current node.
> This STOP_FLUSH corrupts JOINER's FLUSH by resetting all the member variables (and probably unblocking as well).
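
The check that the description above says is missing from onStopFlush can be sketched roughly as follows. This is a simplified standalone model, not the JGroups FLUSH code: the class StopFlushGuardSketch, its fields and its method signatures are all hypothetical, and addresses are plain strings. It also assumes that a STOP_FLUSH can be matched to the flush it ends by the coordinator that started that flush.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of one node's flush bookkeeping (not the JGroups API).
public class StopFlushGuardSketch {
    private final String localAddress;
    // Coordinators of the flushes this node is currently part of (assumed bookkeeping).
    private final Set<String> activeFlushCoordinators = new HashSet<>();
    private boolean blocked;

    public StopFlushGuardSketch(String localAddress) {
        this.localAddress = localAddress;
    }

    // A flush started by 'coordinator' for 'participants'; only participants block.
    public void onStartFlush(String coordinator, Set<String> participants) {
        if (!participants.contains(localAddress)) {
            return; // not part of this flush, nothing to track
        }
        activeFlushCoordinators.add(coordinator);
        blocked = true;
    }

    // Returns true if the STOP_FLUSH was applied, false if it was ignored.
    public boolean onStopFlush(String coordinator) {
        // The guard: ignore a STOP_FLUSH for a flush this node never took part in,
        // instead of resetting state that belongs to a different flush.
        if (!activeFlushCoordinators.remove(coordinator)) {
            return false;
        }
        if (activeFlushCoordinators.isEmpty()) {
            blocked = false;
        }
        return true;
    }

    public boolean isBlocked() {
        return blocked;
    }

    public static void main(String[] args) {
        StopFlushGuardSketch joiner = new StopFlushGuardSketch("JOINER");

        // MASTER's flush covers only the existing members, so JOINER is not in it.
        joiner.onStartFlush("MASTER", Set.of("MASTER"));
        // JOINER starts its own flush for the state transfer.
        joiner.onStartFlush("JOINER", Set.of("MASTER", "JOINER"));

        // The late STOP_FLUSH from MASTER's flush arrives and must not be applied.
        System.out.println("stale STOP_FLUSH applied: " + joiner.onStopFlush("MASTER")); // false
        System.out.println("still blocked: " + joiner.isBlocked());                      // true

        // JOINER's own flush ends normally.
        System.out.println("own STOP_FLUSH applied: " + joiner.onStopFlush("JOINER"));   // true
        System.out.println("still blocked: " + joiner.isBlocked());                      // false
    }
}
```

With the guard in place, the stale STOP_FLUSH from MASTER leaves JOINER's own flush untouched. Whether the real fix should key on the coordinator, a flush id or the participant list is a design choice this sketch does not settle.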
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira