[JBoss JIRA] Created: (JGRP-608) FLUSH not safe in simultaneous flush situation

Michael Newcomb (JIRA)

Tuesday, 23 October 2007

10:38 a.m.

FLUSH not safe in simultaneous flush situation
----------------------------------------------

                 Key: JGRP-608
                 URL: http://jira.jboss.com/jira/browse/JGRP-608
             Project: JGroups
          Issue Type: Feature Request
    Affects Versions: 2.6
            Reporter: Michael Newcomb
         Assigned To: Bela Ban


I'm running into a few problems when multiple members request a FLUSH at the same time. I am still in the process of analyzing the situation, but here are a few problems:

    private void handleStartFlush(Message msg, FlushHeader fh) {
        byte oldPhase = flushPhase.transitionToFirstPhase();
        if(oldPhase == FlushPhase.START_PHASE){
            sendBlockUpToChannel();
            onStartFlush(msg.getSrc(), fh);
        }else if(oldPhase == FlushPhase.FIRST_PHASE){
            Address flushRequester = msg.getSrc();
            Address coordinator = null;
            synchronized(sharedLock){
                if(flushCoordinator != null)
                    coordinator = flushCoordinator;
                else
                    coordinator = flushRequester;
            }

After a successful transition to the first phase, onStartFlush() is called, and the flushCoordinator is set inside that method. However, a simultaneous handleStartFlush() triggered by another member would fail the transitionToFirstPhase() call, yet the flushCoordinator might not have been set yet, so the second handleStartFlush() would end up setting 'coordinator' to the flushRequester. This could grant the second START_FLUSH a FLUSH_OK without rejecting the flush that succeeded in transitionToFirstPhase().

IMHO, transitionToFirstPhase() should assign the flushCoordinator at that time to avoid the above situation; in fact, all of the protected parts of the onStartFlush() call should be made inside the protected section of transitionToFirstPhase(). That should take care of the didn't-succeed-in-transitionToFirstPhase-but-became-flushCoordinator-anyways-because-flushCoordinator-wasn't-set-yet problem.

Now, there is another problem in the way simultaneous START_FLUSHes are resolved. If the new flushRequester is < the current flushCoordinator, the new requester will win the flush fight. This is a great strategy because all the members will reach the same result: if A, B and C are members with A < B < C and they all request a flush simultaneously, the desired effect is for everyone to decide that A wins the flush fight. The following code handles that situation:

    private void handleStartFlush(Message msg, FlushHeader fh) {
        ...
        }else if(oldPhase == FlushPhase.FIRST_PHASE){
            ...
            if(flushRequester.compareTo(coordinator) < 0){
                rejectFlush(fh.viewID, coordinator);
                ...
                synchronized(sharedLock){
                    flushCoordinator = flushRequester;
                }
            }

If the new flushRequester is < the old flushCoordinator, then the old flush is rejected/aborted. However, the new flushCoordinator is never acknowledged by an onStartFlush() (which sends FLUSH_OK). And if we moved all the protected parts of onStartFlush() into transitionToFirstPhase(), the flush members would never be set properly, so that situation needs to be resolved as well. I think this might be the problem I'm seeing, because I wait indefinitely for my flush to succeed.

I will be testing a patch this afternoon and hope to report back.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
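The tie-breaking rule described in the report (the lowest address wins, and the new winner must still receive a FLUSH_OK after the old coordinator is rejected) can be checked in isolation. The following is a minimal single-member sketch, not the FLUSH.java code: the class FlushArbiter, its methods, the String stand-ins for Address, and the choice to reject any requester larger than the current coordinator are all assumptions made for illustration. It shows that the final coordinator is the same whatever order the START_FLUSH messages arrive in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how ONE member arbitrates competing START_FLUSH requests.
// Addresses are plain Strings here; FLUSH itself compares JGroups Address objects.
class FlushArbiter {
    private final Object sharedLock = new Object();
    private String flushCoordinator;                        // current winner, null if none
    private final List<String> events = new ArrayList<>();  // what this member would "send"

    void handleStartFlush(String requester) {
        synchronized (sharedLock) {
            if (flushCoordinator == null) {
                flushCoordinator = requester;              // first request seen
                events.add("FLUSH_OK " + requester);
            } else if (requester.compareTo(flushCoordinator) < 0) {
                events.add("REJECT " + flushCoordinator);  // abort the old flush...
                flushCoordinator = requester;
                events.add("FLUSH_OK " + requester);       // ...and acknowledge the new winner
            } else {
                events.add("REJECT " + requester);         // assumption: larger requesters are rejected
            }
        }
    }

    String coordinator() {
        synchronized (sharedLock) { return flushCoordinator; }
    }

    List<String> events() {
        synchronized (sharedLock) { return new ArrayList<>(events); }
    }

    // Delivers the same requests in the given order and returns the final winner.
    static String winnerFor(String... arrivalOrder) {
        FlushArbiter member = new FlushArbiter();
        for (String requester : arrivalOrder) member.handleStartFlush(requester);
        return member.coordinator();
    }

    public static void main(String[] args) {
        System.out.println(winnerFor("C", "B", "A")); // A
        System.out.println(winnerFor("A", "C", "B")); // A
        System.out.println(winnerFor("B", "A", "C")); // A
    }
}
```

Because every member applies the same comparison, each one converges on A. The recorded events also include the FLUSH_OK for the new coordinator, which is the acknowledgement the report notes is missing from the quoted code.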

Bela Ban (JIRA)

Tuesday, 23 October

10:42 a.m.

New subject: [JBoss JIRA] Assigned: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ]

Bela Ban reassigned JGRP-608:
-----------------------------

    Assignee: Vladimir Blagojevic  (was: Bela Ban)

Bela Ban (JIRA)

10:42 a.m.

New subject: [JBoss JIRA] Updated: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ]

Bela Ban updated JGRP-608:
--------------------------

    Fix Version/s: 2.6
         Priority: Critical  (was: Major)

Vladimir Blagojevic (JIRA)

10:54 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12383943 ]

Vladimir Blagojevic commented on JGRP-608:
------------------------------------------

Michael,

I was aware of these problems, but as we were in the process of simplifying FLUSH (JGRP-598), I never got around to verifying how concurrent flush works with the revised FLUSH version. Would you please verify FLUSH.java version 1.75, which I checked in this morning, and see whether you still experience these problems?

Vladimir

Michael Newcomb (JIRA)

11:30 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12383946 ]

Michael Newcomb commented on JGRP-608:
--------------------------------------

I will check it out. Upon initial inspection:

    if(!flushInProgress){
        flushInProgress = true;
        sendBlockUpToChannel();
        onStartFlush(msg.getSrc(), fh);

You might want to use an AtomicBoolean for flushInProgress and use flushInProgress.compareAndSet(false, true) as your test. As it is, it might lead to more problems because of multiple blocks going up the channel.

I do see that the second problem (not notifying the winning requester that his flush is going to be accepted) has been fixed by calling onStartFlush():

    if(flushRequester.compareTo(coordinator) < 0){
        rejectFlush(fh.viewID, coordinator);
        if(log.isDebugEnabled()){
            log.debug("Rejecting flush at " + localAddress
                      + " to current flush coordinator "
                      + coordinator
                      + " and switching flush coordinator to "
                      + flushRequester);
        }
        onStartFlush(flushRequester, fh);

However, the situation still exists where the winner of the initial flushInProgress = true may not have set flushCoordinator yet, and another flush requester could execute the following while flushCoordinator is still unset:

    synchronized(sharedLock){
        if(flushCoordinator != null)
            coordinator = flushCoordinator;
        else
            coordinator = flushRequester;
    }

So, in that sense, you really can't even use a boolean; you need to execute the code that sets the flushCoordinator (and the other onStartFlush() state) while checking whether a flush is in progress, all in one protected section of code.
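The AtomicBoolean suggestion addresses the check-then-act race on flushInProgress: with a plain boolean, two threads can both see false and both send BLOCK up. Below is a minimal sketch, assuming a hypothetical FlushGate class in which a counter stands in for sendBlockUpToChannel(); it is not FLUSH.java. As the comment itself points out, this alone does not make flushCoordinator visible to a losing requester.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the test-and-set of flushInProgress is one atomic step,
// so at most one caller ever passes the gate.
class FlushGate {
    private final AtomicBoolean flushInProgress = new AtomicBoolean(false);
    private final AtomicInteger blockCount = new AtomicInteger(); // stands in for sendBlockUpToChannel()

    void handleStartFlush() {
        if (flushInProgress.compareAndSet(false, true)) {
            blockCount.incrementAndGet(); // reached by exactly one caller
        }
    }

    int blocksSent() { return blockCount.get(); }

    // Releases `threads` callers at the same moment; returns how many BLOCKs were "sent".
    static int race(int threads) {
        FlushGate gate = new FlushGate();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await();          // all workers wait here, then race
                    gate.handleStartFlush();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return gate.blocksSent();
    }

    public static void main(String[] args) {
        System.out.println(race(16)); // 1
    }
}
```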

Vladimir Blagojevic (JIRA)

1:35 p.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12383974 ]

Vladimir Blagojevic commented on JGRP-608:
------------------------------------------

Yes, you are right. flushCoordinator and flushInProgress should be in the same protected section of the code.
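Putting flushInProgress and flushCoordinator in the same protected section means the "who starts the flush" and "who is coordinator" decisions become one atomic step. The sketch below illustrates that idea only; FlushState, onStartFlushRequest() and the Decision enum are hypothetical names, Strings stand in for Address, and the actual change in FLUSH.java may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: flushInProgress and flushCoordinator are read and written
// under one lock, so a losing requester can never see "flush in progress" while
// flushCoordinator is still null.
class FlushState {
    enum Decision { START_NEW_FLUSH, SWITCH_TO_REQUESTER, REJECT_REQUESTER }

    private final Object sharedLock = new Object();
    private boolean flushInProgress;
    private String flushCoordinator; // String stands in for Address

    Decision onStartFlushRequest(String requester) {
        synchronized (sharedLock) {
            if (!flushInProgress) {
                flushInProgress = true;
                flushCoordinator = requester;   // set in the same critical section
                return Decision.START_NEW_FLUSH;
            }
            // flushInProgress implies flushCoordinator != null here
            if (requester.compareTo(flushCoordinator) < 0) {
                flushCoordinator = requester;   // lower address takes over
                return Decision.SWITCH_TO_REQUESTER;
            }
            return Decision.REJECT_REQUESTER;
        }
    }

    String coordinator() {
        synchronized (sharedLock) { return flushCoordinator; }
    }

    // Runs one thread per requester, checks that exactly one thread started
    // the flush, and returns the final coordinator.
    static String raceAll(String... requesters) {
        FlushState state = new FlushState();
        AtomicInteger starts = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (String r : requesters) {
            threads.add(new Thread(() -> {
                if (state.onStartFlushRequest(r) == Decision.START_NEW_FLUSH) {
                    starts.incrementAndGet();
                }
            }));
        }
        threads.forEach(Thread::start);
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (starts.get() != 1) throw new IllegalStateException("starts = " + starts.get());
        return state.coordinator();
    }

    public static void main(String[] args) {
        System.out.println(raceAll("C", "B", "A", "D")); // A
    }
}
```

Whatever the thread interleaving, exactly one request starts the flush and the coordinator only ever moves to a lower address, so the lowest requester ends up as coordinator.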

Vladimir Blagojevic (JIRA)

Wednesday, 24 October

8:54 a.m.

New subject: [JBoss JIRA] Resolved: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ]

Vladimir Blagojevic resolved JGRP-608.
--------------------------------------

    Resolution: Done

Commit set: FLUSH.java 1.76

Vladimir Blagojevic (JIRA)

Thursday, 25 October

1:20 p.m.

New subject: [JBoss JIRA] Reopened: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ]

Vladimir Blagojevic reopened JGRP-608:
--------------------------------------

Lock scope is too big. There are test failures with this "fix".

Vladimir Blagojevic (JIRA)

Tuesday, 30 October

8:36 a.m.

New subject: [JBoss JIRA] Resolved: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ]

Vladimir Blagojevic resolved JGRP-608.
--------------------------------------

    Resolution: Done

Smaller lock scope. Commit set: protocols/pbcast/FLUSH.java 1.77
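The thread does not say what "smaller lock scope" means in FLUSH.java 1.77. One common way to shrink a lock while keeping the decision atomic is to decide under the lock and perform side effects after releasing it. The sketch below shows only that general pattern: NarrowLockFlush, the Consumer standing in for up-calls and message sends, lockForTesting(), and rejecting larger requesters are all hypothetical choices, not the committed fix.

```java
import java.util.function.Consumer;

// Hypothetical sketch of "decide under the lock, act outside it".
// Only the state transition is guarded; passing BLOCK up and sending
// FLUSH_OK / reject messages happen after sharedLock is released.
class NarrowLockFlush {
    private final Object sharedLock = new Object();
    private final Consumer<String> out; // stands in for up-calls and message sends
    private boolean flushInProgress;
    private String flushCoordinator;    // String stands in for Address

    NarrowLockFlush(Consumer<String> out) { this.out = out; }

    // Exposed only so a test can check that side effects run without the lock.
    Object lockForTesting() { return sharedLock; }

    void handleStartFlush(String requester) {
        boolean sendBlock = false;
        String toReject = null;
        String toAck = null;

        synchronized (sharedLock) {            // short critical section: state only
            if (!flushInProgress) {
                flushInProgress = true;
                flushCoordinator = requester;
                sendBlock = true;
                toAck = requester;
            } else if (requester.compareTo(flushCoordinator) < 0) {
                toReject = flushCoordinator;   // old coordinator loses
                flushCoordinator = requester;
                toAck = requester;
            } else {
                toReject = requester;          // assumption: larger requester is rejected
            }
        }

        // Side effects run without holding sharedLock.
        if (sendBlock) out.accept("BLOCK");
        if (toReject != null) out.accept("REJECT " + toReject);
        if (toAck != null) out.accept("FLUSH_OK " + toAck);
    }
}
```

The trade-off is that side effects from two concurrent handlers may now interleave in either order, so a real implementation would still have to reason about message ordering after the lock is released.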

...

FLUSH not safe in simultaneous flush situation
----------------------------------------------
Key: JGRP-608
URL: http://jira.jboss.com/jira/browse/JGRP-608
Project: JGroups
Issue Type: Feature Request
Affects Versions: 2.6
Reporter: Michael Newcomb
Assigned To: Vladimir Blagojevic
Priority: Critical
Fix For: 2.6

I'm running into a few problems when multiple members request a FLUSH at the same time. I am still in the process of analyzing the situation, but here are a few problems:

    private void handleStartFlush(Message msg, FlushHeader fh) {
        byte oldPhase = flushPhase.transitionToFirstPhase();
        if(oldPhase == FlushPhase.START_PHASE){
            sendBlockUpToChannel();
            onStartFlush(msg.getSrc(), fh);
        }else if(oldPhase == FlushPhase.FIRST_PHASE){
            Address flushRequester = msg.getSrc();
            Address coordinator = null;
            synchronized(sharedLock){
                if(flushCoordinator != null)
                    coordinator = flushCoordinator;
                else
                    coordinator = flushRequester;
            }

After a successful transition to the first phase, onStartFlush is called, and the flushCoordinator is set inside that method. However, a simultaneous handleStartFlush from another member would fail the transitionToFirstPhase() call, and because the flushCoordinator might not be set yet, that second handleStartFlush could end up setting 'coordinator' to the flushRequester. This could grant the second START_FLUSH a FLUSH_OK without rejecting the flush that succeeded in transitionToFirstPhase().

IMHO transitionToFirstPhase() should assign the flushCoordinator at that time to avoid the above situation; in fact, all of the protected parts of the onStartFlush() call should be made inside the protected section of transitionToFirstPhase(). That should take care of the didn't-succeed-in-transitionToFirstPhase-but-became-flushCoordinator-anyways-because-flushCoordinator-wasn't-set-yet problem.

Now, there is another problem in the way simultaneous START_FLUSHes are resolved. If the new flushRequester is < the current flushCoordinator, the new requester wins the flush fight. This is a great strategy because all the members will reach the same result. If A, B, C are members with A < B < C and they all request a flush simultaneously, the desired effect is for everyone to decide that A wins the flush fight. The following code handles that situation:

    private void handleStartFlush(Message msg, FlushHeader fh) {
        ...
        }else if(oldPhase == FlushPhase.FIRST_PHASE){
            ...
            if(flushRequester.compareTo(coordinator) < 0){
                rejectFlush(fh.viewID, coordinator);
                ...
                synchronized(sharedLock){
                    flushCoordinator = flushRequester;
                }
            }

If the new flushRequester is < the old flushCoordinator, the old flush is rejected/aborted. However, the new flushCoordinator is never acknowledged by an onStartFlush (which sends FLUSH_OK). If we moved all the protected parts of onStartFlush() to transitionToFirstPhase(), the flush members would never be set properly, so that situation needs to be resolved as well. I think this might be the problem I'm seeing, because I wait indefinitely for my flush to succeed. I will be testing a patch this afternoon and hope to report back.
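
A minimal sketch of the fix proposed above, under a deliberately simplified model: member names stand in for Address, a boolean stands in for FlushPhase, and returned action strings stand in for the messages FLUSH would send. FlushArbiter and its methods are hypothetical, not the code in FLUSH.java. The phase transition and the coordinator assignment happen in one critical section, and when a lower-addressed requester wins, the sketch both rejects the old flush and acknowledges the new one.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch, not JGroups code: entering the first phase and choosing the
 * flush coordinator are one atomic step, so a concurrent START_FLUSH can never see
 * "first phase entered, coordinator not yet set".
 */
public class FlushArbiter {

    private final Object sharedLock = new Object();
    private boolean inFirstPhase = false;   // stands in for FlushPhase.FIRST_PHASE
    private String flushCoordinator = null; // member names stand in for Address

    /** Returns the actions the caller should perform after the lock is released. */
    public List<String> handleStartFlush(String requester) {
        List<String> actions = new ArrayList<>();
        synchronized (sharedLock) {
            if (!inFirstPhase) {
                // Phase transition and coordinator assignment under the same lock.
                inFirstPhase = true;
                flushCoordinator = requester;
                actions.add("BLOCK");
                actions.add("FLUSH_OK:" + requester);
            } else if (requester.compareTo(flushCoordinator) < 0) {
                // Lower address wins: abort the old flush AND acknowledge the new one.
                actions.add("REJECT:" + flushCoordinator);
                flushCoordinator = requester;
                actions.add("FLUSH_OK:" + requester);
            } else if (!requester.equals(flushCoordinator)) {
                // Higher address loses: reject it, keep the current coordinator.
                actions.add("REJECT:" + requester);
            }
            // A repeated request from the current coordinator produces no actions.
        }
        return actions;
    }

    public String coordinator() {
        synchronized (sharedLock) {
            return flushCoordinator;
        }
    }

    public static void main(String[] args) {
        FlushArbiter member = new FlushArbiter();
        System.out.println(member.handleStartFlush("B")); // [BLOCK, FLUSH_OK:B]
        System.out.println(member.handleStartFlush("C")); // [REJECT:C]
        System.out.println(member.handleStartFlush("A")); // [REJECT:B, FLUSH_OK:A]
        System.out.println(member.coordinator());         // A
    }
}
```

Returning the actions instead of sending messages while holding sharedLock keeps network I/O out of the critical section; whether the real fix took this shape is not stated in the thread.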

Michael Newcomb (JIRA)

Wednesday, 7 November

10:19 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386509 ] Michael Newcomb commented on JGRP-608: -------------------------------------- Vladimir, I think there is still an issue in a simultaneous flush situation. I am seeing block->unblock, block->unblock as I expect, but occasionally I see block->unblock->unblock, and that screws my software up. It's almost as if 2 flushes succeeded or something. I am still researching, but it seems that simultaneous flushes are interfering with one another somehow.

Vladimir Blagojevic (JIRA)

11:21 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386543 ] Vladimir Blagojevic commented on JGRP-608: ------------------------------------------ Ok. Let's reproduce it if possible. I'll abstract a unit test that attempts to recreate this scenario. Look at FlushTest in our testsuite; it checks for event sequence. What we need is a version of ConcurrentStartupTest and ConcurrentStateTransferTest that checks for event sequence. I'll try to add it today. Have you had a look at the code changes?
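
An event-sequence check of the kind Vladimir mentions could look roughly like the following. EventSequenceChecker is a hypothetical helper, not the check in FlushTest: it takes the recorded callback names for one channel and reports the first place where block and unblock fail to alternate, which is exactly the block->unblock->unblock pattern reported above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical checker for the block/unblock callbacks recorded on one channel. */
public class EventSequenceChecker {

    /**
     * Returns null if "block" and "unblock" strictly alternate (starting with "block"),
     * otherwise a description of the first violation.
     */
    public static String firstViolation(List<String> events) {
        boolean blocked = false;
        for (int i = 0; i < events.size(); i++) {
            String e = events.get(i);
            if (e.equals("block")) {
                if (blocked) return "block at index " + i + " while already blocked";
                blocked = true;
            } else if (e.equals("unblock")) {
                if (!blocked) return "unblock at index " + i + " without a preceding block";
                blocked = false;
            }
            // Other events (for example view changes) are ignored in this sketch.
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstViolation(Arrays.asList("block", "unblock", "block", "unblock")));
        // null
        System.out.println(firstViolation(Arrays.asList("block", "unblock", "unblock")));
        // unblock at index 2 without a preceding block
    }
}
```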

Michael Newcomb (JIRA)

11:34 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386547 ] Michael Newcomb commented on JGRP-608: -------------------------------------- I was working on 1.77 and am in the process of getting the latest from CVS (I see there are some changes). I will try to reproduce it with the latest version from CVS (sometimes hard because of timing issues).

Vladimir Blagojevic (JIRA)

11:41 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386551 ] Vladimir Blagojevic commented on JGRP-608: ------------------------------------------ Current version is 1.81. I looked again at FlushTest and it already tests the so-called concurrent flush scenario. All channels are spawned at 1 sec intervals, and for state transfers all latches are released at the same moment, thus creating a concurrent flush scenario. And we do check for event sequence in these tests. I'll try to run these tests repeatedly to see if I can recreate the unblock, unblock sequence that you mention.
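
The latch technique described above (release all waiting parties at the same moment to force concurrent flushes) can be sketched with java.util.concurrent.CountDownLatch. In this hypothetical harness the "members" are plain threads and the flush arbitration is reduced to an atomic lowest-name-wins update; ConcurrentFlushHarness and race() are illustrative names and do not use JGroups.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical harness: N members are released by one latch so their requests race. */
public class ConcurrentFlushHarness {

    public static String race(List<String> members) {
        CountDownLatch startGate = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(members.size());
        AtomicReference<String> coordinator = new AtomicReference<>(null);

        for (String m : members) {
            new Thread(() -> {
                try {
                    startGate.await(); // every member parks here until released
                    // Lowest name wins; accumulateAndGet retries on contention, so no update is lost.
                    coordinator.accumulateAndGet(m,
                        (cur, req) -> (cur == null || req.compareTo(cur) < 0) ? req : cur);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        startGate.countDown(); // release all members at the same moment
        try {
            if (!done.await(5, TimeUnit.SECONDS)) throw new IllegalStateException("timed out");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return coordinator.get();
    }

    public static void main(String[] args) {
        System.out.println(race(Arrays.asList("C", "A", "E", "B", "D"))); // A
    }
}
```

Because every thread waits on startGate before the single countDown(), the updates overlap as closely as the scheduler allows; running such a harness many times in a loop is one way to surface timing-dependent sequences like unblock, unblock.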

Michael Newcomb (JIRA)

12:18 p.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386560 ] Michael Newcomb commented on JGRP-608: --------------------------------------

    197 private boolean startFlush(Event evt, int numberOfAttempts, boolean isRetry) {
    198     boolean successfulFlush = false;
    199     if(!flushInProgress.get() || isRetry){
    200         flush_promise.reset();

While reviewing the new code, I came across the above in FLUSH.java. Does isRetry really outweigh the fact that a flush is in progress? If a flush is in progress, is there a need to retry starting a flush? Also, further down, it is possible for a flush to complete sometime between the initial timeout (waiting for the flush to complete) and the time the flush is attempted again. In fact, a flush attempt that timed out could even complete after the flush has just been reset (line 200). I think that flush messages need to be tracked by member/viewid/flush# or something, so that each flush message can be checked against a flush 'id'. Also, I think that having a flush 'object' tracking all the participants, acknowledged members, and even the rejected members would aid in the debugging process. Anyway, I'm taking the latest code into our lab and will report back soon.
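
One way to realize the member/viewid/flush# idea is sketched below, assuming members are identified by name and view ids are plain longs. FlushAttempt and FlushId are hypothetical types, not part of JGroups. A FLUSH_OK tagged with the id of an earlier, timed-out attempt is ignored, so it cannot complete a flush that was reset and retried, and the object records participants, acknowledged members and rejecting members as suggested above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch: one flush attempt and the replies that belong to it (not thread-safe). */
public class FlushAttempt {

    /** Identifies a single attempt: requester, view id and attempt number. */
    public static final class FlushId {
        final String requester;
        final long viewId;
        final int attempt;

        public FlushId(String requester, long viewId, int attempt) {
            this.requester = requester;
            this.viewId = viewId;
            this.attempt = attempt;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FlushId)) return false;
            FlushId other = (FlushId) o;
            return viewId == other.viewId && attempt == other.attempt
                && requester.equals(other.requester);
        }

        @Override
        public int hashCode() {
            return Objects.hash(requester, viewId, attempt);
        }
    }

    private final FlushId id;
    private final Set<String> participants;
    private final Set<String> acknowledged = new HashSet<>();
    private final Set<String> rejected = new HashSet<>();

    public FlushAttempt(FlushId id, Set<String> participants) {
        this.id = id;
        this.participants = new HashSet<>(participants);
    }

    /** Records a FLUSH_OK; returns false for stale ids, non-participants and duplicates. */
    public boolean onFlushOk(FlushId replyId, String member) {
        if (!id.equals(replyId) || !participants.contains(member)) return false;
        return acknowledged.add(member);
    }

    /** Records a rejection only if it belongs to this attempt. */
    public void onReject(FlushId replyId, String member) {
        if (id.equals(replyId) && participants.contains(member)) rejected.add(member);
    }

    /** Complete once every participant acknowledged this attempt and nobody rejected it. */
    public boolean isComplete() {
        return rejected.isEmpty() && acknowledged.containsAll(participants);
    }

    public static void main(String[] args) {
        Set<String> members = new HashSet<>(Arrays.asList("A", "B"));
        FlushAttempt retry = new FlushAttempt(new FlushId("A", 7, 2), members);
        System.out.println(retry.onFlushOk(new FlushId("A", 7, 1), "B")); // false: late reply to attempt 1
        System.out.println(retry.onFlushOk(new FlushId("A", 7, 2), "A")); // true
        System.out.println(retry.onFlushOk(new FlushId("A", 7, 2), "B")); // true
        System.out.println(retry.isComplete());                           // true
    }
}
```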

Michael Newcomb (JIRA)

1:24 p.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386573 ] Michael Newcomb commented on JGRP-608: -------------------------------------- Got the latest code to produce 2 unblocks in a row. It seemed to happen when it was switching from one flush coordinator to another: because a flush requester was '<' the current flush coordinator, the coordinator was switched. Also, my programs do the following:

1. channel.connect()
2. channel.getState()

So there are flushes going out for the connect() and flushes going out for the getState(), and I think the flush path for views might be clashing with the flush path for normal JChannel.startFlush/stopFlush calls... Still researching...

Vladimir Blagojevic (JIRA)

1:41 p.m.

New subject: [JBoss JIRA] Reopened: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ] Vladimir Blagojevic reopened JGRP-608: -------------------------------------- Reopening to investigate further.

Michael Newcomb (JIRA)

Friday, 9 November

8:16 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386842 ] Michael Newcomb commented on JGRP-608: -------------------------------------- JGRP-618 seems to have resolved this. Well, I no longer see consecutive 'unblock' calls with 5 simultaneous joining/state-requesting group members. As a side note, do you have an estimate of when connect-with-state-transfer will be done? I could really use it for my application. Thanks again, Michael

Vladimir Blagojevic (JIRA)

8:42 a.m.

New subject: [JBoss JIRA] Resolved: (JGRP-608) FLUSH not safe in simultaneous flush situation

[ http://jira.jboss.com/jira/browse/JGRP-608?page=all ]

Vladimir Blagojevic resolved JGRP-608.
--------------------------------------

Resolution: Done

Michael,

Excellent! Connect and state transfer is already implemented. Give it a try.

Cheers.



jboss-jira@lists.jboss.org


participants (3)

Bela Ban (JIRA)
Michael Newcomb (JIRA)
Vladimir Blagojevic (JIRA)
