[jboss-jira] [JBoss JIRA] Commented: (JGRP-1229) Deadlock during flush

Mon Sep 20 00:57:28 EDT 2010

    [ https://jira.jboss.org/browse/JGRP-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12551747#action_12551747 ] 

Bela Ban commented on JGRP-1229:
--------------------------------

I think that making ABORT_FLUSH an OOB message is OK. However, we cannot make the other flush messages OOB, as they need to be received in order.

Vladimir, can you go over your design and make sure that changing ABORT_FLUSH to OOB doesn't cause any bad behavior?
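
To illustrate the ordering point, here is a minimal JDK-only sketch that does not use JGroups. It assumes regular messages are delivered one at a time, in order, by a single thread, and that OOB messages are delivered by a separate pool with no ordering. The class and method names (OobDeliverySketch, run) are hypothetical. A handler that blocks until the flush is aborted can only be released if the abort is delivered outside the ordered path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation; this is not JGroups code.
class OobDeliverySketch {

    /**
     * Returns true if the blocked handler was released within timeoutMs.
     * abortIsOob = true delivers the abort on a separate thread;
     * false queues it behind the blocked handler on the in-order thread.
     */
    static boolean run(boolean abortIsOob, long timeoutMs) throws InterruptedException {
        // Assumption: regular messages are processed FIFO by one thread.
        ExecutorService inOrder = Executors.newSingleThreadExecutor();
        // Assumption: OOB messages are processed by a separate pool, unordered.
        ExecutorService oob = Executors.newCachedThreadPool();
        CountDownLatch flushAborted = new CountDownLatch(1);
        CountDownLatch handlerDone = new CountDownLatch(1);
        try {
            // Regular message whose handler blocks until the flush is aborted.
            inOrder.submit(() -> {
                try {
                    flushAborted.await();
                    handlerDone.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The abort message: either OOB or queued behind the blocked handler.
            Runnable abort = flushAborted::countDown;
            (abortIsOob ? oob : inOrder).submit(abort);
            return handlerDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            inOrder.shutdownNow();
            oob.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("abort as regular message released handler: " + run(false, 300));
        System.out.println("abort as OOB message released handler: " + run(true, 300));
    }
}
```

With the abort queued as a regular message, the handler stays blocked until the timeout. With the abort delivered out of band, it is released. The same property is why the remaining flush messages should stay regular: an unordered pool gives no guarantee about the order in which they are processed.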

> Deadlock during flush
> ---------------------
>
>                 Key: JGRP-1229
>                 URL: https://jira.jboss.org/browse/JGRP-1229
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.10
>         Environment: Windows Vista, JDK 1.6.0_10, JGroups 2.10.0, JGroups config: flush-tcp.xml with TCPGOSSIP
>            Reporter: Markus Hampel
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.11
>
>
> In my environment, a deadlock occurs during flush under the following circumstances: a message is being processed whose handler sends another multicast message. While that message is being processed, GMS starts a flush. The send of the multicast message blocks, and the flush fails because processing of the original message never completes. The ABORT_FLUSH message isn't processed because it is not an OOB message, so the blocked send never unblocks.
> The following thread dump shows the GMS thread and the message processing thread:
> "ViewHandler,Test,#A" prio=6 tid=0x04015c00 nid=0x1e8 waiting on condition [0x04b0f000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.jgroups.util.Util.sleep(Util.java:1298)
>         at org.jgroups.util.Util.sleepRandom(Util.java:1377)
>         at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:721)
>         at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:694)
>         at org.jgroups.protocols.pbcast.CoordGmsImpl.handleMembershipChange(CoordGmsImpl.java:189)
>         at org.jgroups.protocols.pbcast.GMS$ViewHandler.process(GMS.java:1390)
>         at org.jgroups.protocols.pbcast.GMS$ViewHandler.run(GMS.java:1344)
>         at java.lang.Thread.run(Thread.java:619)
> "Incoming-4,Test,#A" prio=6 tid=0x04014000 nid=0x11f4 waiting on condition [0x0497e000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x248b2cd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
>         at org.jgroups.protocols.pbcast.FLUSH.blockMessageDuringFlush(FLUSH.java:321)
>         at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:254)
>         at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:894)
>         at org.jgroups.JChannel.down(JChannel.java:1623)
>         at org.jgroups.JChannel.send(JChannel.java:724)
>         ... (Usercode)
>         at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:640)
>         at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
>         at org.jgroups.JChannel.up(JChannel.java:1453)
>         at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
>         at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:435)
>         at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:151)
>         at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
>         at org.jgroups.protocols.FC.up(FC.java:474)
>         at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
>         at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
>         at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
>         at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
>         at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
>         at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
>         at org.jgroups.protocols.FD.up(FD.java:266)
>         at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
>         at org.jgroups.protocols.Discovery.up(Discovery.java:281)
>         at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
>         at org.jgroups.protocols.TP.access$100(TP.java:56)
>         at org.jgroups.protocols.TP$3.run(TP.java:933)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> DEBUG output:
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: new=[#C], suspected=[], leaving=[], new view: [#A|1] [#A, #C]
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: flush coordinator  is starting FLUSH with participants [#A]
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: received START_FLUSH, responded with FLUSH_COMPLETED to #A
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: blocking for 10000ms
>   19.08.2010 13:47:43 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: timed out waiting for flush responses after 2000 ms. Rejecting flush to participants [#A]
>   19.08.2010 13:47:51 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: blocking for 10000ms
>   19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl warn
>   WARNUNG: #A: GMS flush by coordinator failed
>   19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
>   FEIN: resuming message garbage collection
>   19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
>   FEIN: new=[#B], suspected=[], leaving=[], new view: [#A|2] [#A, #C, #B]
>   19.08.2010 13:48:01 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: blocking for 10000ms
>   19.08.2010 13:48:06 org.jgroups.logging.JDKLogImpl warn
>   WARNUNG: #A: GMS flush by coordinator failed
>   ...
> My solution (it works in my environment, but I can't judge whether it is correct in general):
>   Changes to org.jgroups.protocols.pbcast.FLUSH
>     
>     // 1. set the OOB flag for the ABORT_FLUSH message
>     552 private void rejectFlush(Collection<? extends Address> participants, long viewId) {
>     553   for (Address flushMember : participants) {
>     554     Message reject = new Message(flushMember, localAddress, null);
>     555!    reject.setFlag(Message.OOB);
>     // 2. the processing of the ABORT_FLUSH message has at least to set "isBlockingFlushDown = false"
>     //     but I copied the complete implementation from the "onStopFlush" method
>     336  public Object up(Event evt) {
>     ...
>     368    case FlushHeader.ABORT_FLUSH:
>     ...
>     379!      synchronized (sharedLock) {
>     380!        flushCompletedMap.clear();
>     381!        flushNotCompletedMap.clear();
>     382!        flushMembers.clear();
>     383!        suspected.clear();
>     384!        flushCoordinator = null;
>     385!        flushCompleted = false;
>     386!      }
>     387!      blockMutex.lock();
>     388!      try {
>     389!        isBlockingFlushDown = false;
>     390!        notBlockedDown.signalAll();
>     391!      } finally {
>     392!        blockMutex.unlock();
>     393!      }
>     394!      flushInProgress.set(false);
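
The blocking and unblocking pattern that the second change relies on can be sketched in isolation. The following is a minimal JDK-only sketch, not the actual FLUSH implementation. The field names blockMutex, notBlockedDown and isBlockingFlushDown are taken from the patch above, while the class name FlushBlockSketch, the method onAbortFlush, and the method bodies are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; this is not the JGroups FLUSH class.
class FlushBlockSketch {
    private final Lock blockMutex = new ReentrantLock();
    private final Condition notBlockedDown = blockMutex.newCondition();
    private boolean isBlockingFlushDown = true; // guarded by blockMutex

    /** Blocks the sender while a flush is in progress; true if unblocked before the timeout. */
    boolean blockMessageDuringFlush(long timeoutMs) throws InterruptedException {
        long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        blockMutex.lock();
        try {
            // Loop to tolerate spurious wakeups; awaitNanos returns the time left.
            while (isBlockingFlushDown && remaining > 0) {
                remaining = notBlockedDown.awaitNanos(remaining);
            }
            return !isBlockingFlushDown;
        } finally {
            blockMutex.unlock();
        }
    }

    /** The minimum an abort handler has to do: clear the flag and wake blocked senders. */
    void onAbortFlush() {
        blockMutex.lock();
        try {
            isBlockingFlushDown = false;
            notBlockedDown.signalAll();
        } finally {
            blockMutex.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        FlushBlockSketch flush = new FlushBlockSketch();
        // The abort arrives on another thread, as an OOB message would.
        Thread aborter = new Thread(flush::onAbortFlush);
        aborter.start();
        boolean unblocked = flush.blockMessageDuringFlush(10_000);
        aborter.join();
        System.out.println("sender unblocked: " + unblocked);
    }
}
```

If onAbortFlush is never called, blockMessageDuringFlush returns false once the timeout expires, which corresponds to the repeated "blocking for 10000ms" lines in the debug output.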
