[jboss-jira] [JBoss JIRA] Updated: (JGRP-1229) Deadlock during flush

Thu Aug 19 08:23:11 EDT 2010

     [ https://jira.jboss.org/browse/JGRP-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban updated JGRP-1229:
---------------------------

         Assignee: Vladimir Blagojevic  (was: Bela Ban)
    Fix Version/s: 2.11


> Deadlock during flush
> ---------------------
>
>                 Key: JGRP-1229
>                 URL: https://jira.jboss.org/browse/JGRP-1229
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.10
>         Environment: Windows Vista, JDK 1.6.0_10, JGroups 2.10.0, JGroups config: flush-tcp.xml with TCPGOSSIP
>            Reporter: Markus Hampel
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.11
>
>
> In my environment a deadlock during flush occurs under following circumstances: A message is processed wich sends another multicast message. During the processing of the message GMS starts a flush. The sending of the multicast message blocks and the flush fails because the processed message doesn't end. The ABORT_FLUSH message isn't processed because it's not a OOB message. The blocked message never unblocks.
> The following Threaddump shows the GMS thread and the message processing thread:
> "ViewHandler,Test,#A" prio=6 tid=0x04015c00 nid=0x1e8 waiting on condition [0x04b0f000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.jgroups.util.Util.sleep(Util.java:1298)
>         at org.jgroups.util.Util.sleepRandom(Util.java:1377)
>         at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:721)
>         at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:694)
>         at org.jgroups.protocols.pbcast.CoordGmsImpl.handleMembershipChange(CoordGmsImpl.java:189)
>         at org.jgroups.protocols.pbcast.GMS$ViewHandler.process(GMS.java:1390)
>         at org.jgroups.protocols.pbcast.GMS$ViewHandler.run(GMS.java:1344)
>         at java.lang.Thread.run(Thread.java:619)
> "Incoming-4,Test,#A" prio=6 tid=0x04014000 nid=0x11f4 waiting on condition [0x0497e000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x248b2cd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
>         at org.jgroups.protocols.pbcast.FLUSH.blockMessageDuringFlush(FLUSH.java:321)
>         at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:254)
>         at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:894)
>         at org.jgroups.JChannel.down(JChannel.java:1623)
>         at org.jgroups.JChannel.send(JChannel.java:724)
>         ... (Usercode)
>         at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:640)
>         at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
>         at org.jgroups.JChannel.up(JChannel.java:1453)
>         at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
>         at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:435)
>         at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:151)
>         at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
>         at org.jgroups.protocols.FC.up(FC.java:474)
>         at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
>         at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
>         at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
>         at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
>         at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
>         at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
>         at org.jgroups.protocols.FD.up(FD.java:266)
>         at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
>         at org.jgroups.protocols.Discovery.up(Discovery.java:281)
>         at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
>         at org.jgroups.protocols.TP.access$100(TP.java:56)
>         at org.jgroups.protocols.TP$3.run(TP.java:933)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> DEBUG output:
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: new=[#C], suspected=[], leaving=[], new view: [#A|1] [#A, #C]
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: flush coordinator  is starting FLUSH with participants [#A]
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: received START_FLUSH, responded with FLUSH_COMPLETED to #A
>   19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: blocking for 10000ms
>   19.08.2010 13:47:43 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: timed out waiting for flush responses after 2000 ms. Rejecting flush to participants [#A]
>   19.08.2010 13:47:51 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: blocking for 10000ms
>   19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl warn
>   WARNUNG: #A: GMS flush by coordinator failed
>   19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
>   FEIN: resuming message garbage collection
>   19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
>   FEIN: new=[#B], suspected=[], leaving=[], new view: [#A|2] [#A, #C, #B]
>   19.08.2010 13:48:01 org.jgroups.logging.JDKLogImpl debug
>   FEIN: #A: blocking for 10000ms
>   19.08.2010 13:48:06 org.jgroups.logging.JDKLogImpl warn
>   WARNUNG: #A: GMS flush by coordinator failed
>   ...
> My solution (that works in my environment, but I can't overlook if it's correct):
>   Changes to org.jgroups.protocols.pbcast.FLUSH
>     
>     // 1. set the OOB flag for the ABORT_FLUSH message
>     552 private void rejectFlush(Collection<? extends Address> participants, long viewId) {
>     553   for (Address flushMember : participants) {
>     554     Message reject = new Message(flushMember, localAddress, null);
>     555!    reject.setFlag(Message.OOB);
>     // 2. the processing of the ABORT_FLUSH message has at least to set "isBlockingFlushDown = false"
>     //     but I copied the complete implementation from the "onStopFlush" method
>     336  public Object up(Event evt) {
>     ...
>     368    case FlushHeader.ABORT_FLUSH:
>     ...
>     379!      synchronized (sharedLock) {
>     380!        flushCompletedMap.clear();
>     381!        flushNotCompletedMap.clear();
>     382!        flushMembers.clear();
>     383!        suspected.clear();
>     384!        flushCoordinator = null;
>     385!        flushCompleted = false;
>     386!      }
>     387!      blockMutex.lock();
>     388!      try {
>     389!        isBlockingFlushDown = false;
>     390!        notBlockedDown.signalAll();
>     391!      } finally {
>     392!        blockMutex.unlock();
>     393!      }
>     394!      flushInProgress.set(false);

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira