Bela Ban commented on JGRP-1229:
--------------------------------
I think that making ABORT-FLUSH an OOB message is OK. However, we cannot make the other
flush messages OOB, as they need to be received in order.
Vladimir, can you go over your design and make sure that changing ABORT-FLUSH to OOB
doesn't cause any bad behavior?
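For context, a minimal sketch of the distinction (JGroups 2.x API): regular messages are
delivered FIFO per sender, while OOB messages bypass that ordering and are delivered on a
separate thread pool, so an OOB message can still get through while ordered delivery is
blocked.

    Message ordered = new Message(null, null, "data");  // regular: delivered in sender order
    Message oob = new Message(null, null, "abort");
    oob.setFlag(Message.OOB);  // OOB: skips per-sender ordering, delivered even while
                               // regular delivery is stalled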
Deadlock during flush
---------------------
Key: JGRP-1229
URL: https://jira.jboss.org/browse/JGRP-1229
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Windows Vista, JDK 1.6.0_10, JGroups 2.10.0, JGroups config:
flush-tcp.xml with TCPGOSSIP
Reporter: Markus Hampel
Assignee: Vladimir Blagojevic
Fix For: 2.11
In my environment a deadlock occurs during flush under the following circumstances: a message
is being processed which sends another multicast message. During the processing of the message,
GMS starts a flush. The send of the multicast message blocks, and the flush fails
because the message being processed never completes. The ABORT_FLUSH message isn't processed
because it's not an OOB message, so the blocked send never unblocks.
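A minimal sketch of the pattern that triggers this (a hypothetical repro class, assuming the
flush-tcp.xml config from this report):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;

    public class FlushDeadlockRepro {
        public static void main(String[] args) throws Exception {
            final JChannel ch = new JChannel("flush-tcp.xml");
            ch.setReceiver(new ReceiverAdapter() {
                @Override
                public void receive(Message msg) {
                    try {
                        // Sending from inside receive() goes down through FLUSH;
                        // if GMS starts a flush before this handler returns, the
                        // send blocks in blockMessageDuringFlush() while the flush
                        // waits for this handler to finish: deadlock.
                        ch.send(new Message(null, null, "reply"));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            ch.connect("Test");
            ch.send(new Message(null, null, "trigger"));
        }
    }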
The following thread dump shows the GMS thread and the message-processing thread:
"ViewHandler,Test,#A" prio=6 tid=0x04015c00 nid=0x1e8 waiting on condition
[0x04b0f000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.jgroups.util.Util.sleep(Util.java:1298)
at org.jgroups.util.Util.sleepRandom(Util.java:1377)
at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:721)
at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:694)
at
org.jgroups.protocols.pbcast.CoordGmsImpl.handleMembershipChange(CoordGmsImpl.java:189)
at org.jgroups.protocols.pbcast.GMS$ViewHandler.process(GMS.java:1390)
at org.jgroups.protocols.pbcast.GMS$ViewHandler.run(GMS.java:1344)
at java.lang.Thread.run(Thread.java:619)
"Incoming-4,Test,#A" prio=6 tid=0x04014000 nid=0x11f4 waiting on condition
[0x0497e000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x248b2cd8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
at org.jgroups.protocols.pbcast.FLUSH.blockMessageDuringFlush(FLUSH.java:321)
at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:254)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:894)
at org.jgroups.JChannel.down(JChannel.java:1623)
at org.jgroups.JChannel.send(JChannel.java:724)
... (Usercode)
at
org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:640)
at
org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
at org.jgroups.JChannel.up(JChannel.java:1453)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:435)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:151)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
at org.jgroups.protocols.FC.up(FC.java:474)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
at org.jgroups.protocols.FD.up(FD.java:266)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
at org.jgroups.protocols.Discovery.up(Discovery.java:281)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
at org.jgroups.protocols.TP.access$100(TP.java:56)
at org.jgroups.protocols.TP$3.run(TP.java:933)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
DEBUG output:
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: new=[#C], suspected=[], leaving=[], new view: [#A|1] [#A, #C]
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: #A: flush coordinator is starting FLUSH with participants [#A]
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: #A: received START_FLUSH, responded with FLUSH_COMPLETED to #A
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: #A: blocking for 10000ms
19.08.2010 13:47:43 org.jgroups.logging.JDKLogImpl debug
FINE: #A: timed out waiting for flush responses after 2000 ms. Rejecting flush to participants [#A]
19.08.2010 13:47:51 org.jgroups.logging.JDKLogImpl debug
FINE: #A: blocking for 10000ms
19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl warn
WARNING: #A: GMS flush by coordinator failed
19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
FINE: resuming message garbage collection
19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
FINE: new=[#B], suspected=[], leaving=[], new view: [#A|2] [#A, #C, #B]
19.08.2010 13:48:01 org.jgroups.logging.JDKLogImpl debug
FINE: #A: blocking for 10000ms
19.08.2010 13:48:06 org.jgroups.logging.JDKLogImpl warn
WARNING: #A: GMS flush by coordinator failed
...
My solution (it works in my environment, but I can't judge whether it's fully correct):
Changes to org.jgroups.protocols.pbcast.FLUSH:

// 1. Set the OOB flag on the ABORT_FLUSH message:

private void rejectFlush(Collection<? extends Address> participants, long viewId) {
    for (Address flushMember : participants) {
        Message reject = new Message(flushMember, localAddress, null);
        reject.setFlag(Message.OOB);  // <-- added
        ...

// 2. The handler for ABORT_FLUSH at least has to set "isBlockingFlushDown = false",
//    but I copied the complete reset logic from the onStopFlush() method:

public Object up(Event evt) {
    ...
    case FlushHeader.ABORT_FLUSH:
        ...
        // <-- everything below added
        synchronized (sharedLock) {
            flushCompletedMap.clear();
            flushNotCompletedMap.clear();
            flushMembers.clear();
            suspected.clear();
            flushCoordinator = null;
            flushCompleted = false;
        }
        blockMutex.lock();
        try {
            isBlockingFlushDown = false;
            notBlockedDown.signalAll();
        } finally {
            blockMutex.unlock();
        }
        flushInProgress.set(false);
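As a side note, since the block above duplicates onStopFlush(), the two paths could share
one reset routine so they stay in sync. A hedged sketch (hypothetical helper name, using
only the fields that appear in the patch above):

    private void resetFlushState() {
        synchronized (sharedLock) {
            flushCompletedMap.clear();
            flushNotCompletedMap.clear();
            flushMembers.clear();
            suspected.clear();
            flushCoordinator = null;
            flushCompleted = false;
        }
        blockMutex.lock();
        try {
            // wake up any senders blocked in blockMessageDuringFlush()
            isBlockingFlushDown = false;
            notBlockedDown.signalAll();
        } finally {
            blockMutex.unlock();
        }
        flushInProgress.set(false);
    }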