Bela Ban commented on JGRP-1229:
--------------------------------
I think that making ABORT-FLUSH an OOB message is OK. However, we cannot make the other
flush messages OOB, as they need to be received in order.
Vladimir, can you go over your design and make sure that changing ABORT-FLUSH to OOB
doesn't cause any bad behavior?
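For context, a minimal sketch of the distinction (JGroups 2.x API): regular messages are
delivered FIFO per sender, while OOB messages bypass that ordering and are delivered on a
separate thread pool, so an OOB message can still get through while ordered delivery is
blocked.

    Message ordered = new Message(null, null, "data");  // regular: delivered in sender order
    Message oob = new Message(null, null, "abort");
    oob.setFlag(Message.OOB);  // OOB: skips per-sender ordering, delivered even while
                               // regular delivery is stalled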
Deadlock during flush
---------------------
Key: JGRP-1229
URL: https://jira.jboss.org/browse/JGRP-1229
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Windows Vista, JDK 1.6.0_10, JGroups 2.10.0, JGroups config:
flush-tcp.xml with TCPGOSSIP
Reporter: Markus Hampel
Assignee: Vladimir Blagojevic
Fix For: 2.11
In my environment a deadlock occurs during flush under the following circumstances: a message
is being processed which sends another multicast message. During the processing of the message,
GMS starts a flush. The send of the multicast message blocks, and the flush fails
because the message being processed never completes. The ABORT_FLUSH message isn't processed
because it's not an OOB message, so the blocked send never unblocks.
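A minimal sketch of the pattern that triggers this (a hypothetical repro class, assuming the
flush-tcp.xml config from this report):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;

    public class FlushDeadlockRepro {
        public static void main(String[] args) throws Exception {
            final JChannel ch = new JChannel("flush-tcp.xml");
            ch.setReceiver(new ReceiverAdapter() {
                @Override
                public void receive(Message msg) {
                    try {
                        // Sending from inside receive() goes down through FLUSH;
                        // if GMS starts a flush before this handler returns, the
                        // send blocks in blockMessageDuringFlush() while the flush
                        // waits for this handler to finish: deadlock.
                        ch.send(new Message(null, null, "reply"));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            ch.connect("Test");
            ch.send(new Message(null, null, "trigger"));
        }
    }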
The following thread dump shows the GMS thread and the message-processing thread:
"ViewHandler,Test,#A" prio=6 tid=0x04015c00 nid=0x1e8 waiting on condition
[0x04b0f000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.jgroups.util.Util.sleep(Util.java:1298)
at org.jgroups.util.Util.sleepRandom(Util.java:1377)
at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:721)
at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:694)
at
org.jgroups.protocols.pbcast.CoordGmsImpl.handleMembershipChange(CoordGmsImpl.java:189)
at org.jgroups.protocols.pbcast.GMS$ViewHandler.process(GMS.java:1390)
at org.jgroups.protocols.pbcast.GMS$ViewHandler.run(GMS.java:1344)
at java.lang.Thread.run(Thread.java:619)
"Incoming-4,Test,#A" prio=6 tid=0x04014000 nid=0x11f4 waiting on condition
[0x0497e000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x248b2cd8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
at org.jgroups.protocols.pbcast.FLUSH.blockMessageDuringFlush(FLUSH.java:321)
at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:254)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:894)
at org.jgroups.JChannel.down(JChannel.java:1623)
at org.jgroups.JChannel.send(JChannel.java:724)
... (Usercode)
at
org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:640)
at
org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
at org.jgroups.JChannel.up(JChannel.java:1453)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:435)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:151)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
at org.jgroups.protocols.FC.up(FC.java:474)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
at org.jgroups.protocols.FD.up(FD.java:266)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
at org.jgroups.protocols.Discovery.up(Discovery.java:281)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
at org.jgroups.protocols.TP.access$100(TP.java:56)
at org.jgroups.protocols.TP$3.run(TP.java:933)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
DEBUG output:
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: new=[#C], suspected=[], leaving=[], new view: [#A|1] [#A, #C]
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: #A: flush coordinator is starting FLUSH with participants [#A]
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: #A: received START_FLUSH, responded with FLUSH_COMPLETED to #A
19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
FINE: #A: blocking for 10000ms
19.08.2010 13:47:43 org.jgroups.logging.JDKLogImpl debug
FINE: #A: timed out waiting for flush responses after 2000 ms. Rejecting flush to participants [#A]
19.08.2010 13:47:51 org.jgroups.logging.JDKLogImpl debug
FINE: #A: blocking for 10000ms
19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl warn
WARNING: #A: GMS flush by coordinator failed
19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
FINE: resuming message garbage collection
19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
FINE: new=[#B], suspected=[], leaving=[], new view: [#A|2] [#A, #C, #B]
19.08.2010 13:48:01 org.jgroups.logging.JDKLogImpl debug
FINE: #A: blocking for 10000ms
19.08.2010 13:48:06 org.jgroups.logging.JDKLogImpl warn
WARNING: #A: GMS flush by coordinator failed
...
My solution (it works in my environment, but I can't judge whether it's fully correct):
Changes to org.jgroups.protocols.pbcast.FLUSH:

// 1. Set the OOB flag on the ABORT_FLUSH message:

private void rejectFlush(Collection<? extends Address> participants, long viewId) {
    for (Address flushMember : participants) {
        Message reject = new Message(flushMember, localAddress, null);
        reject.setFlag(Message.OOB);  // <-- added
        ...

// 2. The handler for ABORT_FLUSH at least has to set "isBlockingFlushDown = false",
//    but I copied the complete reset logic from the onStopFlush() method:

public Object up(Event evt) {
    ...
    case FlushHeader.ABORT_FLUSH:
        ...
        // <-- everything below added
        synchronized (sharedLock) {
            flushCompletedMap.clear();
            flushNotCompletedMap.clear();
            flushMembers.clear();
            suspected.clear();
            flushCoordinator = null;
            flushCompleted = false;
        }
        blockMutex.lock();
        try {
            isBlockingFlushDown = false;
            notBlockedDown.signalAll();
        } finally {
            blockMutex.unlock();
        }
        flushInProgress.set(false);
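As a side note, since the block above duplicates onStopFlush(), the two paths could share
one reset routine so they stay in sync. A hedged sketch (hypothetical helper name, using
only the fields that appear in the patch above):

    private void resetFlushState() {
        synchronized (sharedLock) {
            flushCompletedMap.clear();
            flushNotCompletedMap.clear();
            flushMembers.clear();
            suspected.clear();
            flushCoordinator = null;
            flushCompleted = false;
        }
        blockMutex.lock();
        try {
            // wake up any senders blocked in blockMessageDuringFlush()
            isBlockingFlushDown = false;
            notBlockedDown.signalAll();
        } finally {
            blockMutex.unlock();
        }
        flushInProgress.set(false);
    }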