Yong Hao Gao commented on JBMESSAGING-1864:
-------------------------------------------
I think that making addingBinding() and performFailover() mutually exclusive should be a
simple fix.
It requires creating a new lock used solely for this purpose.
Howard
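A minimal sketch of the idea in the comment above, assuming a dedicated lock that both
operations take before doing anything else. GuardedPostOffice, failoverBindingLock and the
method bodies are hypothetical stand-ins for illustration only; they are not the actual
MessagingPostOffice code or the committed fix.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the post office; not the real MessagingPostOffice. */
class GuardedPostOffice {
    // Assumption: a new lock whose only job is to keep addBinding() and
    // performFailover() from running at the same time.
    private final ReentrantLock failoverBindingLock = new ReentrantLock();
    private final StringBuilder trace = new StringBuilder();

    void addBinding(String queueName) {
        failoverBindingLock.lock();
        try {
            // The real code would add the binding and notify listeners here.
            trace.append("[add ").append(queueName).append("]");
        } finally {
            failoverBindingLock.unlock();
        }
    }

    void performFailover(int failedNodeId) {
        failoverBindingLock.lock();
        try {
            // The real code would take the write lock and move bindings here.
            trace.append("[failover ").append(failedNodeId).append("]");
        } finally {
            failoverBindingLock.unlock();
        }
    }

    boolean isGuardHeld() {
        return failoverBindingLock.isLocked();
    }

    String trace() {
        failoverBindingLock.lock();
        try {
            return trace.toString();
        } finally {
            failoverBindingLock.unlock();
        }
    }
}
```

In this sketch the dedicated lock is the outermost lock in both code paths, so the two
operations are serialized before either one touches any other lock. Whether that ordering
can be kept in the real code would need to be checked against the actual call paths.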
Deadlock in creating consumer during failover time
--------------------------------------------------
Key: JBMESSAGING-1864
URL:
https://issues.jboss.org/browse/JBMESSAGING-1864
Project: JBoss Messaging
Issue Type: Bug
Components: JMS Clustering
Affects Versions: 1.4.0.SP3.CP12, 1.4.8.GA
Reporter: Yong Hao Gao
Assignee: Yong Hao Gao
Fix For: 1.4.0.SP3.CP13, 1.4.8.SP1
When a node is performing failover for a dead node, a deadlock can occur if a new binding
is created at the same time (triggered by creating a new subscription on a topic). To
reproduce it, you need to manually 'instrument' the code to create suitable timing:
1 Set up a 2-node cluster: node0 and node1.
2 Kill node0 and let the failover happen. Pause the failover thread in
MessagingPostOffice.performFailover(), after the line
pm.mergeTransactions(failedNodeID.intValue(), thisNodeID);
At this point the failover thread is just about to obtain the post office write lock.
3 Create a consumer (it must be a new subscriber) on node1 and let the calling thread
proceed to MessagingPostOffice.addBindingInMemory(), stopping before the line:
clusterNotifier.sendNotification(notification);
The ClusterConnectionManager is the listener that handles the notification, so this call
will acquire the lock on ClusterConnectionManager (its notify() method is synchronized).
4 Then resume step 2. The failover thread will grab the write lock and proceed. Along the
way, pause it again at MessagingPostOffice.removeBindingInMemory(), before the line:
clusterNotifier.sendNotification(notification);
Now this failover thread holds the write lock of the post office and is about to get the
lock on ClusterConnectionManager (because it is the listener to handle the notification in
its synchronized notify() method).
5 Resume step 3. The consumer-creating thread grabs the lock on ClusterConnectionManager.
As it goes on, it calls MessagingPostOffice.getBindings(), which needs the read lock of
MessagingPostOffice. However, at this moment the failover thread is holding the write
lock, so the consumer-creating thread cannot get the read lock and has to wait for it.
6 Now resume step 4. Because the consumer-creating thread already acquired the lock on
ClusterConnectionManager in step 5, the failover thread cannot proceed. Meanwhile the
failover thread holds the write lock and never releases it, so the consumer-creating
thread will never get the read lock. The two threads are deadlocked.
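The lock-order inversion in steps 1 to 6 can be sketched outside JBoss Messaging as
follows. This is an illustration under assumptions, not the real code: the class and
variable names are hypothetical, and a ReentrantLock stands in for the synchronized
notify() monitor of ClusterConnectionManager so that the second acquisition can time out
instead of hanging forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderInversionDemo {
    /** Returns true if both threads failed to take their second lock. */
    static boolean bothThreadsBlocked() {
        // Stand-in for the post office read/write lock.
        final ReentrantReadWriteLock postOfficeLock = new ReentrantReadWriteLock();
        // Stand-in for the ClusterConnectionManager monitor.
        final ReentrantLock managerMonitor = new ReentrantLock();
        final CountDownLatch firstLocksHeld = new CountDownLatch(2);
        final CountDownLatch attemptsDone = new CountDownLatch(2);
        final AtomicBoolean consumerGotRead = new AtomicBoolean(false);
        final AtomicBoolean failoverGotMonitor = new AtomicBoolean(false);

        // Consumer-creating thread: manager monitor first, then read lock.
        Thread consumer = new Thread(() -> {
            managerMonitor.lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await(); // wait until the other thread holds its lock
                if (postOfficeLock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    consumerGotRead.set(true);
                    postOfficeLock.readLock().unlock();
                }
                attemptsDone.countDown();
                attemptsDone.await(); // keep the first lock until both attempts end
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                managerMonitor.unlock();
            }
        });

        // Failover thread: write lock first, then manager monitor.
        Thread failover = new Thread(() -> {
            postOfficeLock.writeLock().lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                if (managerMonitor.tryLock(200, TimeUnit.MILLISECONDS)) {
                    failoverGotMonitor.set(true);
                    managerMonitor.unlock();
                }
                attemptsDone.countDown();
                attemptsDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                postOfficeLock.writeLock().unlock();
            }
        });

        consumer.start();
        failover.start();
        try {
            consumer.join();
            failover.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !consumerGotRead.get() && !failoverGotMonitor.get();
    }
}
```

With plain lock() calls and a synchronized method in place of the timed attempts, neither
thread would ever give up, which corresponds to the hang described in step 6.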
--
This message is automatically generated by JIRA.
For more information on JIRA, see: