[
https://issues.jboss.org/browse/JGRP-1427?page=com.atlassian.jira.plugin....
]
Jay Guidos edited comment on JGRP-1427 at 2/26/12 4:05 PM:
-----------------------------------------------------------
Hi Bela,
I will attach the patch to the JIRA for BasicConnectionTable.
I have already reviewed the code for TCPConnectionMap and from a thread safety point of
view the code is of much higher quality than the code in the 2.4 branch. With
BasicConnectionTable, even with the patches I included there were a number of other
'questionable' bits of code - but since it is unsupported code there is not too
much to worry about.
Jay
Here is the patch:
Index: src/org/jgroups/blocks/BasicConnectionTable.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- src/org/jgroups/blocks/BasicConnectionTable.java (revision
17ba6e2fc6974567b0de666f2391d2dec24312d3)
+++ src/org/jgroups/blocks/BasicConnectionTable.java (revision )
@@ -17,7 +17,9 @@
import java.util.Map.Entry;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
@@ -718,38 +720,37 @@
class Sender implements Runnable {
- Thread senderThread;
- private boolean is_it_running=false;
+ AtomicReference<Thread> senderThread = new
AtomicReference<Thread>();
+ private AtomicBoolean is_it_running=new AtomicBoolean();
void start() {
- if(senderThread == null || !senderThread.isAlive()) {
- senderThread=getThreadFactory().newThread(thread_group,this,
"ConnectionTable.Connection.Sender local_addr=" + local_addr + " [" +
getSockAddress() + "]");
- senderThread.setDaemon(true);
- is_it_running=true;
- senderThread.start();
+ Thread localThread =
senderThread.getAndSet(getThreadFactory().newThread(thread_group, this,
"ConnectionTable.Connection.Sender local_addr=" + local_addr + " [" +
getSockAddress() + "]"));
+ if(localThread == null ) {
+ is_it_running.set(true);
+ senderThread.get().setDaemon(true);
+ senderThread.get().start();
if(log.isTraceEnabled())
log.trace("sender thread started: " + senderThread);
}
}
void stop() {
- is_it_running=false;
+ is_it_running.set(false);
if(send_queue != null)
send_queue.clear();
- if(senderThread != null) {
- Thread tmp=senderThread;
- senderThread=null;
- Util.interruptAndWaitToDie(tmp);
+ Thread localThread = senderThread.getAndSet(null);
+ if(localThread != null) {
+ Util.interruptAndWaitToDie(localThread);
}
}
boolean isRunning() {
- return is_it_running && senderThread != null;
+ return is_it_running.get() && senderThread.get() != null;
}
public void run() {
byte[] data;
- while(senderThread != null &&
senderThread.equals(Thread.currentThread()) && is_it_running) {
+ while(senderThread.get() != null &&
senderThread.equals(Thread.currentThread()) && is_it_running.get()) {
try {
data=send_queue.take();
if(data == null)
@@ -761,7 +762,7 @@
;
}
}
- is_it_running=false;
+ is_it_running.set(false);
if(log.isTraceEnabled())
log.trace("ConnectionTable.Connection.Sender thread
terminated");
}
was (Author:
jay.guidos-bidstrading.com):
Hi Bela,
I will attach the patch to the JIRA for BasicConnectionTable.
I have already reviewed the code for TCPConnectionMap and from a thread safety point of
view the code is of much higher quality than the code in the 2.4 branch. With
BasicConnectionTable, even with the patches I included there were a number of other
'questionable' bits of code - but since it is unsupported code there is not too
much to worry about.
Ja
Race conditions during TCP start up cause broken connections and
cluster-wide slow down
---------------------------------------------------------------------------------------
Key: JGRP-1427
URL:
https://issues.jboss.org/browse/JGRP-1427
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.5
Reporter: Jay Guidos
Assignee: Bela Ban
Fix For: 3.1
I actually found this on 2.4.1, but I can reproduce it on the Git master.
When starting our cluster of 18 nodes using TCP, perhaps one start in 20 results in a
node having very slow cluster response. Some trace logging eventually revealed that on
random occasions one of the TCP connections to a sister node would start and then
immediate terminate. From that point on, every cluster broadcast would be subject to the
response timeout (for us it was at 60 seconds).
The problem is in org.jgroups.blocks.BasicConnectionTable. There is a number of
non-threadsafe field reference updates during startup, but in particular in
BasicConnectionTable.Sender.start() the 'senderThread' field is updated in one
thread, and immediately referenced in the daughter thread's invocation of
BasicConnectionTable.Sender.run().
If the timing works out wrong, 'senderThread' is set in the L2 cache of the
PingSender thread, but not yet updated in the L2 cache used by
ConnectionTable.Connection.Sender, and hence the run() method exits immediately. Here is
the smoking gun from a trace log:
{noformat}
22:18:22,985 INFO [org.jgroups.blocks.ConnectionTable] {main} server socket created on
127.0.0.1:7800
22:18:22,997 DEBUG [org.jgroups.blocks.MessageDispatcher$ProtocolAdapter] {main} setting
local_addr (null) to 127.0.0.1:7800
22:18:23,013 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender}
ConnectionTable.Connection.Receiver started
22:18:23,013 INFO [org.jgroups.blocks.ConnectionTable] {PingSender} created socket to
127.0.0.1:7801
22:18:23,015 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender}
ConnectionTable.Connection.Sender thread started
22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender}
ConnectionTable.Connection.Receiver started
22:18:23,018 INFO [org.jgroups.blocks.ConnectionTable] {PingSender} created socket to
127.0.0.1:7802
22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable]
{ConnectionTable.Connection.Sender [127.0.0.1:44710 - 127.0.0.1:7802]}
ConnectionTable.Connection.Sender thread terminated
22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender}
ConnectionTable.Connection.Sender thread started
22:18:23,019 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender}
ConnectionTable.Connection.Receiver started
2
{noformat}
You can see that at 22:18:23,018 the PingSender thread created a Sender, and in the same
millisecond the ConnectionTable.Connection.Sender thread killed it off.
The fix is easy, just convert 'senderThread' to an AtomicReference. This is very
old code, I am not sure why we were the first ones to report this? Perhaps it is because
our application is heavily multithreaded and we run on servers with 16 physical CPU cores.
There is a very high chance that PingSender and ConnectionTable.Connection.Sender were not
bound to the same core, and hence their L2 caches had significant time intervals where
they were out of sync. There were a number of other sloppy references and updates to
variables that were used for thread control in BasicConnectionTable, I think this code
could benefit from a review for that kind of thing.
Cheers!
Jay Guidos
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see:
http://www.atlassian.com/software/jira