[jboss-jira] [JBoss JIRA] Reopened: (JGRP-1003) deadlock in TP.send - socket write hangs inside a Lock, preventing other threads to do TP.send()
Victor N (JIRA)
jira-events at lists.jboss.org
Tue Feb 9 04:46:19 EST 2010
[ https://jira.jboss.org/jira/browse/JGRP-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Victor N reopened JGRP-1003:
----------------------------
Bela, the problem was reproduced with 2.8.0.CR6 and the following config:
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">
    <TCP bind_addr="${bindAddress}"
         bind_port="40001"
         loopback="true"
         recv_buf_size="20000000"
         send_buf_size="640000"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         skip_suspected_members="true"
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="25"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="1000"
         oob_thread_pool.rejection_policy="discard"/>
    <TCPPING timeout="3000"
             initial_hosts="${nodes}"
             port_range="1"
             num_initial_members="3"/>
    <MERGE2 min_interval="10000"
            max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="10000" max_tries="5"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <BARRIER/>
    <pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
                   retransmit_timeout="300,600,1200,2400,4800"
                   discard_delivered_msgs="true"/>
    <UNICAST timeout="300,600,1200"/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <FC max_credits="2000000"
        min_threshold="0.10"/>
    <FRAG2 frag_size="60000"/>
</config>
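(For context, a minimal sketch of how a config like this is typically loaded - the file name and cluster name below are placeholders, not taken from this report:)

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;

    public class Repro {
        public static void main(String[] args) throws Exception {
            // load the <config> above from a file (name is a placeholder)
            JChannel channel = new JChannel("tcp-repro.xml");
            channel.setReceiver(new ReceiverAdapter() {
                public void receive(Message msg) {
                    System.out.println("received from " + msg.getSrc());
                }
            });
            channel.connect("test-cluster"); // placeholder cluster name
            channel.send(new Message(null, null, "ping")); // null dest = all members
        }
    }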
I have 10 nodes in the cluster: server3 ... server12.
The problem is still the same - please look at the new stack traces (jgroups_stacktrace_09feb2010.txt). Threads are blocked on a full queue, so JGroups cannot send messages (including MERGE). The stack trace was taken on server8 - all other nodes were working perfectly except server8.
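(To illustrate "blocked on a full queue" - a toy sketch, not JGroups code: with use_send_queues="true" each connection's messages go through a bounded queue drained by one writer thread, so a writer stuck in a socket write eventually blocks every producer:)

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class FullQueueDemo {
        public static void main(String[] args) throws InterruptedException {
            final BlockingQueue<byte[]> sendQueue = new ArrayBlockingQueue<byte[]>(2);

            Thread writer = new Thread(new Runnable() {
                public void run() {
                    try {
                        sendQueue.take(); // drain one message, then hang
                        Thread.sleep(Long.MAX_VALUE); // stands in for a stuck socketWrite0()
                    } catch (InterruptedException ignored) {}
                }
            });
            writer.setDaemon(true);
            writer.start();

            for (int i = 1; i <= 4; i++) {
                // put() would block forever once the queue is full; a timed offer()
                // lets the demo report the stall instead of hanging itself
                boolean queued = sendQueue.offer(new byte[64], 500, TimeUnit.MILLISECONDS);
                System.out.println("message " + i + (queued ? " queued" : " stuck: queue full, writer hung"));
            }
        }
    }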
From the log file on server8:
08/02/2010 20:23:54| ERROR | org.jgroups.protocols.TP.down(): failed sending message to server10 (182 bytes): java.net.SocketTimeoutException: connect timed out
08/02/2010 20:23:56| WARN | org.jgroups.protocols.TP.sendToSingleMember(): server8: no physical address for server10, dropping message
08/02/2010 20:24:06| WARN | org.jgroups.protocols.pbcast.GMS.installView(): server8: not member of view [server3|460] [server3, server6, server5, server10, server11, server9, server12, server7, server4]; discarding it
08/02/2010 20:41:03| WARN | org.jgroups.protocols.pbcast.GMS.installView(): server8: not member of view [server3|462] [server3, server6, server5, server10, server11, server9, server12, server7, server4]; discarding it
08/02/2010 20:55:08| WARN | org.jgroups.protocols.pbcast.GMS.installView(): server8: not member of view [server3|464] [server3, server6, server5, server10, server11, server9, server12, server7, server4]; discarding it
08/02/2010 20:55:22| WARN | org.jgroups.protocols.pbcast.GMS.castViewChangeWithDest(): server8: failed to collect all ACKs (expected=10) for view MergeView::[server3|465] [server3, server6, server5, server8, server10, server11, server9, server12, server7, server4], subgroups=[[server3|464] [server3, server6, server5, server10, server11, server9, server12, server7, server4], [server3|463] [server8]] after 2000ms, missing ACKs from [server10, server9]
08/02/2010 21:10:44| WARN | org.jgroups.protocols.pbcast.GMS.castViewChangeWithDest(): server8: failed to collect all ACKs (expected=10) for view MergeView::[server3|467] [server3, server6, server5, server8, server10, server11, server9, server12, server7, server4], subgroups=[[server3|466] [server3, server6, server5, server10, server11, server9, server12, server7, server4], [server3|465] [server8]] after 2000ms, missing ACKs from [server3, server6, server5, server10, server11, server9, server12, server7, server4]
08/02/2010 21:26:30| WARN | org.jgroups.protocols.pbcast.Merger.handleMergeRequest(): server8: merge is already in progress
08/02/2010 21:26:55| WARN | org.jgroups.protocols.pbcast.Merger.handleMergeRequest(): server8: merge is already in progress
...(again and again - merge is already in progress)...
At the same time on server3 (it was the coordinator) - its system clock differs slightly, I checked:
08/02/2010 20:55:38| WARN | org.jgroups.protocols.pbcast.NAKACK.handleMessage(): server3: dropped message from server8 (not in retransmission table), members are [server11, server10, server7, server12, server3, server5, server9, server6, server4], view=[server3|464] [server3, server6, server5, server10, server11, server9, server12, server7, server4]
08/02/2010 21:11:01| WARN | org.jgroups.protocols.pbcast.NAKACK.handleMessage(): server3: dropped message from server8 (not in retransmission table), members are [server11, server10, server7, server12, server3, server5, server9, server6, server4], view=[server3|466] [server3, server6, server5, server10, server11, server9, server12, server7, server4]
08/02/2010 21:26:23| WARN | org.jgroups.protocols.pbcast.NAKACK.handleMessage(): server3: dropped message from server8 (not in retransmission table), members are [server11, server10, server7, server12, server3, server5, server9, server6, server4], view=[server3|468] [server3, server6, server5, server10, server11, server9, server12, server7, server4]
08/02/2010 21:26:28| WARN | org.jgroups.protocols.pbcast.Merger$MergeTask.run(): server3: merge leader did not get data from all partition coordinators [server8, server3], merge is cancelled
08/02/2010 21:27:15| ERROR | org.jgroups.protocols.pbcast.Merger.handleMergeResponse(): server3: this.merge_id (server3::21) is different from merge_id (server3::19)
08/02/2010 21:27:19| WARN | org.jgroups.protocols.pbcast.Merger$MergeTask.run(): server3: merge leader did not get data from all partition coordinators [server8, server3], merge is cancelled
08/02/2010 21:27:45| WARN | org.jgroups.protocols.pbcast.Merger$MergeTask.run(): server3: merge leader did not get data from all partition coordinators [server8, server3], merge is cancelled
08/02/2010 21:28:31| ERROR | org.jgroups.protocols.pbcast.Merger.handleMergeResponse(): server3: this.merge_id (server3::24) is different from merge_id (server3::20)
And on server5 (for comparison):
08/02/2010 20:55:16| WARN | org.jgroups.protocols.pbcast.NAKACK.handleMessage(): server5: dropped message from server8 (not in retransmission table), members are [server11, server10, server7, server12, server3, server5, server9, server6, server4], view=[server3|464] [server3, server6, server5, server10, server11, server9, server12, server7, server4]
08/02/2010 21:10:38| WARN | org.jgroups.protocols.pbcast.NAKACK.handleMessage(): server5: dropped message from server8 (not in retransmission table), members are [server11, server10, server7, server12, server3, server5, server9, server6, server4], view=[server3|466] [server3, server6, server5, server10, server11, server9, server12, server7, server4]
08/02/2010 21:26:01| WARN | org.jgroups.protocols.pbcast.NAKACK.handleMessage(): server5: dropped message from server8 (not in retransmission table), members are [server11, server10, server7, server12, server3, server5, server9, server6, server4], view=[server3|468] [server3, server6, server5, server10, server11, server9, server12, server7, server4]
> deadlock in TP.send - socket write hangs inside a Lock, preventing other threads to do TP.send()
> ------------------------------------------------------------------------------------------------
>
> Key: JGRP-1003
> URL: https://jira.jboss.org/jira/browse/JGRP-1003
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.7
> Environment: Linux (Debian), kernel 2.6.x
> Reporter: Victor N
> Assignee: Bela Ban
> Fix For: 2.8
>
> Attachments: j1__.txt, jgroups_s2.txt, jgroups_stacktrace_09feb2010.txt, stacktrace.txt
>
>
> I am using JGroups 2.7.0 GA with the typical protocol stack based on TCP (taken from tcp.xml).
> Sometimes a socket write operation hangs inside TP.send() -- not so unusual with a blocking I/O approach! -- and after that JGroups stops working entirely, because most threads are blocked inside TP.send(), waiting for the first thread to release "out_stream_lock". Below is a fragment from my stack trace.
> 1) The thread that hangs in the socket write forever (maybe due to a broken socket or something else); this thread owns "out_stream_lock" in the TP class:
> "Timer-1,name,IP:port" daemon prio=10 tid=0x084e5c00 nid=0x15f0 runnable [0x211c1000..0x211c2040]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> - locked <0x43427300> (a java.io.BufferedOutputStream)
> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> at org.jgroups.blocks.TCPConnectionMap$TCPConnection.doSend(TCPConnectionMap.java:472)
> at org.jgroups.blocks.TCPConnectionMap$TCPConnection._send(TCPConnectionMap.java:453)
> at org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(TCPConnectionMap.java:436)
> at org.jgroups.blocks.TCPConnectionMap$TCPConnection.access$100(TCPConnectionMap.java:341)
> at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:137)
> at org.jgroups.protocols.TCP.send(TCP.java:53)
> at org.jgroups.protocols.BasicTCP.sendToSingleMember(BasicTCP.java:141)
> at org.jgroups.protocols.TP.doSend(TP.java:1105)
> at org.jgroups.protocols.TP.send(TP.java:1088)
> at org.jgroups.protocols.TP.down(TP.java:907)
> at org.jgroups.protocols.Discovery.down(Discovery.java:363)
> at org.jgroups.protocols.MERGE2.down(MERGE2.java:169)
> at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:333)
> at org.jgroups.protocols.FD.down(FD.java:327)
> at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:72)
> at org.jgroups.protocols.BARRIER.down(BARRIER.java:96)
> at org.jgroups.protocols.pbcast.NAKACK.retransmit(NAKACK.java:1530)
> at org.jgroups.protocols.pbcast.NAKACK.retransmit(NAKACK.java:1476)
> at org.jgroups.stack.Retransmitter$Task.run(Retransmitter.java:207)
> at org.jgroups.util.TimeScheduler$TaskWrapper.run(TimeScheduler.java:218)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> Locked ownable synchronizers:
> - <0x2b5522e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> - <0x2b5550f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> - <0x434273f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
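> (A toy sketch of the pattern this trace shows - not the actual JGroups source: a single shared lock is held across a blocking flush, so one dead peer stalls every sender:)
>
>     import java.io.DataOutputStream;
>     import java.io.IOException;
>     import java.util.concurrent.locks.ReentrantLock;
>
>     class LockedSender {
>         private final ReentrantLock out_stream_lock = new ReentrantLock();
>         private final DataOutputStream out; // socket stream of one connection
>
>         LockedSender(DataOutputStream out) { this.out = out; }
>
>         void send(byte[] data) throws IOException {
>             out_stream_lock.lock();
>             try {
>                 out.write(data);
>                 out.flush(); // may block in socketWrite0() forever - with the lock held
>             } finally {
>                 out_stream_lock.unlock();
>             }
>         }
>     }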
> 2) And there are many threads like this one, blocked on the out_stream_lock.lock() call inside the TP class:
> "Connection.Receiver [ip:port - ip:port],name,ip:port" prio=10 tid=0x020dc000 nid=0x1fab waiting on condition [0x1365f000..0x1365ffc0]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x2b5550f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
> at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
> at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
> at org.jgroups.protocols.TP.send(TP.java:1082)
> at org.jgroups.protocols.TP.down(TP.java:907)
> at org.jgroups.protocols.Discovery.up(Discovery.java:274)
> at org.jgroups.protocols.TP.passMessageUp(TP.java:995)
> at org.jgroups.protocols.TP.access$100(TP.java:52)
> at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1451)
> at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1427)
> at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:1738)
> at org.jgroups.util.ShutdownRejectedExecutionHandler.rejectedExecution(ShutdownRejectedExecutionHandler.java:34)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at org.jgroups.protocols.TP.dispatchToThreadPool(TP.java:1061)
> at org.jgroups.protocols.TP.receive(TP.java:1038)
> at org.jgroups.protocols.BasicTCP.receive(BasicTCP.java:180)
> at org.jgroups.blocks.TCPConnectionMap$TCPConnection$ConnectionPeerReceiver.run(TCPConnectionMap.java:553)
> at java.lang.Thread.run(Thread.java:619)
>
> Locked ownable synchronizers:
> - None
> (Since all threads of JGroups' thread pool are blocked, you can see "rejectedExecution" in the last stack trace.)
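> (That rejection path is easy to reproduce in isolation - a sketch using only java.util.concurrent: when the pool is saturated, CallerRunsPolicy executes the task on the submitting thread, which is how the receiver thread above ends up inside TP.send():)
>
>     import java.util.concurrent.SynchronousQueue;
>     import java.util.concurrent.ThreadPoolExecutor;
>     import java.util.concurrent.TimeUnit;
>
>     public class CallerRunsDemo {
>         public static void main(String[] args) {
>             ThreadPoolExecutor pool = new ThreadPoolExecutor(
>                     1, 1, 0L, TimeUnit.MILLISECONDS,
>                     new SynchronousQueue<Runnable>(),
>                     new ThreadPoolExecutor.CallerRunsPolicy());
>
>             // occupy the only worker, like the blocked JGroups pool threads
>             pool.execute(new Runnable() {
>                 public void run() {
>                     try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
>                 }
>             });
>
>             // rejected by the saturated pool: CallerRunsPolicy runs the task
>             // right here on the submitting thread - this prints "main"
>             pool.execute(new Runnable() {
>                 public void run() {
>                     System.out.println("ran on " + Thread.currentThread().getName());
>                 }
>             });
>             pool.shutdown();
>         }
>     }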
> Maybe we could implement some solution for TCP (blocking I/O)? Maybe it is possible to release the lock before the socket write?
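> (One possible shape of that idea - a sketch assuming the global lock only needs to guard the shared marshalling buffer; this is a proposal, not the actual fix, and the Marshallable interface is made up for the example:)
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.DataOutputStream;
>     import java.io.IOException;
>     import java.util.concurrent.locks.ReentrantLock;
>
>     class UnlockedSender {
>         private final ReentrantLock out_stream_lock = new ReentrantLock();
>         private final ByteArrayOutputStream out_stream = new ByteArrayOutputStream();
>         private final DataOutputStream sock_out; // one connection's socket stream
>
>         UnlockedSender(DataOutputStream sockOut) { this.sock_out = sockOut; }
>
>         void send(Marshallable msg) throws IOException {
>             byte[] buf;
>             out_stream_lock.lock();
>             try {
>                 out_stream.reset();
>                 msg.writeTo(new DataOutputStream(out_stream)); // shared buffer: lock needed
>                 buf = out_stream.toByteArray();
>             } finally {
>                 out_stream_lock.unlock(); // released *before* any network I/O
>             }
>             synchronized (sock_out) { // per-connection lock: a hung write now only
>                 sock_out.write(buf);  // stalls senders to this one member
>                 sock_out.flush();
>             }
>         }
>
>         interface Marshallable { void writeTo(DataOutputStream out) throws IOException; }
>     }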
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira