[jboss-user] [Clustering] - Cluster falls apart: FD_SOCK errors

kennardconsulting do-not-reply at jboss.com
Thu Oct 1 06:46:55 EDT 2009


Guys,

We are seeing these warnings on our cluster nodes sometimes after a redeploy. The primary cluster node stays up,
but the other 2 (we have 3) die and keep saying this.
 
2009-10-01 20:04:34,302 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:37,315 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:40,318 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:43,321 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:46,323 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:49,326 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:52,329 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
  | 2009-10-01 20:04:55,332 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
 
They keep saying this even after a restart. Only restarting the primary node fixes things. We can see 192.168.1.1:52187 is listening on the primary with netstat -panu
 
On that server, the logs say

192.168.1.2:37802 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.1.2:37802 could not be fetched, retrying.
 
However, on the 192.168.1.2 box, we can see 192.168.1.2:37802 is also listening.
 
We can ping and telnet to tcp ports on both boxes and between each box, and ipconfig shows no packet losses, collisions or other
networking errors.  Also a restart of JBoss on the primary fixes the problem.
 
Therefore the problem seems to be something within JBoss.
 
On the primary server there are a number of threads referring to port 52187 and none of these seem locked up (as far as I can see).
 
Here are some of the threads from the Primary:
 
This looks like the server socket:
 
"FD_SOCK server socket acceptor,192.168.1.1:52187" daemon prio=10 tid=0x0ac19800 nid=0x6281 runnable [0x66293000]
  |    java.lang.Thread.State: RUNNABLE
  |         at java.net.PlainSocketImpl.socketAccept(Native Method)
  |         at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
  |         - locked <0xab976ea0> (a java.net.SocksSocketImpl)
  |         at java.net.ServerSocket.implAccept(ServerSocket.java:453)
  |         at java.net.ServerSocket.accept(ServerSocket.java:421)
  |         at org.jgroups.protocols.FD_SOCK$ServerSocketHandler.run(FD_SOCK.java:1022)
  |         at java.lang.Thread.run(Thread.java:619)
 

This looks like the connection handler:
 
"FD_SOCK client connection handler,AvantProduction,192.168.1.1:52187" daemon prio=10 tid=0x097e8000 nid=0x1a6c runnable [0x62c
  | 71000]
  |    java.lang.Thread.State: RUNNABLE
  |         at java.net.SocketInputStream.socketRead0(Native Method)
  |         at java.net.SocketInputStream.read(SocketInputStream.java:129)
  |         at java.net.SocketInputStream.read(SocketInputStream.java:182)
  |         at org.jgroups.protocols.FD_SOCK$ClientConnectionHandler.run(FD_SOCK.java:1089)
  |         at java.lang.Thread.run(Thread.java:619)
 

And these are threads waiting in a pool by the look of it:
 
"Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on condition [0x68e1f000]
  |    java.lang.Thread.State: WAITING (parking)
  |         at sun.misc.Unsafe.park(Native Method)
  |         - parking to wait for  <0x8b246af0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  |         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
  |         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
  |         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
  |         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
  |         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
  |         at java.lang.Thread.run(Thread.java:619)
  |  
  | "Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting on condition [0x68e70000]
  |    java.lang.Thread.State: WAITING (parking)
  |         at sun.misc.Unsafe.park(Native Method)
  |         - parking to wait for  <0x8b246af0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  |         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
  |         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
  |         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
  |         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
  |         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
  |         at java.lang.Thread.run(Thread.java:619)
 

Here is a list of all threads referring to port 52187:
 
"VERIFY_SUSPECT.TimerThread,AvantProduction-HAPartitionCache,192.168.1.1:52187" daemon prio=10 tid=0x0b838400 nid=0x1d9d waiting on condition [0x5f1fe000]
  | "OOB-680,192.168.1.1:52187" prio=10 tid=0x73163400 nid=0x1d9c waiting on condition [0x66c48000]
  | "OOB-678,192.168.1.1:52187" prio=10 tid=0x6ef51800 nid=0x1c8e waiting on condition [0x66a62000]
  | "FD_SOCK client connection handler,AvantProduction,192.168.1.1:52187" daemon prio=10 tid=0x097e8000 nid=0x1a6c runnable [0x62c71000]
  | "FD_SOCK pinger,AvantProduction-HAPartitionCache,192.168.1.1:52187" daemon prio=10 tid=0x092ba400 nid=0x69b7 waiting on condition [0x5f15c000]
  | "ViewHandler,AvantProduction-HAPartitionCache,192.168.1.1:52187" prio=10 tid=0x0c48d400 nid=0x6506 waiting on condition [0x5f1ad000]
  | "FD_SOCK server socket acceptor,192.168.1.1:52187" daemon prio=10 tid=0x0ac19800 nid=0x6281 runnable [0x66293000]
  | "Incoming-20,192.168.1.1:52187" prio=10 tid=0x71df4400 nid=0x5f8d waiting on condition [0x68a02000]
  | "Incoming-19,192.168.1.1:52187" prio=10 tid=0x091ed400 nid=0x5f89 waiting on condition [0x68b46000]
  | "Incoming-18,192.168.1.1:52187" prio=10 tid=0x7161dc00 nid=0x5f88 waiting on condition [0x68b97000]
  | "Incoming-17,192.168.1.1:52187" prio=10 tid=0x091ecc00 nid=0x5f84 waiting on condition [0x68c8a000]
  | "Incoming-16,192.168.1.1:52187" prio=10 tid=0x72a4a800 nid=0x5f83 waiting on condition [0x68cdb000]
  | "Incoming-15,192.168.1.1:52187" prio=10 tid=0x091ca400 nid=0x5f80 waiting on condition [0x68dce000]
  | "Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on condition [0x68e1f000]
  | "Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting on condition [0x68e70000]
  | "Incoming-12,192.168.1.1:52187" prio=10 tid=0x08f7ec00 nid=0x5f7d waiting on condition [0x68ec1000]
  | "Incoming-11,192.168.1.1:52187" prio=10 tid=0x6d3f8400 nid=0x5f7c waiting on condition [0x68f12000]
  | "Incoming-10,192.168.1.1:52187" prio=10 tid=0x091c5800 nid=0x5f7b waiting on condition [0x68f63000]
  | "Incoming-9,192.168.1.1:52187" prio=10 tid=0x713f5800 nid=0x5f7a waiting on condition [0x68fb4000]
  | "Incoming-8,192.168.1.1:52187" prio=10 tid=0x09cbe000 nid=0x5f79 waiting on condition [0x69005000]
  | "Timer-12,192.168.1.1:52187" daemon prio=10 tid=0x713e5800 nid=0x5f78 waiting on condition [0x69056000]
  | "Timer-11,192.168.1.1:52187" daemon prio=10 tid=0x6cdf8800 nid=0x5f77 waiting on condition [0x690a7000]
  | "Timer-10,192.168.1.1:52187" daemon prio=10 tid=0x71471c00 nid=0x5f75 waiting on condition [0x69149000]
  | "Incoming-7,192.168.1.1:52187" prio=10 tid=0x09cbac00 nid=0x5f74 waiting on condition [0x6919a000]
  | "Incoming-6,192.168.1.1:52187" prio=10 tid=0x08f7f800 nid=0x5f72 waiting on condition [0x6923c000]
  | "Timer-9,192.168.1.1:52187" daemon prio=10 tid=0x08d8a800 nid=0x5f71 waiting on condition [0x6928d000]
  | "Timer-8,192.168.1.1:52187" daemon prio=10 tid=0x08fd1000 nid=0x5f6f waiting on condition [0x6932f000]
  | "Timer-7,192.168.1.1:52187" daemon prio=10 tid=0x0927f800 nid=0x5f6d waiting on condition [0x693d1000]
  | "Incoming-5,192.168.1.1:52187" prio=10 tid=0x091ae800 nid=0x5f6c waiting on condition [0x69422000]
  | "Incoming-4,192.168.1.1:52187" prio=10 tid=0x091ad400 nid=0x5f6b waiting on condition [0x69473000]
  | "Incoming-3,192.168.1.1:52187" prio=10 tid=0x08e98800 nid=0x5f6a waiting on condition [0x694c4000]
  | "Incoming-2,192.168.1.1:52187" prio=10 tid=0x08e97800 nid=0x5f69 waiting on condition [0x69515000]
  | "Incoming-1,192.168.1.1:52187" prio=10 tid=0x08ee5800 nid=0x5f68 waiting on condition [0x69566000]
  | "Timer-6,192.168.1.1:52187" daemon prio=10 tid=0x717d2800 nid=0x5f67 waiting on condition [0x695b7000]
  | "Timer-5,192.168.1.1:52187" daemon prio=10 tid=0x712ff000 nid=0x5f66 waiting on condition [0x69608000]
  | "Timer-4,192.168.1.1:52187" daemon prio=10 tid=0x711fd800 nid=0x5f64 waiting on condition [0x696aa000]
  | "Timer-3,192.168.1.1:52187" daemon prio=10 tid=0x09297800 nid=0x5f62 waiting on condition [0x696fb000]
  | "Timer-2,192.168.1.1:52187" daemon prio=10 tid=0x090a6800 nid=0x5f5f waiting on condition [0x697ee000]
  | "UDP mcast,192.168.1.1:52187" prio=10 tid=0x09956c00 nid=0x5f5e runnable [0x6983f000]
  | "UDP ucast,192.168.1.1:52187" prio=10 tid=0x09956800 nid=0x5f5d runnable [0x69890000]
  | "DiagnosticsHandler,192.168.1.1:52187" daemon prio=10 tid=0x08ee6c00 nid=0x5f5c runnable [0x698e1000]
  | "Timer-1,192.168.1.1:52187" daemon prio=10 tid=0x08e56c00 nid=0x5f5a waiting on condition [0x69983000]
 
 This is really hurting our up-time. Could anybody point us in the right direction?
 
 Regards,
 
 Richard.

View the original post : http://www.jboss.org/index.html?module=bb&op=viewtopic&p=4258102#4258102

Reply to the post : http://www.jboss.org/index.html?module=bb&op=posting&mode=reply&p=4258102



More information about the jboss-user mailing list