[jboss-user] [Clustering/JBoss] - Two-Node Cluster UDP OutOfMemoryError

jowizzle do-not-reply at jboss.com
Thu Apr 26 14:40:40 EDT 2007


Hello,

First, a possibly unrelated question: is it normal to see "additional data: 19 bytes" appended to member addresses throughout the logs?

Moving on...

I have a two-node cluster of stock JBoss 4.0.5.GA servers.  After roughly 4 hours of operation, one node fails with an OutOfMemoryError thrown from org.jgroups.protocols.UDP.  Both servers have two Ethernet interfaces, so I set bind_addr on the UDP element in both cluster-service.xml and the jboss-service.xml of the tc5-cluster SAR.
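To double-check which address each box would actually offer for bind_addr, a small standalone check like the following can be run on both nodes. This is only a sketch using the standard java.net API; the class name ListBindCandidates and the method candidates() are made up for illustration and are not part of JBoss or JGroups.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: lists non-loopback IPv4 addresses per interface,
// i.e. the values one might consider for bind_addr on a multi-homed host.
public class ListBindCandidates {

    // Returns entries of the form "eth0 -> 192.168.1.10"; empty list on error.
    static List<String> candidates() {
        List<String> out = new ArrayList<String>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return out; // no interfaces reported by the JVM
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    // Skip IPv6 and loopback addresses.
                    if (addr instanceof Inet4Address && !addr.isLoopbackAddress()) {
                        out.add(nic.getName() + " -> " + addr.getHostAddress());
                    }
                }
            }
        } catch (SocketException e) {
            // Interface enumeration failed; report nothing rather than guess.
        }
        return out;
    }

    public static void main(String[] args) {
        for (String c : candidates()) {
            System.out.println(c);
        }
    }
}
```

If the address configured as bind_addr does not show up in this list on a node, the setting is probably not taking effect there.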

I enabled DEBUG logging for JGroups, and the output gets pretty messy.  First, node2 stops acking are-you-alive messages.  Then node1 gets suspected, for no apparent reason.  If I understand correctly, node1 is the coordinator, so node2 can't remove it, and node1 will refuse to remove itself from the view.  It may, however, opt to leave and rejoin.
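Since the DEBUG output is so noisy, a rough filter like the one below can reduce the cluster log to just the events in that sequence (missed heartbeat, suspicion, shunning, OutOfMemoryError). It is a minimal sketch that assumes the log4j layout seen in my cluster log (timestamp, level, logger in brackets, message); the class name ClusterLogTimeline and the event labels are my own and not anything defined by JGroups.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log filter: prints "timestamp LABEL" for the interesting lines.
public class ClusterLogTimeline {

    // Groups: 1 = timestamp, 2 = level, 3 = logger name, 4 = message text.
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})\\s+(\\w+)\\s+\\[([^\\]]+)\\]\\s?(.*)$");

    // Returns "timestamp LABEL" for a recognised event, or null otherwise.
    static String event(String rawLine) {
        // Drop the "  | " prefix the forum adds to quoted code, if present.
        String line = rawLine.replaceFirst("^\\s*\\|\\s?", "");
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // continuation line, stack frame, or blank
        }
        String msg = m.group(4);
        String label;
        if (msg.contains("OutOfMemoryError")) {
            label = "OOM";
        } else if (msg.contains("heartbeat missing")) {
            label = "HEARTBEAT_MISSING";
        } else if (msg.contains("[SUSPECT]") || msg.contains("I was suspected")) {
            label = "SUSPECT";
        } else if (msg.contains("am being shunned")) {
            label = "SHUNNED";
        } else {
            return null;
        }
        return m.group(1) + " " + label;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String e = event(line);
            if (e != null) {
                System.out.println(e);
            }
        }
    }
}
```

Piping the log file into it on standard input gives a short timeline that makes the order of events easier to compare between the two nodes.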

Below is an excerpt from the cluster log file from around the time things begin to go awry.  Any hints are greatly appreciated.


  | 2007-04-25 15:33:07,237 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
  | 2007-04-25 15:33:07,269 DEBUG [org.jgroups.protocols.UDP] 
  | sending msgs:
  | node2:32802: 1 msgs
  | 
  | 2007-04-25 15:33:07,284 DEBUG [org.jgroups.protocols.FD] received ack from node2:32802
  | 2007-04-25 15:33:07,316 DEBUG [org.jgroups.protocols.UDP] 
  | sending msgs:
  | node2:32802: 1 msgs
  | 
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32802 (number=0)
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32805 (additional data: 19 bytes) (own address=node1:32842 (additional data: 19 bytes))
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32805 (additional data: 19 bytes) (number=0)
  | 2007-04-25 15:34:51,767 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node1:32842 (additional data: 19 bytes)], from=node2:32805 (additional data: 19 bytes))]
  | 2007-04-25 15:34:51,767 WARN  [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
  | 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  | 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
  | 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node1:32842 (additional data: 19 bytes)] view is [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
  | 2007-04-25 15:34:51,780 WARN  [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, node1:32842 (additional data: 19 bytes) is not a member of view [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]; discarding view
  | 2007-04-25 15:34:51,781 WARN  [org.jgroups.protocols.pbcast.GMS] I (node1:32842 (additional data: 19 bytes)) am being shunned, will leave and rejoin group (prev_members are [node1:32842 (additional data: 19 bytes) node2:32805 (additional data: 19 bytes) ])
  | 2007-04-25 15:34:51,781 INFO  [org.jgroups.JChannel] received an EXIT event, will leave the channel
  | 2007-04-25 15:34:51,783 INFO  [org.jgroups.JChannel] closing the channel
  | 2007-04-25 15:34:51,786 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
  |         at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  |         at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
  |         at java.net.DatagramSocket.receive(DatagramSocket.java:724)
  |         at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
  |         at java.lang.Thread.run(Thread.java:799)
  | 
  | 2007-04-25 15:34:51,790 ERROR [org.jgroups.protocols.UDP] [node1:32839] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
  |         at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  |         at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
  |         at java.net.DatagramSocket.receive(DatagramSocket.java:724)
  |         at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
  |         at java.lang.Thread.run(Thread.java:799)
  | 
  | 2007-04-25 15:34:51,795 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node1:32842 (additional data: 19 bytes):
  | 
  | sent_msgs: [6837 - 6890]
  | received_msgs:
  | node2:32805 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [276 - 328]
  | node1:32842 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [6838 - 6890]
  | 
  | 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] socket to node2:32805 (additional data: 19 bytes) was reset
  | 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
  | 2007-04-25 15:34:51,825 DEBUG [org.jgroups.protocols.UDP] 
  | sending msgs:
  | node1:32839: 1 msgs
  | 
  | 2007-04-25 15:34:52,092 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
  |         at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  |         at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
  |         at java.net.DatagramSocket.receive(DatagramSocket.java:724)
  |         at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
  |         at java.lang.Thread.run(Thread.java:799)
  | 

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4041136#4041136

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=4041136


