Hello,
First, and perhaps completely unrelated: Is it normal to see messages such as
"additional data: 19 bytes" throughout the logs?
Moving on...
I have a two-node cluster of stock 4.0.5GA servers. After roughly 4 hours of operation
one node will fail with an OutOfMemoryError stemming from org.jgroups.protocols.UDP. Both
servers have two eth interfaces, so I set bind_addr on the UDP element accordingly in
cluster-service.xml and jboss-service.xml in the tc5-cluster sar.
I enabled DEBUG for jgroups. It seems to get pretty messy. First, node2 stops
ack'ing on are-you-alive messages. Then node1 gets susptected, but for no apparent
reason. If I understand correctly, node1 is the coord, so node2 can't remove it and
it will refuse to remove itself from the view. It may, however, opt to leave and rejoin.
Below is an excerpt from the cluster log file from around the time things begin to go
awry. Any hints are greatly appreciated.
| 2007-04-25 15:33:07,237 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to
node2:32802 (own address=node1:32839)
| 2007-04-25 15:33:07,269 DEBUG [org.jgroups.protocols.UDP]
| sending msgs:
| node2:32802: 1 msgs
|
| 2007-04-25 15:33:07,284 DEBUG [org.jgroups.protocols.FD] received ack from
node2:32802
| 2007-04-25 15:33:07,316 DEBUG [org.jgroups.protocols.UDP]
| sending msgs:
| node2:32802: 1 msgs
|
| 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to
node2:32802 (own address=node1:32839)
| 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from
node2:32802 (number=0)
| 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to
node2:32805 (additional data: 19 bytes) (own address=node1:32842 (addit
| ional data: 19 bytes))
| 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from
node2:32805 (additional data: 19 bytes) (number=0)
| 2007-04-25 15:34:51,767 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD:
SUSPECT (suspected_mbrs=[node1:32842 (additional data: 19 bytes)], fro
| m=node2:32805 (additional data: 19 bytes))]
| 2007-04-25 15:34:51,767 WARN [org.jgroups.protocols.FD] I was suspected, but will not
remove myself from membership (waiting for EXIT message)
| 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task
started; num_gossip_runs=3, max_gossip_runs=3
| 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl]
view=[node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19
| bytes)]
| 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.GMS]
[local_addr=node1:32842 (additional data: 19 bytes)] view is [node2:32805 (additional
data:
| 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
| 2007-04-25 15:34:51,780 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion()
failed, node1:32842 (additional data: 19 bytes) is not a member of view
| [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19
bytes)]; discarding view
| 2007-04-25 15:34:51,781 WARN [org.jgroups.protocols.pbcast.GMS] I (node1:32842
(additional data: 19 bytes)) am being shunned, will leave and rejoin group (p
| rev_members are [node1:32842 (additional data: 19 bytes) node2:32805 (additional data:
19 bytes) ])
| 2007-04-25 15:34:51,781 INFO [org.jgroups.JChannel] received an EXIT event, will
leave the channel
| 2007-04-25 15:34:51,783 INFO [org.jgroups.JChannel] closing the channel
| 2007-04-25 15:34:51,786 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional
data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation fai
| led, stack trace=java.lang.OutOfMemoryError: heap allocation failed
| at java.net.PlainDatagramSocketImpl.receive0(Native Method)
| at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
| at java.net.DatagramSocket.receive(DatagramSocket.java:724)
| at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
| at java.lang.Thread.run(Thread.java:799)
|
| 2007-04-25 15:34:51,790 ERROR [org.jgroups.protocols.UDP] [node1:32839]
exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.O
| utOfMemoryError: heap allocation failed
| at java.net.PlainDatagramSocketImpl.receive0(Native Method)
| at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
| at java.net.DatagramSocket.receive(DatagramSocket.java:724)
| at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
| at java.lang.Thread.run(Thread.java:799)
|
| 2007-04-25 15:34:51,795 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for
node1:32842 (additional data: 19 bytes):
|
| sent_msgs: [6837 - 6890]
| received_msgs:
| node2:32805 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [276 -
328]
| node1:32842 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [6838 -
6890]
|
| 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] socket to node2:32805
(additional data: 19 bytes) was reset
| 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread
terminated
| 2007-04-25 15:34:51,825 DEBUG [org.jgroups.protocols.UDP]
| sending msgs:
| node1:32839: 1 msgs
|
| 2007-04-25 15:34:52,092 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional
data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation fai
| led, stack trace=java.lang.OutOfMemoryError: heap allocation failed
| at java.net.PlainDatagramSocketImpl.receive0(Native Method)
| at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
| at java.net.DatagramSocket.receive(DatagramSocket.java:724)
| at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
| at java.lang.Thread.run(Thread.java:799)
|
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4041136#...
Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&a...