Bela Ban commented on JGRP-692:
-------------------------------
Yes, 2.6.3 has not yet been released. I wanted to wait for *this* issue to get fixed. Is there
any chance you could test this (CVS checkout), in staging at least?
Otherwise, if 2.6.3 doesn't fix your issue, we'd have to wait until 2.6.4...
I don't know yet when I'm going to cut 2.6.3. The 2.6 branch is currently the
stable branch for JBoss AS 5, and we're backporting all the important bug fixes from
trunk... It's going to be a couple of weeks at least.
Unable to recover from suspect/merge, with auto-reconnect
---------------------------------------------------------
Key: JGRP-692
URL:
http://jira.jboss.com/jira/browse/JGRP-692
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.1
Environment: JBoss 4.2.2 on Linux 2.4.21-47.0.1.ELsmp i686 using Java HotSpot(TM)
Server VM (build 1.5.0_10-b03, mixed mode)
Reporter: Matt Magoffin
Assigned To: Bela Ban
Fix For: 2.6.3
Attachments: jgroups-logs.tbz2
I'm having an issue with a 2-machine cluster using a TCP stack
based on the tcp.xml from JGroups 2.6.1. On each machine I have 8 separate
channels running, on different ports, with 4 groups in 2 JVM instances.
After some period of time, one machine will fail to respond to an FD ping
and gets suspected. The machine that failed is apparently not responding in
time because of high CPU use, and many of the channels will fail FD around the
same time. The channels are configured with auto-reconnect. My
understanding was that the channel should "heal" itself and eventually
re-form into a new view with the same 2 members in the cluster, which
should apply to this situation because the machine that failed to respond
eventually will respond.
However, the group does not always seem to "heal" (sometimes it does,
sometimes not). Once it stops healing, it never seems to do so again,
and I get tons of NAKACK "message X not found in retransmission table"
ERROR logs. The only way to get the channel working again is to shut down
the channel on both machines and then start them up again.
I'm not using muxed channels... just normal channels.
I have for now disabled shunning, and the channels seem to be able to
re-connect after a shun situation occurs, but after some time I'm still seeing
something wrong with the channel, in that the nodes are not able to send
messages to each other successfully, and I have tons of
2008-02-21 21:01:53,572 WARN [org.jgroups.protocols.pbcast.NAKACK]
172.16.172.233:19182] discarded message from non-member
172.16.172.234:19182, my view is [172.16.172.233:19182|8]
[172.16.172.233:19182]
log entries, even while FD is receiving acks on that channel.
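To make it easier to see how often this happens and which view the receiver
is stuck on, the warning can be pulled apart with a small parser. This is only
a rough sketch, not part of JGroups: the class name NonMemberWarning and its
fields are made up for illustration, and the pattern assumes the message text
looks exactly like the log entry quoted above, joined onto a single line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a JGroups class): parses the NAKACK
// "discarded message from non-member" warning shown in the log above.
public class NonMemberWarning {
    // Assumed message layout, taken from the quoted log entry:
    // ... discarded message from non-member <sender>, my view is [<coord>|<id>] [<members>]
    private static final Pattern WARNING = Pattern.compile(
        "discarded message from non-member (\\S+), my view is "
        + "\\[(\\S+)\\|(\\d+)\\] \\[([^\\]]*)\\]");

    public final String sender;        // address whose message was discarded
    public final String coordinator;   // first part of the view id
    public final long viewId;          // numeric part of the view id
    public final List<String> members; // members listed in the receiver's view

    private NonMemberWarning(String sender, String coordinator,
                             long viewId, List<String> members) {
        this.sender = sender;
        this.coordinator = coordinator;
        this.viewId = viewId;
        this.members = members;
    }

    // Returns the parsed warning, or null if the line is some other log entry.
    public static NonMemberWarning parse(String line) {
        Matcher m = WARNING.matcher(line);
        if (!m.find()) {
            return null;
        }
        List<String> members = Arrays.asList(m.group(4).split(",\\s*"));
        return new NonMemberWarning(m.group(1), m.group(2),
                                    Long.parseLong(m.group(3)), members);
    }

    // True when the receiver's view lists a single member, i.e. the
    // receiver apparently still considers itself alone in the group.
    public boolean isSingletonView() {
        return members.size() == 1;
    }

    public static void main(String[] args) {
        String line = "172.16.172.233:19182] discarded message from non-member "
            + "172.16.172.234:19182, my view is [172.16.172.233:19182|8] "
            + "[172.16.172.233:19182]";
        NonMemberWarning w = parse(line);
        System.out.println("sender=" + w.sender + " viewId=" + w.viewId
            + " members=" + w.members + " singleton=" + w.isSingletonView());
    }
}
```

Running something like this over the attached logs would show whether the view
id ever advances after the first warning, or whether the receiver stays on the
same single-member view.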
This is a degradation problem for me while my servers are running,
not just during deployment. After running for a while, the nodes get
shunned/disconnected somehow (often from a slow response from the other
node) and then fail to ever merge back to form the original view of the 2
nodes again.
Now even with shun set to false in both FD and pbcast.GMS on both nodes
for all channels, the nodes still eventually reach the same state of never
re-forming... and I eventually have tons of
2008-02-21 21:28:24,857 WARN [org.jgroups.protocols.pbcast.NAKACK]
172.16.172.233:19182] discarded message from non-member
172.16.172.234:19182, my view is [172.16.172.233:19182|8]
[172.16.172.233:19182]
After a while, in my application I'll force the channel to close, wait a
minute, and reopen, and sometimes this gets the channel working again, but
it doesn't always seem to.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: