[jboss-jira] [JBoss JIRA] Updated: (JGRP-692) Unable to recover from suspect/merge, with auto-reconnect

Thu Feb 21 23:17:42 EST 2008

     [ http://jira.jboss.com/jira/browse/JGRP-692?page=all ]

Matt Magoffin updated JGRP-692:
-------------------------------

    Attachment: jgroups-logs.tbz2

Hopefully these 2 log files can be of help to you. There is one log from
each node, app1 and app2 (.233 and .234 IP addresses). There are several
channels running, all on different ports.

I've started both logs at about the same point in time, when app2 was shut
down and restarted in order to deploy some new code. This way you can see
how app1 responds to a normal exit by app2 from the group, and how app1
sees app2 come back up again (line 412 in app1 log, at 2008-02-21
18:47:32,014).

On line 752 in app1 at 2008-02-21 19:42:34,794, FD does not get acks, and
suspects app2.

On line 979 in app1 at 2008-02-21 19:43:49,829 there is a NAKACK discard
warning for a message from app2, followed by some

[org.jgroups.protocols.FD] Recevied Ack. is invalid (was from:
172.16.172.234:19282)

type stuff. Things then don't seem to work again on app1, but you can see
the application force one channel closed on line 6186
(2008-02-21 21:13:10,946), in an attempt to get the channel working again.
By line 6216 (2008-02-21 21:13:14,444) that reset channel is up and
working, and messages are getting sent to app2. But I still see a NAKACK
warning coming from this newly reset channel on line 6221:

2008-02-21 21:13:15,119 WARN  [org.jgroups.protocols.pbcast.NAKACK]
172.16.172.233:19182] discarded message from non-member
172.16.172.234:19182, my view is [172.16.172.233:19182|8]
[172.16.172.233:19182]

I was hoping you could glean a pattern from this stuff.
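
One way to look for such a pattern is to tally the NAKACK "non-member"
warnings per sender and view id. The following is only a rough sketch, not
anything shipped with JGroups: the class name NonMemberWarnings and the
method tally are made up for illustration, and it assumes each warning sits
on a single line in the real log file (the wrapping above comes from the
mail client).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts "discarded message from non-member" warnings.
final class NonMemberWarnings {
    // Groups: 1 = discarded sender, 2 = view creator, 3 = view id, 4 = member list.
    private static final Pattern WARNING = Pattern.compile(
        "discarded message from non-member (\\S+), my view is "
        + "\\[([^|\\]]+)\\|(\\d+)\\] \\[([^\\]]*)\\]");

    // Returns "sender viewId=N" -> number of matching warnings, sorted by key.
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                String key = m.group(1) + " viewId=" + m.group(3);
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Usage (assumed invocation): java NonMemberWarnings <path-to-log>
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        tally(lines).forEach((key, count) -> System.out.println(count + "\t" + key));
    }
}
```

If the view id in the output never changes while the count keeps growing,
that would match what I describe above: app1 stays in a single-member view
and keeps discarding messages from app2.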

> Unable to recover from suspect/merge, with auto-reconnect
> ---------------------------------------------------------
>
>                 Key: JGRP-692
>                 URL: http://jira.jboss.com/jira/browse/JGRP-692
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.1
>         Environment: JBoss 4.2.2 on Linux 2.4.21-47.0.1.ELsmp i686 using Java HotSpot(TM) Server VM (build 1.5.0_10-b03, mixed mode)
>            Reporter: Matt Magoffin
>         Assigned To: Bela Ban
>         Attachments: jgroups-logs.tbz2
>
>
> I'm having an issue with a 2-machine cluster using a TCP stack
> based on the tcp.xml from JGroups 2.6.1. On each machine I have 8 separate
> channels running, on different ports, with 4 groups in 2 JVM instances.
> After some period of time, one machine will fail to respond to an FD ping,
> and gets suspected. The machine that failed is not responding in time, it
> seems, because of high CPU use, and many of the channels will fail FD around
> the same time. The channels are configured with auto-reconnect. My
> understanding was that the channel should "heal" itself and eventually
> re-form into a new view with the same 2 members in the cluster, which
> should apply to this situation because the machine that failed to respond
> eventually will respond.
> However, the group does not always seem to "heal" (sometimes it does,
> sometimes not). Once it stops healing, it never seems to do so again,
> and I get tons of NAKACK "message X not found in retransmission table"
> ERROR logs. The only way to get the channel working again is to shut down
> the channel on both machines and then start them up again.
> I'm not using muxed channels... just normal channels. 
> I have for now disabled shunning, and the channels seem to be able to
> re-connect after a shun situation occurs, but after time I'm still seeing
> something wrong with the channel in that the nodes are not able to send
> messages to each other successfully, and I have tons of
> 2008-02-21 21:01:53,572 WARN  [org.jgroups.protocols.pbcast.NAKACK]
> 172.16.172.233:19182] discarded message from non-member
> 172.16.172.234:19182, my view is [172.16.172.233:19182|8]
> [172.16.172.233:19182]
> log entries, even while FD is receiving acks on that channel.
> This is a degradation problem for me while my servers are running,
> not just during deployment. After a while of running, the nodes get
> shunned/disconnected somehow (often from a slow response from the other
> node) and then fail to ever merge back to form the original view of the 2
> nodes again.
> Now even with shun set to false in both FD and pbcast.GMS on both nodes
> for all channels, the nodes still eventually reach the same state of never
> re-forming... and I eventually have tons of
> 2008-02-21 21:28:24,857 WARN  [org.jgroups.protocols.pbcast.NAKACK]
> 172.16.172.233:19182] discarded message from non-member
> 172.16.172.234:19182, my view is [172.16.172.233:19182|8]
> [172.16.172.233:19182]
> After a while in my application I'll force the channel to close, wait a
> minute, and reopen, and sometimes this gets the channel working again, but
> it doesn't always seem to.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira