[jboss-jira] [JBoss JIRA] Commented: (JGRP-692) Unable to recover from suspect/merge, with auto-reconnect

Mon Feb 25 14:16:42 EST 2008

    [ http://jira.jboss.com/jira/browse/JGRP-692?page=comments#action_12400510 ] 

Matt Magoffin commented on JGRP-692:
------------------------------------

If you look in the attached log files, you'll see:

2008-02-21 21:13:10,947 INFO  [org.jgroups.JChannel] JGroups version: 2.6.1

We are definitely using 2.6.1, not 2.4.x. Also, in the FD class on line 261 of the 2.6.1 release, I see:

                            synchronized(this) {
                                Address previewNextPingDest = (Address)getPingDest(pingable_mbrs);
                                /* We are only interested to stop or restart the monitor thread iff the current target ping_dest is going
                                   change */
                                if(log.isDebugEnabled()) log.debug("Recevied Ack. is invalid (was from: " + hdr.from + "), ");
                                if ((previewNextPingDest != null && ping_dest != null && !previewNextPingDest.equals(ping_dest)) ||
                                        (previewNextPingDest != null && ping_dest == null) ||
                                        (previewNextPingDest == null && ping_dest != null)) {

Do you think this is still a duplicate of 699?
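
As an aside on the snippet above: the three-clause condition appears to test only whether the previewed ping target differs from the current one, with nulls allowed on either side. Below is a minimal sketch of that reading. It is not JGroups code. The class PingDestChange and its method names are hypothetical, plain strings stand in for Address objects, and Objects.equals needs Java 7 or later, so the compact form would not compile on the 1.5 VM mentioned in this issue.

```java
import java.util.Objects;

/** Hypothetical helper, not part of JGroups: two ways to detect a ping target change. */
final class PingDestChange {

    /** The three-clause test, transcribed from the quoted FD snippet. */
    static boolean changedVerbose(Object next, Object current) {
        return (next != null && current != null && !next.equals(current))
                || (next != null && current == null)
                || (next == null && current != null);
    }

    /** Null-safe form that should be equivalent (assumes Java 7+ for Objects.equals). */
    static boolean changedCompact(Object next, Object current) {
        return !Objects.equals(next, current);
    }

    public static void main(String[] args) {
        // Stand-in values; real code would compare org.jgroups.Address instances.
        String[] candidates = { null, "172.16.172.233:19182", "172.16.172.234:19182" };
        for (String next : candidates) {
            for (String current : candidates) {
                boolean verbose = changedVerbose(next, current);
                boolean compact = changedCompact(next, current);
                System.out.println("current=" + current + ", next=" + next
                        + ", changed=" + verbose + ", forms agree=" + (verbose == compact));
            }
        }
    }
}
```

Under this reading, the monitor would be stopped or restarted only when the target actually changes, including a change to or from no target at all.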

> Unable to recover from suspect/merge, with auto-reconnect
> ---------------------------------------------------------
>
>                 Key: JGRP-692
>                 URL: http://jira.jboss.com/jira/browse/JGRP-692
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.1
>         Environment: JBoss 4.2.2 on Linux 2.4.21-47.0.1.ELsmp i686 using Java HotSpot(TM) Server VM (build 1.5.0_10-b03, mixed mode)
>            Reporter: Matt Magoffin
>         Assigned To: Bela Ban
>             Fix For: 2.6.2
>
>         Attachments: jgroups-logs.tbz2
>
>
> I'm having an issue with a 2-machine cluster using a TCP stack
> based on the tcp.xml from JGroups 2.6.1. On each machine I have 8 separate
> channels running, on different ports, with 4 groups in 2 JVM instances.
> After some period of time, one machine will fail to respond to an FD ping
> and gets suspected. The failing machine does not seem to respond in time
> because of high CPU use, and many of the channels will fail FD around the
> same time. The channels are configured with auto-reconnect. My
> understanding was that the channel should "heal" itself and eventually
> re-form into a new view with the same 2 members in the cluster, which
> should apply to this situation because the machine that failed to respond
> eventually will respond.
> However, the group does not always seem to "heal" (sometimes it does,
> sometimes not). Once it stops healing, it never seems to do so again,
> and I get tons of NAKACK "message X not found in retransmission table"
> ERROR logs. The only way to get the channel working again is to shut down
> the channel on both machines and then start them up again.
> I'm not using muxed channels... just normal channels. 
> For now I have disabled shunning, and the channels seem to be able to
> re-connect after a shun situation occurs, but after some time I'm still seeing
> something wrong with the channel in that the nodes are not able to send
> messages to each other successfully, and I have tons of
> 2008-02-21 21:01:53,572 WARN  [org.jgroups.protocols.pbcast.NAKACK]
> 172.16.172.233:19182] discarded message from non-member
> 172.16.172.234:19182, my view is [172.16.172.233:19182|8]
> [172.16.172.233:19182]
> log entries, even while FD is receiving acks on that channel.
> This is a degradation problem for me while my servers are running,
> not just during deployment. After running for a while, the nodes get
> shunned/disconnected somehow (often because of a slow response from the other
> node) and then never merge back to form the original view of the 2
> nodes again.
> Now even with shun set to false in both FD and pbcast.GMS on both nodes
> for all channels, the nodes still eventually reach the same state of never
> re-forming, and I eventually get tons of
> 2008-02-21 21:28:24,857 WARN  [org.jgroups.protocols.pbcast.NAKACK]
> 172.16.172.233:19182] discarded message from non-member
> 172.16.172.234:19182, my view is [172.16.172.233:19182|8]
> [172.16.172.233:19182]
> After a while in my application I'll force the channel to close, wait a
> minute, and reopen. Sometimes this gets the channel working again, but
> not always.
