[ http://jira.jboss.com/jira/browse/JGRP-692?page=comments#action_12403295 ]
Matt Magoffin commented on JGRP-692:
------------------------------------
So far in production, this is looking promising: I've seen successful merge events occur after socket exceptions:
2008-03-18 00:46:44,508 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 172.16.172.234:19282: java.net.SocketException: Broken pipe
2008-03-18 00:46:44,519 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 172.16.172.234:19282: java.net.SocketException: Socket closed
...
2008-03-18 00:46:47,489 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19282, coord_addr=172.16.172.234:19282, is_server=true], [own_addr=172.16.172.233:19282, coord_addr=172.16.172.233:19282, is_server=true]]
2008-03-18 00:46:47,489 DEBUG [org.jgroups.protocols.MERGE2] found multiple coordinators: [172.16.172.234:19282, 172.16.172.233:19282]; sending up MERGE event
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19182, coord_addr=172.16.172.233:19182, is_server=true], [own_addr=172.16.172.233:19182, coord_addr=172.16.172.233:19182, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19183, coord_addr=172.16.172.233:19183, is_server=true], [own_addr=172.16.172.233:19183, coord_addr=172.16.172.233:19183, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19082, coord_addr=172.16.172.233:19082, is_server=true], [own_addr=172.16.172.233:19082, coord_addr=172.16.172.233:19082, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19382, coord_addr=172.16.172.234:19382, is_server=true], [own_addr=172.16.172.233:19382, coord_addr=172.16.172.233:19382, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] found multiple coordinators: [172.16.172.234:19382, 172.16.172.233:19382]; sending up MERGE event
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19083, coord_addr=172.16.172.233:19083, is_server=true], [own_addr=172.16.172.233:19083, coord_addr=172.16.172.233:19083, is_server=true]]
2008-03-18 00:46:47,500 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19383, coord_addr=172.16.172.233:19383, is_server=true], [own_addr=172.16.172.233:19383, coord_addr=172.16.172.233:19383, is_server=true]]
2008-03-18 00:46:47,556 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task started, coordinators are [172.16.172.234:19282, 172.16.172.233:19282]
2008-03-18 00:46:47,556 DEBUG [org.jgroups.protocols.pbcast.GMS] sending MERGE_REQ to [172.16.172.234:19282, 172.16.172.233:19282]
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 10000 msecs for merge responses
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=200000
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] sender=172.16.172.233:19282, merge_id=[172.16.172.233:19282|1205819207556]
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] response=GmsHeader[MERGE_RSP]: view=[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1047 (1047)], merge_rejected=false, merge_id=[172.16.172.233:19282|1205819207556]
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=1
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 10000 msecs for merge responses
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task started, coordinators are [172.16.172.234:19382, 172.16.172.233:19382]
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] sending MERGE_REQ to [172.16.172.234:19382, 172.16.172.233:19382]
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 10000 msecs for merge responses
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=200000
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] sender=172.16.172.233:19382, merge_id=[172.16.172.233:19382|1205819207559]
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] response=GmsHeader[MERGE_RSP]: view=[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1933 (1933)], merge_rejected=false, merge_id=[172.16.172.233:19382|1205819207559]
2008-03-18 00:46:47,560 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=1
2008-03-18 00:46:47,560 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 9999 msecs for merge responses
2008-03-18 00:46:47,592 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=2
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=2
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.233:19282, view=[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1047 (1047)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.233:19382, view=[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1933 (1933)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.234:19282, view=[172.16.172.234:19282|4] [172.16.172.234:19282], digest=172.16.172.234:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.234:19382, view=[172.16.172.234:19382|6] [172.16.172.234:19382], digest=172.16.172.234:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] new merged view will be MergeView::[172.16.172.233:19282|5] [172.16.172.233:19282, 172.16.172.234:19282], subgroups=[[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], [172.16.172.234:19282|4] [172.16.172.234:19282]]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] new merged view will be MergeView::[172.16.172.233:19382|7] [172.16.172.233:19382, 172.16.172.234:19382], subgroups=[[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], [172.16.172.234:19382|6] [172.16.172.234:19382]]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] consolidated digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] consolidated digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task terminated
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task terminated
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.GMS] view=MergeView::[172.16.172.233:19282|5] [172.16.172.233:19282, 172.16.172.234:19282], subgroups=[[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], [172.16.172.234:19282|4] [172.16.172.234:19282]], digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.GMS] view=MergeView::[172.16.172.233:19382|7] [172.16.172.233:19382, 172.16.172.234:19382], subgroups=[[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], [172.16.172.234:19382|6] [172.16.172.234:19382]], digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.NAKACK] existing digest: 172.16.172.233:19282: [8391 : 8450 (8450)], 172.16.172.234:19282: [1043 : 1047 (1047)]
new digest: 172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1048 (1048)]
resulting digest: 172.16.172.233:19282: [8391 : 8450 (8450)], 172.16.172.234:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=172.16.172.233:19282] view is MergeView::[172.16.172.233:19282|5] [172.16.172.233:19282, 172.16.172.234:19282], subgroups=[[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], [172.16.172.234:19282|4] [172.16.172.234:19282]]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.NAKACK] existing digest: 172.16.172.233:19382: [446 : 449 (449)], 172.16.172.234:19382: [1916 : 1933 (1933)]
new digest: 172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1934 (1934)]
resulting digest: 172.16.172.233:19382: [446 : 449 (449)], 172.16.172.234:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,650 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=172.16.172.233:19382] view is MergeView::[172.16.172.233:19382|7] [172.16.172.233:19382, 172.16.172.234:19382], subgroups=[[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], [172.16.172.234:19382|6] [172.16.172.234:19382]]
And afterwards, the channel seems to get back to normal.
Unable to recover from suspect/merge, with auto-reconnect
---------------------------------------------------------
Key: JGRP-692
URL: http://jira.jboss.com/jira/browse/JGRP-692
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.1
Environment: JBoss 4.2.2 on Linux 2.4.21-47.0.1.ELsmp i686 using Java HotSpot(TM) Server VM (build 1.5.0_10-b03, mixed mode)
Reporter: Matt Magoffin
Assigned To: Bela Ban
Fix For: 2.6.3
Attachments: jgroups-logs.tbz2
I'm having an issue with a 2-machine cluster using a TCP stack
based on the tcp.xml from JGroups 2.6.1. On each machine I have 8 separate
channels running on different ports, with 4 groups in 2 JVM instances.
After some period of time, one machine fails to respond to an FD ping
and gets suspected. The machine that failed appears not to respond in time
because of high CPU use, and many of the channels fail FD around the
same time. The channels are configured with auto-reconnect. My
understanding was that the channel should "heal" itself and eventually
re-form into a new view with the same 2 members in the cluster, which
should apply to this situation because the machine that failed to respond
does eventually respond.
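For context, the suspect-and-shun behavior described here is driven by the FD and pbcast.GMS protocols in the stack. The following is only an illustrative sketch of the relevant fragment of a 2.6-era tcp.xml (attribute names from the JGroups 2.x protocol documentation; the timeout values are placeholders, not the reporter's actual settings):

```xml
<!-- Illustrative fragment of a JGroups 2.6-style TCP stack.            -->
<!-- FD suspects a member after max_tries missed heartbeat replies;     -->
<!-- with shun="true", a suspected member that later reappears is       -->
<!-- shunned and must leave and rejoin the group.                       -->
<FD timeout="10000" max_tries="5" shun="true"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true"/>
```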
However, the group does not always seem to "heal" (sometimes it does,
sometimes not). Once it stops healing, it never seems to do so again,
and I get tons of NAKACK "message X not found in retransmission table"
ERROR logs. The only way to get the channel working again is to shut down
the channel on both machines and then start them up again.
I'm not using muxed channels... just normal channels.
I have for now disabled shunning, and the channels seem to be able to
reconnect after a shun situation occurs, but over time I'm still seeing
something wrong with the channel: the nodes are not able to send
messages to each other successfully, and I get tons of
2008-02-21 21:01:53,572 WARN [org.jgroups.protocols.pbcast.NAKACK] 172.16.172.233:19182] discarded message from non-member 172.16.172.234:19182, my view is [172.16.172.233:19182|8] [172.16.172.233:19182]
log entries, even while FD is receiving acks on that channel.
This is a degradation problem for me while my servers are running,
not just during deployment. After running for a while, the nodes get
shunned/disconnected somehow (often from a slow response from the other
node) and then fail to ever merge back to form the original view of the 2
nodes again.
Now even with shun set to false in both FD and pbcast.GMS on both nodes
for all channels, the nodes eventually still reach the same state of never
re-forming... and I eventually get tons of
2008-02-21 21:28:24,857 WARN [org.jgroups.protocols.pbcast.NAKACK] 172.16.172.233:19182] discarded message from non-member 172.16.172.234:19182, my view is [172.16.172.233:19182|8] [172.16.172.233:19182]
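Disabling shunning as described would typically mean turning the shun attribute off on both protocols. A hedged sketch of what that stack change looks like (attribute names from the 2.6-era protocol docs; other attributes omitted for brevity):

```xml
<!-- Shunning disabled on both the failure detector and the group      -->
<!-- membership protocol, so a suspected-then-returning member is not  -->
<!-- forced to leave and rejoin the group.                             -->
<FD timeout="10000" max_tries="5" shun="false"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000" shun="false"/>
```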
After a while, in my application I'll force the channel to close, wait a
minute, and reopen it. Sometimes this gets the channel working again, but
not always.
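The close/wait/reopen workaround above amounts to a retry loop. The sketch below shows that shape; `ReconnectableChannel` is a hypothetical stand-in for the JGroups `Channel` API (`close()`, `connect()`), not a real JGroups class, and the toy implementation in `main` only simulates a channel that recovers on the second attempt:

```java
// Sketch of the reporter's workaround: force-close a channel that has
// stopped merging, pause, then reopen it and rejoin the group.
// ReconnectableChannel is a hypothetical stand-in for org.jgroups.Channel.
interface ReconnectableChannel {
    void connect(String group) throws Exception;
    void close();
    boolean isConnected();
}

public class ChannelRecycler {

    /** Close, pause, and reconnect; retry a few times before giving up. */
    public static boolean recycle(ReconnectableChannel ch, String group,
                                  long pauseMillis, int maxAttempts)
            throws InterruptedException {
        ch.close();
        for (int i = 0; i < maxAttempts; i++) {
            Thread.sleep(pauseMillis);
            try {
                ch.connect(group);
                if (ch.isConnected()) {
                    return true;
                }
            } catch (Exception e) {
                // connect failed; fall through to pause and retry
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Toy in-memory channel that fails the first connect attempt,
        // simulating a peer that is temporarily unresponsive.
        ReconnectableChannel ch = new ReconnectableChannel() {
            private int attempts;
            private boolean connected;
            public void connect(String g) throws Exception {
                if (++attempts < 2) throw new Exception("not ready");
                connected = true;
            }
            public void close() { connected = false; }
            public boolean isConnected() { return connected; }
        };
        System.out.println(recycle(ch, "demo", 10, 3)); // prints "true"
    }
}
```

In a real deployment the pause would be on the order of the minute the reporter describes, and a bounded retry count prevents the loop from spinning forever when the remote node never comes back.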
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira