[jboss-jira] [JBoss JIRA] Commented: (JGRP-692) Unable to recover from suspect/merge, with auto-reconnect

Tue Mar 18 03:01:50 EDT 2008

    [ http://jira.jboss.com/jira/browse/JGRP-692?page=comments#action_12403295 ] 

Matt Magoffin commented on JGRP-692:
------------------------------------

So on production so far, this is looking promising. I've seen successful merge events occur after socket exceptions:

2008-03-18 00:46:44,508 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 172.16.172.234:19282: java.net.SocketException: Broken pipe
2008-03-18 00:46:44,519 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 172.16.172.234:19282: java.net.SocketException: Socket closed
...
2008-03-18 00:46:47,489 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19282, coord_addr=172.16.172.234:19282, is_server=true], [own_addr=172.16.172.233
:19282, coord_addr=172.16.172.233:19282, is_server=true]]
2008-03-18 00:46:47,489 DEBUG [org.jgroups.protocols.MERGE2] found multiple coordinators: [172.16.172.234:19282, 172.16.172.233:19282]; sending up MERGE event
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19182, coord_addr=172.16.172.233:19182, is_server=true], [own_addr=172.16.172.233
:19182, coord_addr=172.16.172.233:19182, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19183, coord_addr=172.16.172.233:19183, is_server=true], [own_addr=172.16.172.233
:19183, coord_addr=172.16.172.233:19183, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19082, coord_addr=172.16.172.233:19082, is_server=true], [own_addr=172.16.172.233
:19082, coord_addr=172.16.172.233:19082, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19382, coord_addr=172.16.172.234:19382, is_server=true], [own_addr=172.16.172.233
:19382, coord_addr=172.16.172.233:19382, is_server=true]]
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] found multiple coordinators: [172.16.172.234:19382, 172.16.172.233:19382]; sending up MERGE event
2008-03-18 00:46:47,499 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19083, coord_addr=172.16.172.233:19083, is_server=true], [own_addr=172.16.172.233
:19083, coord_addr=172.16.172.233:19083, is_server=true]]
2008-03-18 00:46:47,500 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=172.16.172.234:19383, coord_addr=172.16.172.233:19383, is_server=true], [own_addr=172.16.172.233
:19383, coord_addr=172.16.172.233:19383, is_server=true]]
2008-03-18 00:46:47,556 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task started, coordinators are [172.16.172.234:19282, 172.16.172.233:19282]
2008-03-18 00:46:47,556 DEBUG [org.jgroups.protocols.pbcast.GMS] sending MERGE_REQ to [172.16.172.234:19282, 172.16.172.233:19282]
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 10000 msecs for merge responses
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=200000
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] sender=172.16.172.233:19282, merge_id=[172.16.172.233:19282|1205819207556]
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] response=GmsHeader[MERGE_RSP]: view=[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], digest=17
2.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1047 (1047)], merge_rejected=false, merge_id=[172.16.172.233:19282|1205819207556]
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=1
2008-03-18 00:46:47,557 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 10000 msecs for merge responses
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task started, coordinators are [172.16.172.234:19382, 172.16.172.233:19382]
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] sending MERGE_REQ to [172.16.172.234:19382, 172.16.172.233:19382]
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 10000 msecs for merge responses
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=200000
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] sender=172.16.172.233:19382, merge_id=[172.16.172.233:19382|1205819207559]
2008-03-18 00:46:47,559 DEBUG [org.jgroups.protocols.pbcast.GMS] response=GmsHeader[MERGE_RSP]: view=[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], digest=17
2.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1933 (1933)], merge_rejected=false, merge_id=[172.16.172.233:19382|1205819207559]
2008-03-18 00:46:47,560 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=1
2008-03-18 00:46:47,560 DEBUG [org.jgroups.protocols.pbcast.GMS] waiting 9999 msecs for merge responses
2008-03-18 00:46:47,592 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=2
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] num_rsps_expected=2, actual responses=2
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.233:19282, view=[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282
], digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1047 (1047)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.233:19382, view=[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382
], digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1933 (1933)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.234:19282, view=[172.16.172.234:19282|4] [172.16.172.234:19282], digest=172.16.172.2
34:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] merge data is sender=172.16.172.234:19382, view=[172.16.172.234:19382|6] [172.16.172.234:19382], digest=172.16.172.2
34:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] new merged view will be MergeView::[172.16.172.233:19282|5] [172.16.172.233:19282, 172.16.172.234:19282], subgroups=
[[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], [172.16.172.234:19282|4] [172.16.172.234:19282]]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] new merged view will be MergeView::[172.16.172.233:19382|7] [172.16.172.233:19382, 172.16.172.234:19382], subgroups=
[[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], [172.16.172.234:19382|6] [172.16.172.234:19382]]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] consolidated digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,593 DEBUG [org.jgroups.protocols.pbcast.GMS] consolidated digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task terminated
2008-03-18 00:46:47,594 DEBUG [org.jgroups.protocols.pbcast.GMS] merge task terminated
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.GMS] view=MergeView::[172.16.172.233:19282|5] [172.16.172.233:19282, 172.16.172.234:19282], subgroups=[[172.16.172.233:19
282|3] [172.16.172.233:19282, 172.16.172.234:19282], [172.16.172.234:19282|4] [172.16.172.234:19282]], digest=172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043
 : 1048 (1048)]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.GMS] view=MergeView::[172.16.172.233:19382|7] [172.16.172.233:19382, 172.16.172.234:19382], subgroups=[[172.16.172.233:19
382|5] [172.16.172.233:19382, 172.16.172.234:19382], [172.16.172.234:19382|6] [172.16.172.234:19382]], digest=172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 
1934 (1934)]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.NAKACK] existing digest:  172.16.172.233:19282: [8391 : 8450 (8450)], 172.16.172.234:19282: [1043 : 1047 (1047)]
new digest:       172.16.172.233:19282: [8391 : 8449 (8449)], 172.16.172.234:19282: [1043 : 1048 (1048)]
resulting digest: 172.16.172.233:19282: [8391 : 8450 (8450)], 172.16.172.234:19282: [1043 : 1048 (1048)]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=172.16.172.233:19282] view is MergeView::[172.16.172.233:19282|5] [172.16.172.233:19282, 172.16.172.234:
19282], subgroups=[[172.16.172.233:19282|3] [172.16.172.233:19282, 172.16.172.234:19282], [172.16.172.234:19282|4] [172.16.172.234:19282]]
2008-03-18 00:46:47,649 DEBUG [org.jgroups.protocols.pbcast.NAKACK] existing digest:  172.16.172.233:19382: [446 : 449 (449)], 172.16.172.234:19382: [1916 : 1933 (1933)]
new digest:       172.16.172.233:19382: [446 : 448 (448)], 172.16.172.234:19382: [1916 : 1934 (1934)]
resulting digest: 172.16.172.233:19382: [446 : 449 (449)], 172.16.172.234:19382: [1916 : 1934 (1934)]
2008-03-18 00:46:47,650 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=172.16.172.233:19382] view is MergeView::[172.16.172.233:19382|7] [172.16.172.233:19382, 172.16.172.234:
19382], subgroups=[[172.16.172.233:19382|5] [172.16.172.233:19382, 172.16.172.234:19382], [172.16.172.234:19382|6] [172.16.172.234:19382]]

And afterwards, the channel seems to get back to normal.

> Unable to recover from suspect/merge, with auto-reconnect
> ---------------------------------------------------------
>
>                 Key: JGRP-692
>                 URL: http://jira.jboss.com/jira/browse/JGRP-692
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.1
>         Environment: JBoss 4.2.2 on Linux 2.4.21-47.0.1.ELsmp i686 using Java HotSpot(TM) Server VM (build 1.5.0_10-b03, mixed mode)
>            Reporter: Matt Magoffin
>         Assigned To: Bela Ban
>             Fix For: 2.6.3
>
>         Attachments: jgroups-logs.tbz2
>
>
> I'm having an issue with a 2-machine cluster using a TCP stack
> based on the tcp.xml from JGroups 2.6.1. On each machine I have 8 separate
> channels running, on different ports, with 4 groups in 2 JVM instances.
> After some period of time, one machine will fail to respond to a FD ping,
> and gets suspected. The machine that failed is not responding in time it
> seems from high CPU use, and many of the channels will fail FD around the
> same time. The channels are configured with auto-reconnect. My
> understanding was that the channel should "heal" itself and eventually
> re-form into a new view with the same 2 members in the cluster, which
> should apply to this situation because the machine that failed to respond
> eventually will respond.
> However, the group does not always seem to "heal" (sometimes it does,
> sometimes not). Once it stops healing, it never seems to ever do so again,
> and I get tons of NAKACK "message X not found in retransmission table"
> ERROR logs. The only way to get the channel working agin is to shut down
> the channel on both machines and then start them up again.
> I'm not using muxed channels... just normal channels. 
> I have for now disabled shunning, and the channels seem to be able to
> re-connect after a shun situation occurs, but after time I'm still seeing
> something wrong with the channel in that the nodes are not able to send
> messages to each other successfully, and I have tons of
> 2008-02-21 21:01:53,572 WARN  [org.jgroups.protocols.pbcast.NAKACK]
> 172.16.172.233:19182] discarded message from non-member
> 172.16.172.234:19182, my view is [172.16.172.233:19182|8]
> [172.16.172.233:19182]
> log entries, even while FD is receiving acks on that channel.
> this is a degradation problem for me while my servers are running,
> not just during deployment. After a while of running, the nodes get
> shunned/disconnected some how (often from a slow response from the other
> node) and then fail to ever merge back to form the original view of the 2
> nodes again.
> Now even with shun set to false in both FD and pbcast.GMS on both nodes
> for all channels, eventually still the nodes reach the same state of never
> re-forming... and I have tons eventual
> 2008-02-21 21:28:24,857 WARN  [org.jgroups.protocols.pbcast.NAKACK]
> 172.16.172.233:19182] discarded message from non-member
> 172.16.172.234:19182, my view is [172.16.172.233:19182|8]
> [172.16.172.233:19182]
> After a while in my application I'll force the channel to close wait a
> minute, and reopen, and sometimes this gets the channel working again, but
> doesn't always seem to.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira