Bela Ban updated JGRP-1323:
---------------------------
Fix Version/s: 2.12.2, 3.1
MERGE2 not getting all the coordinators from PING
-------------------------------------------------
Key: JGRP-1323
URL: https://issues.jboss.org/browse/JGRP-1323
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Windows, Linux
Reporter: vivek v
Assignee: Bela Ban
Fix For: 2.12.2, 3.1
Attachments: Partition1_log.txt, Partition2_log.txt, pktmtntunnel.xml
We have 17 nodes in our group. After some nodes went down and came back up, the group split
into two partitions. One partition had only 2 nodes in it and the other had 16. Even after all
the nodes were up again, the two partitions never merged. Here is what I see in
the logs:
1) On Partition 1 (coordinator: manager_192.168.50.22)
{noformat}
2011-05-11 01:18:48,980 TRACE [Timer-1,192.168.50.22_group,manager_192.168.50.22:4576]
PING - discovery took 5002 ms: responses: 7 total (7 servers (1 coord), 0 clients)
2011-05-11 01:18:48,980 TRACE [Timer-1,192.168.50.22_group,manager_192.168.50.22:4576]
MERGE2 - Discovery results:
[manager_192.168.50.22:4576]: [manager_192.168.50.22:4576|19]
[manager_192.168.50.22:4576, probe_192.168.50.80:4576]
[probe_192.168.50.65:4576]: MergeView::[probe_192.168.50.66:4576|30]
[probe_192.168.50.66:4576, probe_192.168.50.64:4576, probe_192.168.50.59:4576,
probe_192.168.50.58:4576, probe_192.168.50.81:4576, probe_192.168.50.63:4576,
probe_192.168.50.80:4576, collector_192.168.50.23:4576, probe_192.168.50.65:4576,
probe_192.168.50.62:4576, probe_192.168.50.60:4576, probe_192.168.50.69:4576,
probe_192.168.50.68:4576, probe_192.168.50.83:4576, probe_192.168.50.82:4576,
probe_192.168.50.61:4576, probe_192.168.50.67:4576], subgroups=[..
..
2011-05-11 01:18:49,034 DEBUG [ViewHandler,192.168.50.22_group,manager_192.168.50.22:4576]
GMS - determining merge leader from [probe_192.168.50.68:4576, probe_192.168.50.66:4576,
probe_192.168.50.61:4576, manager_192.168.50.22:4576, probe_192.168.50.69:4576,
probe_192.168.50.58:4576, probe_192.168.50.65:4576, probe_192.168.50.60:4576,
probe_192.168.50.64:4576, probe_192.168.50.81:4576, probe_192.168.50.67:4576,
collector_192.168.50.23:4576, probe_192.168.50.63:4576, probe_192.168.50.82:4576,
probe_192.168.50.83:4576, probe_192.168.50.62:4576, probe_192.168.50.59:4576]
2011-05-11 01:18:49,036 DEBUG
[ViewHandler,192.168.50.22_group,manager_192.168.50.22:4576] GMS - I
(manager_192.168.50.22:4576) am not the merge leader, waiting for merge leader
(probe_192.168.50.66:4576) to initiate merge
{noformat}
2) On Partition 2 (coordinator: probe_192.168.50.66)
{noformat}
2011-05-10 21:03:07,334 TRACE [Timer-1,192.168.50.22_group,probe_192.168.50.66:4576] PING
- discovery took 47 ms: responses: 17 total (17 servers (1 coord), 0 clients)
2011-05-10 21:03:07,335 TRACE [Timer-1,192.168.50.22_group,probe_192.168.50.66:4576]
MERGE2 - Discovery results:
[probe_192.168.50.66:4576]: MergeView::[probe_192.168.50.66:4576|30]
[probe_192.168.50.66:4576, probe_192.168.50.64:4576, probe_192.168.50.59:4576,
probe_192.168.50.58:4576, probe_192.168.50.81:4576, probe_192.168.50.63:4576,
probe_192.168.50.80:4576, collector_192.168.50.23:4576, probe_192.168.50.65:4576,
probe_192.168.50.62:4576, probe_192.168.50.60:4576, probe_192.168.50.69:4576,
probe_192.168.50.68:4576, probe_192.168.50.83:4576, probe_192.168.50.82:4576,
probe_192.168.50.61:4576, probe_192.168.50.67:4576], subgroups=[[..
{noformat}
The FIND_INITIAL_MBRS discovery by the second coordinator (probe_192.168.50.66) never gets the
manager in its list. We are using the Tunnel protocol stack with two Gossip Routers. Both
coordinators are talking to the same GR. Here is our PING and MERGE2 configuration:
{code:xml}
<PING timeout="5000"
num_initial_members="3"/>
<MERGE2 max_interval="30000" min_interval="10000"/>
{code}
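To make the difference between the two discovery results concrete, here is a minimal Java sketch. It is not JGroups code: the class `DiscoverySketch` and the helper `distinctCoordinators` are hypothetical names, and the maps contain only the responder/coordinator pairs visible in the MERGE2 log excerpts above (the elided entries are left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DiscoverySketch {
    // Hypothetical helper (not a JGroups API): given a map from each responder
    // to the coordinator of the view it reported, return the distinct coordinators.
    static Set<String> distinctCoordinators(Map<String, String> responderToCoord) {
        return new TreeSet<>(responderToCoord.values());
    }

    public static void main(String[] args) {
        String manager = "manager_192.168.50.22:4576";
        String probe65 = "probe_192.168.50.65:4576";
        String probe66 = "probe_192.168.50.66:4576";

        // Discovery results logged by MERGE2 on Partition 1 (only the entries shown in the log).
        Map<String, String> seenByManager = new LinkedHashMap<>();
        seenByManager.put(manager, manager);  // view [manager_192.168.50.22:4576|19]
        seenByManager.put(probe65, probe66);  // MergeView [probe_192.168.50.66:4576|30]

        // Discovery results logged by MERGE2 on Partition 2 (only the entry shown in the log).
        Map<String, String> seenByProbe66 = new LinkedHashMap<>();
        seenByProbe66.put(probe66, probe66);  // MergeView [probe_192.168.50.66:4576|30]

        System.out.println("Partition 1 sees coordinators: " + distinctCoordinators(seenByManager));
        System.out.println("Partition 2 sees coordinators: " + distinctCoordinators(seenByProbe66));
    }
}
```

With the entries shown in the logs, the manager's result contains two distinct coordinators while the result on probe_192.168.50.66 contains only one, which matches the observation that only Partition 1 notices the split.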
On coordinator 2 I also see:
{noformat}
2011-05-10 21:05:30,959 WARN [OOB-2330,192.168.50.22_group,probe_192.168.50.66:4576]
NAKACK - probe_192.168.50.66:4576: dropped message from manager_192.168.50.22:4576 (not in
xmit_table), keys are [probe_192.168.50.60:4576, probe_192.168.50.63:4576, ...
...
{noformat}
MERGE2 seems to rely on discovery to get the initial members, but for some reason the two
partitions are getting different results. Could it be that Partition 1 has only two
members, so it waits for the other members to report their views as well, whereas Partition 2
has a long list of members, so it gets the view from them and doesn't wait for the
Partition 1 list?
Partition 1 also seems to wait for the merge leader. Do we need to do that? In this case
the merge leader never initiates the merge, and thus the merge between the two subgroups never
happens.
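The suspected stall can be written down as a toy model in Java. This is only a sketch of my reading of the logs, not the actual GMS or MERGE2 logic: `MergeStallSketch` and `mergeStarts` are hypothetical names, the rule "a merge starts only if the leader itself sees more than one coordinator" is an assumption, and the merge leader is taken directly from the GMS log line on Partition 1 rather than computed.

```java
import java.util.Map;
import java.util.Set;

public class MergeStallSketch {
    // Assumed rule, for illustration only: the merge is initiated only when the
    // designated merge leader has itself discovered more than one coordinator.
    static boolean mergeStarts(String leader, Map<String, Set<String>> coordsSeenBy) {
        Set<String> seen = coordsSeenBy.get(leader);
        return seen != null && seen.size() > 1;
    }

    public static void main(String[] args) {
        String manager = "manager_192.168.50.22:4576";
        String probe66 = "probe_192.168.50.66:4576";

        // Coordinators each side discovered, as read from the MERGE2 log excerpts.
        Map<String, Set<String>> coordsSeenBy = Map.of(
                manager, Set.of(manager, probe66),
                probe66, Set.of(probe66));

        // Merge leader as reported by GMS on Partition 1.
        String leader = probe66;

        // The manager defers to the leader, but the leader sees no second coordinator.
        System.out.println("merge starts: " + mergeStarts(leader, coordsSeenBy));
    }
}
```

Under these assumptions the program prints `merge starts: false`: the only coordinator that detects the split is not the leader, and the leader has no reason to act.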
I'm not sure where the exact problem is, but I'm opening this as a bug because we've
seen disjoint groups never merging on quite a few occasions.
Attached are detailed logs from the two coordinators.