[JBoss JIRA] (JGRP-1876) MERGE3 : Strange number and content of subgroups

Tuesday, 13 January 2015

    [ https://issues.jboss.org/browse/JGRP-1876?page=com.atlassian.jira.plugin.... ]

Bela Ban edited comment on JGRP-1876 at 1/13/15 2:07 AM:
---------------------------------------------------------

Another (perhaps different) issue brought up by Dan:
I might have brought this up before, but I'm seeing some weird (and quite rare) stuff
with MERGE3 and multiple partitions merging at the same time.

I have a cluster with three nodes: S, T, and U. I split it in 3 partitions with DISCARD,
then I disable DISCARD on a single node (U, which will be the merge coordinator when the
partitions merge back). Then I create another node V and make it join U's cluster.

After this, I disable DISCARD on S and T, and I wait for the partitions to join.
Sometimes, however, the merge doesn't go as planned, and U is excluded from the
merge:

{noformat}
11:13:06,230 DEBUG (Incoming-1,U:) [GMS] U: installing view [U|4] (2) [U, V]
11:13:06,251 DEBUG (testng-ClusterTopologyManagerTest:) [GMS] V: installing view [U|4] (2) [U, V]
11:13:06,264 DEBUG (testng-ClusterTopologyManagerTest:) [ClusterTopologyManagerTest] Merging the cluster partitions
11:13:06,271 TRACE (ViewHandler,U:) [GMS] U: got all ACKs (1) from joiners for view [U|4]
11:13:12,721 TRACE (Timer-2,T:) [MERGE3] discovery protocol returned 3 responses: 3 rsps (2 coords) [done]
11:13:13,059 DEBUG (Timer-3,T:) [MERGE3] I (T) will be the merge leader
11:13:13,059 TRACE (Timer-3,T:) [MERGE3] merge participants are [T, V, S]
11:13:13,101 DEBUG (ViewHandler,T:) [Merger] T: I will be the leader. Starting the merge task for 4 coords
11:13:13,102 DEBUG (MergeTask,T:) [Merger] T: merge task T::4 started with 3 coords
11:13:13,102 TRACE (INT-2,T:) [GMS] T: got merge response from T, merge_id=T::4, merge data is sender=T, view=[T|3] (1) [T], digest=T: [1 (1)]
11:13:13,143 TRACE (INT-1,T:) [GMS] T: got merge response from S, merge_id=T::4, merge data is sender=S, view=[S|4] (1) [S], digest=S: [15 (15)]
11:13:13,143 TRACE (INT-2,T:) [GMS] T: got merge response from V, merge_id=T::4, merge data is sender=V, view=[U|4] (1) [V], digest=V: [0 (0)]
11:13:13,143 DEBUG (MergeTask,T:) [Merger] T: installing merge view [T|5] (3 members) in 3 coords
11:13:13,143 DEBUG (MergeTask,T:) [Merger] T: merge T::4 took 41 ms
11:13:13,143 TRACE (Incoming-1,T:) [GMS] T: mcasting view MergeView::[T|5] (3) [T, S, V], 2 subgroups: [S|4] (1) [S], [T|3] (1) [T] (3 mbrs)
{noformat}

I'm not sure why, but V suddenly reports being the only member of view U|4, and the
reported number of coords is 2, then 4, and then 3. Could you check whether something
is wrong with the logging?

Ideally I'd like U to always be part of the merge, but I'm not sure if it's
missing because Discovery.findMembers is called with async = true or because there's a
problem with our TEST_PING protocol implementation. The next best thing would be to
exclude V from the merge, since it's not really a coordinator.
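
To illustrate the last point, here is a minimal sketch of the check that would exclude V. It assumes the usual JGroups convention that the coordinator is the first member of a view, and it uses the view V installed in the log above ([U|4] (2) [U, V]). The class and method names (CoordinatorCheck, realCoordinators) are hypothetical and are not part of JGroups or MERGE3.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoordinatorCheck {
    /**
     * Hypothetical helper, not JGroups code: keeps only the responders that
     * are the first member (i.e. the coordinator) of the view they reported.
     * Key = responder, value = ordered member list of the responder's view.
     */
    static List<String> realCoordinators(Map<String, List<String>> responses) {
        List<String> coords = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : responses.entrySet()) {
            List<String> members = e.getValue();
            if (!members.isEmpty() && members.get(0).equals(e.getKey()))
                coords.add(e.getKey());
        }
        return coords;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rsps = new LinkedHashMap<>();
        rsps.put("T", List.of("T"));       // view [T|3]
        rsps.put("S", List.of("S"));       // view [S|4]
        rsps.put("V", List.of("U", "V"));  // view [U|4], coordinator is U
        System.out.println(realCoordinators(rsps)); // prints [T, S]
    }
}
```

With this filter, V would not count as a merge participant, because U is the first member of its view. Whether U itself gets discovered is a separate question that this sketch does not address.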

I haven't been able to reproduce this even when I re-enabled DISCARD on U before
disabling it on S and T, so I've given up on obtaining a reliable reproducer for now.
I have attached the full log, in case there's something else there that could help.

...
 MERGE3 : Strange number and content of subgroups
 ------------------------------------------------

                 Key: JGRP-1876
                 URL: https://issues.jboss.org/browse/JGRP-1876
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.4.2
            Reporter: Karim AMMOUS
            Assignee: Bela Ban
             Fix For: 3.5.1, 3.6, 3.6.2

         Attachments: 4Subgroups.zip, DkeJgrpAddress.java, MergeTest4.java,
MergeViewWith210Subgroups.log, SplitMergeTest.java, views.txt

 Using JGroups 3.4.2, a split occurred and the merge was processed successfully, but the number
of subgroups is wrong (210 instead of 2).
 The final MergeView is correct and contains 210 members.
 Here is an extract of the subviews: 
 {code}
 INFO | Incoming-18,cluster,term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F] | [MyMembershipListener.java:126] | (middleware) | MergeView view ID = [serv-ZM2BU35940-58033:vt-14:192.168.55.55:1:CL(GROUP01)[F]|172]
 210 subgroups 
 [....
 [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [term-ETJ104215245-11092:host:192.168.56.72:1:CL(GROUP01)[F]]
 [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [serv-ZM2BU38960-6907:asb:192.168.55.52:1:CL(GROUP01)[F]]
 [term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F]|171] (1) [term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F]]
 [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [serv-ZM2BU47533-55240:vt-14:192.168.55.57:1:CL(GROUP01)[F]]
 [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [serv-ZM2BU35943-49435:asb:192.168.55.51:1:CL(GROUP01)[F]]
 ....]
 {code}
 I wasn't able to reproduce this with a simple program, but I observed that the merge was
preceded by an ifdown/ifup on host 192.168.56.6. That member lost all other members, but
it was still present in their views.
 Example:  
 {code}
 {A, B, C} => {A, B, C} and {C} => {A, B, C}
 {code} 
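
 For reference, a minimal sketch of the grouping I would expect: in the extract above, the singleton subviews carry only two distinct view IDs (|170 and |171), so grouping members by view ID would give 2 subgroups. The class name, the method name and the shortened labels (h16|170, h6|171, m72, ...) are placeholders for illustration only; this is not how JGroups builds a MergeView.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubgroupGrouping {
    /**
     * Hypothetical helper: groups members by the view ID they reported,
     * preserving first-seen order. Each entry is {viewId, member}.
     */
    static Map<String, List<String>> groupByViewId(List<String[]> entries) {
        Map<String, List<String>> subgroups = new LinkedHashMap<>();
        for (String[] e : entries)
            subgroups.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        return subgroups;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the five subviews shown in the extract.
        List<String[]> entries = List.of(
            new String[] {"h16|170", "m72"},
            new String[] {"h16|170", "m52"},
            new String[] {"h6|171",  "h6"},
            new String[] {"h16|170", "m57"},
            new String[] {"h16|170", "m51"});
        Map<String, List<String>> subgroups = groupByViewId(entries);
        System.out.println(subgroups.size());         // prints 2
        System.out.println(subgroups.get("h16|170")); // prints [m72, m52, m57, m51]
    }
}
```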

--
This message was sent by Atlassian JIRA
(v6.3.11#6341)
