[jboss-jira] [JBoss JIRA] (JGRP-1876) MERGE3 : Strange number and content of subgroups
Bela Ban (JIRA)
issues at jboss.org
Tue Jan 13 02:06:49 EST 2015
[ https://issues.jboss.org/browse/JGRP-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032020#comment-13032020 ]
Bela Ban commented on JGRP-1876:
--------------------------------
Another (perhaps different) issue brought up by Dan:
I might have brought this up before, but I'm seeing some weird (and quite rare) stuff with MERGE3 and multiple partitions merging at the same time.
I have a cluster with three nodes: S, T, and U. I split it into three partitions with DISCARD, then disable DISCARD on a single node (U, which will be the merge coordinator when the partitions merge back). Then I create another node, V, and make it join U's cluster.
After this, I disable DISCARD on S and T and wait for the partitions to merge. Sometimes, however, the merge doesn't go as planned, and U is excluded from the merge:
11:13:06,230 DEBUG (Incoming-1,U:) [GMS] U: installing view [U|4] (2) [U, V]
11:13:06,251 DEBUG (testng-ClusterTopologyManagerTest:) [GMS] V: installing view [U|4] (2) [U, V]
11:13:06,264 DEBUG (testng-ClusterTopologyManagerTest:) [ClusterTopologyManagerTest] Merging the cluster partitions
11:13:06,271 TRACE (ViewHandler,U:) [GMS] U: got all ACKs (1) from joiners for view [U|4]
11:13:12,721 TRACE (Timer-2,T:) [MERGE3] discovery protocol returned 3 responses: 3 rsps (2 coords) [done]
11:13:13,059 DEBUG (Timer-3,T:) [MERGE3] I (T) will be the merge leader
11:13:13,059 TRACE (Timer-3,T:) [MERGE3] merge participants are [T, V, S]
11:13:13,101 DEBUG (ViewHandler,T:) [Merger] T: I will be the leader. Starting the merge task for 4 coords
11:13:13,102 DEBUG (MergeTask,T:) [Merger] T: merge task T::4 started with 3 coords
11:13:13,102 TRACE (INT-2,T:) [GMS] T: got merge response from T, merge_id=T::4, merge data is sender=T, view=[T|3] (1) [T], digest=T: [1 (1)]
11:13:13,143 TRACE (INT-1,T:) [GMS] T: got merge response from S, merge_id=T::4, merge data is sender=S, view=[S|4] (1) [S], digest=S: [15 (15)]
11:13:13,143 TRACE (INT-2,T:) [GMS] T: got merge response from V, merge_id=T::4, merge data is sender=V, view=[U|4] (1) [V], digest=V: [0 (0)]
11:13:13,143 DEBUG (MergeTask,T:) [Merger] T: installing merge view [T|5] (3 members) in 3 coords
11:13:13,143 DEBUG (MergeTask,T:) [Merger] T: merge T::4 took 41 ms
11:13:13,143 TRACE (Incoming-1,T:) [GMS] T: mcasting view MergeView::[T|5] (3) [T, S, V], 2 subgroups: [S|4] (1) [S], [T|3] (1) [T] (3 mbrs)
I'm not sure why, but V suddenly reports being the only member of view [U|4], and the reported number of coords goes from 2 to 4 and then to 3. Could you check whether something is wrong with the logging?
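The mismatch in the last log line can be made explicit with a small check: the merge view [T|5] lists T, S, and V, yet its two subgroups cover only S and T. The following is a minimal sketch using plain strings in place of JGroups addresses; the class and method names are hypothetical and not part of JGroups.

```java
import java.util.*;

public class MergeViewCheck {
    /** Returns the members of the merged view that appear in none of the subgroups. */
    static Set<String> uncoveredMembers(List<String> mergedMembers, List<List<String>> subgroups) {
        Set<String> uncovered = new LinkedHashSet<>(mergedMembers);
        for (List<String> subgroup : subgroups)
            uncovered.removeAll(subgroup);
        return uncovered;
    }

    public static void main(String[] args) {
        // Values copied from the log: MergeView::[T|5] (3) [T, S, V],
        // 2 subgroups: [S|4] (1) [S], [T|3] (1) [T]
        List<String> merged = Arrays.asList("T", "S", "V");
        List<List<String>> subgroups = Arrays.asList(Arrays.asList("S"), Arrays.asList("T"));
        System.out.println(uncoveredMembers(merged, subgroups)); // prints [V]
    }
}
```

A test could assert that this set is empty after every merge, which would flag the case above where V is in the merge view but in no subgroup.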
Ideally I'd like U to always be part of the merge, but I'm not sure if it's missing because Discovery.findMembers is called with async = true or because there's a problem with our TEST_PING protocol implementation. The next best thing would be to exclude V from the merge, since it's not really a coordinator.
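The "exclude V" idea could be sketched as a filter over the merge responses. This assumes that the creator in a view ID (the part before the `|`, e.g. U in [U|4]) identifies the coordinator, so a sender that did not create its own view is not a coordinator. The `Response` type and the `coordinators` method are hypothetical stand-ins for illustration, not JGroups classes.

```java
import java.util.*;

public class CoordinatorFilter {
    /** Hypothetical stand-in for a merge response: the sender and the creator of the sender's view. */
    static final class Response {
        final String sender;
        final String viewCreator;

        Response(String sender, String viewCreator) {
            this.sender = sender;
            this.viewCreator = viewCreator;
        }
    }

    /** Keeps only the senders that created their own view, i.e. that look like coordinators. */
    static List<String> coordinators(List<Response> responses) {
        List<String> result = new ArrayList<>();
        for (Response r : responses)
            if (r.sender.equals(r.viewCreator))
                result.add(r.sender);
        return result;
    }

    public static void main(String[] args) {
        // Senders and view IDs taken from the three merge responses in the log.
        List<Response> responses = Arrays.asList(
            new Response("T", "T"),   // sender=T, view=[T|3]
            new Response("S", "S"),   // sender=S, view=[S|4]
            new Response("V", "U"));  // sender=V, view=[U|4]
        System.out.println(coordinators(responses)); // prints [T, S]
    }
}
```

With the responses from the log, V is dropped because its view was created by U, which leaves the two coordinators T and S.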
I haven't been able to reproduce this even when I re-enabled DISCARD on U before disabling it on S and T, so I've given up on obtaining a reliable reproducer for now. I have attached the full log, in case there's something else there that could help.
> MERGE3 : Strange number and content of subgroups
> ------------------------------------------------
>
> Key: JGRP-1876
> URL: https://issues.jboss.org/browse/JGRP-1876
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.4.2
> Reporter: Karim AMMOUS
> Assignee: Bela Ban
> Fix For: 3.5.1, 3.6, 3.6.2
>
> Attachments: 4Subgroups.zip, DkeJgrpAddress.java, MergeTest4.java, MergeViewWith210Subgroups.log, SplitMergeTest.java, views.txt
>
>
> Using JGroups 3.4.2, a split occurred and a merge was processed successfully, but the number of subgroups is wrong (210 instead of 2).
> The final mergeView is correct and contains 210 members.
> Here is an extract of subviews:
> {code}
> INFO | Incoming-18,cluster,term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F] | [MyMembershipListener.java:126] | (middleware) | MergeView view ID = [serv-ZM2BU35940-58033:vt-14:192.168.55.55:1:CL(GROUP01)[F]|172]
> 210 subgroups
> [....
> [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [term-ETJ104215245-11092:host:192.168.56.72:1:CL(GROUP01)[F]]
> [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [serv-ZM2BU38960-6907:asb:192.168.55.52:1:CL(GROUP01)[F]]
> [term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F]|171] (1) [term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F]]
> [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [serv-ZM2BU47533-55240:vt-14:192.168.55.57:1:CL(GROUP01)[F]]
> [term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1) [serv-ZM2BU35943-49435:asb:192.168.55.51:1:CL(GROUP01)[F]]
> ....]
> {code}
> I wasn't able to reproduce this with a simple program, but I observed that the merge was preceded by an ifdown/ifup on host 192.168.56.6. That member lost all other members, but it was still present in their views.
> Example:
> {code}
> {A, B, C} => {A, B, C} and {C} => {A, B, C}
> {code}