[
https://issues.jboss.org/browse/JGRP-1876?page=com.atlassian.jira.plugin....
]
Karim AMMOUS commented on JGRP-1876:
------------------------------------
I'm trying to address the original problem for which I created this issue.
*Issue description*
*T-0* : C is isolated from A and B, so two views are installed: {noformat}[A|1](2){A,B}{noformat}
on (A,B) and {noformat}[C|0](1){C}{noformat} on (C).
*T-1* : Stop isolation of C
*T-2* : C detects inconsistent views {noformat}([A|1] vs [C|0]){noformat} based only on
the MergeInfo coming from B and from itself (C).
A's MergeInfo is missing, even though A is a coord, because it was sent while C was
still isolated.
*T-3* : C decides to install the new view {noformat}[C|A](2){C,B}{noformat}. All members
install C's view except A, since A isn't in that view.
{noformat}
A : [A|1](2){A,B}
B : [C|A](2){C,B}
C : [C|A](2){C,B}
{noformat}
*T-4* : A detects that the views are still inconsistent (asymmetric) and decides to merge again.
Resulting view: {noformat}[A|2](3){A,B,C}{noformat}
*Effects*
1/ When a network problem (isolation, flapping, etc.) occurs on one member, all other members
(like B) may lose their coord.
2/ Two coord changes occurred : A -> C -> A
3/ C installs a MergeView with strange subgroups: {noformat} [A|1](1)[B], [A|1](1)[D],
[A|1](1)[E], ..., [C|0](1)[C]. {noformat} (see enclosed file
"MergeViewWith210Subgroups.log").
*Fix*
Only the third effect has been fixed, thanks to your patch.
Please take a look at the [PR|https://github.com/belaban/JGroups/pull/200], which should fix
the root cause.
Principle: If the merge leader didn't receive a MergeInfo from a real coord (one that is
present in a viewId), it checks the "age" of the oldest viewId received.
If that viewId is not really old (age < check_interval - min_interval), the merge
process is interrupted.
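The principle can be sketched in Java as follows. This is only an illustration of the rule as stated here, not the code from the PR: the class, method, and field names are hypothetical, "age" is assumed to mean the time elapsed since the viewId was received, and a "real coord" is assumed to be the creator named in a viewId. Only check_interval and min_interval come from the description above.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the age check; not taken from MERGE3 or from the PR. */
final class MergeAgeCheck {

    /** Stand-in for a received viewId: its creator (the coord) and when we received it. */
    static final class ViewIdInfo {
        final String creator;
        final long receivedAtMillis;

        ViewIdInfo(String creator, long receivedAtMillis) {
            this.creator = creator;
            this.receivedAtMillis = receivedAtMillis;
        }
    }

    /**
     * Returns true if the merge leader should interrupt the merge: some coord named
     * in a viewId did not send a MergeInfo, and the oldest viewId is still "young".
     */
    static boolean shouldInterruptMerge(List<ViewIdInfo> viewIds,
                                        Set<String> mergeInfoSenders,
                                        long nowMillis,
                                        long checkIntervalMillis,
                                        long minIntervalMillis) {
        // Is there a coord (creator of some viewId) from which no MergeInfo arrived?
        boolean coordMissing = false;
        long oldest = Long.MAX_VALUE;
        for (ViewIdInfo v : viewIds) {
            if (!mergeInfoSenders.contains(v.creator)) {
                coordMissing = true;
            }
            oldest = Math.min(oldest, v.receivedAtMillis);
        }
        if (!coordMissing || viewIds.isEmpty()) {
            return false; // every coord reported, so the merge can proceed
        }
        // Age of the oldest viewId received (assumption: measured from receipt time).
        long age = nowMillis - oldest;
        return age < checkIntervalMillis - minIntervalMillis;
    }
}
```

In the T-2 situation, C holds viewIds created by A and C but MergeInfo only from B and C, so the check would report a missing coord and, while the information is still recent, C would wait instead of installing a view without A.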
MERGE3 : Strange number and content of subgroups
------------------------------------------------
Key: JGRP-1876
URL: https://issues.jboss.org/browse/JGRP-1876
Project: JGroups
Issue Type: Bug
Affects Versions: 3.4.2
Reporter: Karim AMMOUS
Assignee: Bela Ban
Fix For: 3.5.1, 3.6, 3.6.2
Attachments: 4Subgroups.zip, ChannelCreator.java, DkeJgrpAddress.java,
JGRP-1876-1.pdf, karim-logs-files.zip, MergeTest4.java, MergeViewWith210Subgroups.log,
SplitMergeTest.java, views.txt
Using JGroups 3.4.2, a split occurred and a merge was processed successfully, but the number
of subgroups is wrong (210 instead of 2).
The final MergeView is correct and contains 210 members.
Here is an extract of subviews:
{code}
INFO | Incoming-18,cluster,term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F] |
[MyMembershipListener.java:126] | (middleware) | MergeView view ID =
[serv-ZM2BU35940-58033:vt-14:192.168.55.55:1:CL(GROUP01)[F]|172]
210 subgroups
[....
[term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1)
[term-ETJ104215245-11092:host:192.168.56.72:1:CL(GROUP01)[F]]
[term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1)
[serv-ZM2BU38960-6907:asb:192.168.55.52:1:CL(GROUP01)[F]]
[term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F]|171] (1)
[term-ETJ101697729-31726:host:192.168.56.6:1:CL(GROUP01)[F]]
[term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1)
[serv-ZM2BU47533-55240:vt-14:192.168.55.57:1:CL(GROUP01)[F]]
[term-ETJ100691812-36873:host:192.168.56.16:1:CL(GROUP01)[F]|170] (1)
[serv-ZM2BU35943-49435:asb:192.168.55.51:1:CL(GROUP01)[F]]
....]
{code}
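To make the symptom easier to read, the subviews in the extract can be grouped by view ID: many single-member subgroups share the same view ID. The Java sketch below shows that grouping on plain strings. It is a hypothetical helper for inspecting such a log, not part of JGroups, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical log-inspection helper; groups reported subgroups by their view ID. */
final class SubgroupCollapser {

    /** One reported subgroup: the view ID as printed in the log, plus its members. */
    static final class Subgroup {
        final String viewId;
        final List<String> members;

        Subgroup(String viewId, List<String> members) {
            this.viewId = viewId;
            this.members = members;
        }
    }

    /** Merges the member lists of all subgroups that carry the same view ID. */
    static Map<String, List<String>> collapseByViewId(List<Subgroup> subgroups) {
        // LinkedHashMap keeps the order in which view IDs first appear in the log.
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Subgroup s : subgroups) {
            merged.computeIfAbsent(s.viewId, k -> new ArrayList<>()).addAll(s.members);
        }
        return merged;
    }
}
```

Applied to the five subgroups shown in the extract, this grouping yields two view IDs (one ending in |170 and one ending in |171).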
I wasn't able to reproduce that with a simple program, but I observed that the merge was
preceded by an ifdown/ifup on host 192.168.56.6. That member lost all other members, but
it was still present in their views.
Example:
{code}
{A, B, C} => {A, B, C} and {C} => {A, B, C}
{code}