[jboss-jira] [JBoss JIRA] (JGRP-2435) ClientGmsImpl ignores newer view during join
Dan Berindei (Jira)
issues at jboss.org
Wed Jan 15 09:19:30 EST 2020
[ https://issues.redhat.com/browse/JGRP-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949212#comment-13949212 ]
Dan Berindei commented on JGRP-2435:
------------------------------------
I was able to reproduce the issue in {{org.jgroups.tests.ConcurrentStartupTest.testConcurrentJoinWithLOCAL_PING}}.
I added {{@Test(invocationCount = 100)}} and an {{UpHandler}} that pretends to do some work on the first view:
{code:java}
channel.setUpHandler(new UpHandler() {
    boolean hasView = false;

    @Override
    public Object up(Event evt) {
        // On the first view only, simulate some application work before returning
        if (evt.getType() == Event.VIEW_CHANGE && !hasView) {
            hasView = true;
            try {
                Thread.sleep(ThreadLocalRandom.current().nextInt(50));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    @Override
    public Object up(Message msg) {
        return null;
    }
});
{code}
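The random sleep delays the thread that handles the first view on the joiner, which (per the description below) widens the window in which the coordinator's next {{VIEW}} arrives while GMS is still running {{ClientGmsImpl}}.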
> ClientGmsImpl ignores newer view during join
> --------------------------------------------
>
> Key: JGRP-2435
> URL: https://issues.redhat.com/browse/JGRP-2435
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 4.1.9
> Reporter: Dan Berindei
> Assignee: Bela Ban
> Priority: Major
> Fix For: 5.0, 4.2.0
>
>
> We have random failures in a test that starts 4 nodes in parallel ({{org.infinispan.InitialClusterSizeTest.testInitialClusterSize}}). I get at least one failure if I add {{@Test(invocationCount=100)}}, but I did not get any failure when I did the same with {{org.jgroups.tests.ConcurrentStartupTest.testConcurrentJoinWithLOCAL_PING()}}, perhaps because the Infinispan test sends some additional messages, which sometimes changes how the incoming messages are processed.
> The problem seems to be that, after receiving a {{JOIN_RSP}} and installing the first view, the GMS implementation is still {{ClientGmsImpl}} when the second view arrives, so that view is ignored. Because this second view already contains all 4 members, no further {{VIEW}} message is sent and the test just times out.
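> A minimal sketch of the race, using made-up class and method names rather than JGroups API: the joiner only leaves the "client" role after it has finished handling the {{JOIN_RSP}}, so a {{VIEW}} that is delivered on another thread before that point is dropped:
> {code:java}
> import java.util.List;
>
> // Hypothetical model of the join race; the names are illustrative, not JGroups internals.
> class Joiner {
>     enum Role { CLIENT, PARTICIPANT }
>
>     volatile Role role = Role.CLIENT;
>     volatile List<String> members = List.of();
>
>     // Thread 1: handles the JOIN_RSP, installs the first view, then switches roles.
>     void onJoinResponse(List<String> firstView) {
>         members = firstView;        // install [Test-NodeB|1] (2 members)
>         // ... application work on the first view happens here (the UpHandler sleep above) ...
>         role = Role.PARTICIPANT;    // the role switch happens only after that work
>     }
>
>     // Thread 2: handles a VIEW broadcast by the coordinator.
>     void onView(List<String> newView) {
>         if (role == Role.CLIENT)
>             return;                 // [Test-NodeB|2] (4 members) is silently dropped
>         members = newView;
>     }
> }
> {code}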
> I added some logs in TP and GMS for debugging, and this is what I see:
> {noformat}
> 10:37:42,896 TRACE (ucast-receiver-2,Test-NodeC:[]) [UDP] Received oob=false internal=true message GMS: GmsHeader[JOIN_RSP], UNICAST3: DATA, seqno=1, first, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,898 TRACE (mcast-receiver-3,Test-NodeC:[]) [UDP] Received oob=false internal=false message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,906 TRACE (jgroups-4,Test-NodeC:[]) [UDP] Test-NodeC: received [Test-NodeB to Test-NodeC, 61 bytes, flags=INTERNAL], headers are GMS: GmsHeader[JOIN_RSP], UNICAST3: DATA, seqno=1, first, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,906 TRACE (jgroups-4,Test-NodeC:[]) [GMS] Handling message GMS: GmsHeader[JOIN_RSP], UNICAST3: DATA, seqno=1, first, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,907 DEBUG (ForkThread-1,InitialClusterSizeTest:[]) [GMS] Test-NodeC: installing view [Test-NodeB|1] (2) [Test-NodeB, Test-NodeC]
> 10:37:42,947 TRACE (mcast-receiver-3,Test-NodeC:[]) [UDP] Received oob=false internal=false message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=2], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [UDP] Test-NodeC: received message batch of from Test-NodeB: dest=null, sender=Test-NodeB
> 1:
> #1: GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [GMS] Handling message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [UDP] Test-NodeC: received message batch of from Test-NodeB: dest=null, sender=Test-NodeB
> 1:
> #1: GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=2], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [GMS] Handling message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=2], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> ### GmsImpl.handleViewChange() {}
> 10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [GMS] GmsImpl ignoring view update [Test-NodeB|2] (4) [Test-NodeB, Test-NodeC, Test-NodeA, Test-NodeD]
> 10:37:42,950 TRACE (jgroups-5,Test-NodeC:[]) [UDP] Test-NodeC: sending msg to Test-NodeB, src=Test-NodeC, headers are GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=3, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> ### The ack for the JOIN_RSP is only sent here
> 10:37:43,034 TRACE (ForkThread-1,InitialClusterSizeTest:[]) [UDP] Test-NodeC: sending msg to Test-NodeB, src=Test-NodeC, headers are GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=4, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
> 10:37:43,034 DEBUG (ForkThread-1,InitialClusterSizeTest:[]) [JGroupsTransport] Waiting for 4 nodes, current view has 2
> {noformat}
> To help debugging, {{TP.passBatchUp}} should really log the headers of the messages in the batch, and {{GMS}} and the {{GmsImpl}} subclasses should log at least a DEBUG message every time they ignore a view (even when {{log_view_warnings==false}}). I also suggest removing the default implementations from {{GmsImpl}} and updating the {{ClientGmsImpl}} javadoc, because it talks about handling {{ViewChange}} instead of {{JoinResponse}}.
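> As an illustration of the second point, a rough sketch of what the unconditional DEBUG logging could look like (assuming the {{GmsImpl}} subclass has a {{log}} field and a {{handleViewChange(View, Digest)}} method; the exact names and signatures are assumptions):
> {code:java}
> // Hypothetical override in ClientGmsImpl; the signature and the log field are assumptions.
> @Override
> public void handleViewChange(View new_view, Digest digest) {
>     // Record the dropped view even when log_view_warnings is false,
>     // so a trace like the one above shows why the view was never installed.
>     log.debug("still joining, ignoring view %s", new_view);
> }
> {code}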