[JBoss JIRA] (WFCORE-3302) Intermittent protocol and controller module unit test failures since move to JBoss Remoting 5

Tuesday, 19 September 2017

    [
https://issues.jboss.org/browse/WFCORE-3302?page=com.atlassian.jira.plugi...
] 

Brian Stansberry commented on WFCORE-3302:
------------------------------------------

I've submitted master/3.x/3.0.x PRs to work around this (WFCORE-3302) by closing the
endpoint. What's interesting is that simply doing this didn't solve the problem:

endpoint.close();

This, however, did:

endpoint.closeAsync();
endpoint.awaitClosed(); // with some try/catch boilerplate around it to handle InterruptedException

...
From a naive point of view it seems like those two should be
semantically equivalent. I've reported this on REM3-303 which is the main JIRA
tracking this problem. 
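
A minimal sketch of the try/catch boilerplate mentioned above. The AsyncCloseable interface and the FakeEndpoint class are hypothetical stand-ins so the example compiles on its own. Only the two method names, closeAsync() and awaitClosed(), come from the comment. How the real test fixture handles the interrupt is an assumption here.

```java
import java.util.concurrent.CountDownLatch;

public class EndpointCloseSketch {

    /** Hypothetical stand-in exposing only the two calls quoted in the comment. */
    interface AsyncCloseable {
        void closeAsync();
        void awaitClosed() throws InterruptedException;
    }

    /**
     * Requests an asynchronous close, then blocks until the close has completed.
     * Returns false if the wait was interrupted (assumed policy: restore the flag).
     */
    static boolean closeAndWait(AsyncCloseable endpoint) {
        endpoint.closeAsync();
        try {
            endpoint.awaitClosed();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Toy implementation that finishes closing on a background thread. */
    static final class FakeEndpoint implements AsyncCloseable {
        private final CountDownLatch closed = new CountDownLatch(1);

        @Override
        public void closeAsync() {
            Thread closer = new Thread(closed::countDown, "fake-endpoint-closer");
            closer.setDaemon(true);
            closer.start();
        }

        @Override
        public void awaitClosed() throws InterruptedException {
            closed.await();
        }

        boolean isClosed() {
            return closed.getCount() == 0;
        }
    }

    public static void main(String[] args) {
        FakeEndpoint endpoint = new FakeEndpoint();
        System.out.println("closed cleanly: " + closeAndWait(endpoint));
    }
}
```

In a test fixture the closeAndWait call would sit in the teardown method, so the next setup only starts once the wait has returned.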
...
 Intermittent protocol and controller module unit test failures since
move to JBoss Remoting 5

---------------------------------------------------------------------------------------------

                 Key: WFCORE-3302
                 URL: https://issues.jboss.org/browse/WFCORE-3302
             Project: WildFly Core
          Issue Type: Bug
          Components: Domain Management, Test Suite
            Reporter: Brian Stansberry
            Assignee: Brian Stansberry

 This bug is about problems in WF Core management tests. I believe it exposes a flaw in
how remoting handles server sockets, but AFAIK there is no impact on WF Core remoting
server sockets.
 Since the move to JBoss Remoting 5 we've seen intermittent failures in the protocol
and controller module testsuites involving the tests that use their respective copies of
the ChannelServer + RemoteChannelPairSetup test fixture. These tests all do a setup and
teardown of the fixture for each test method (i.e. @Before and @After). The failure is
that a test fails while creating a remoting server, with an error indicating that the
server from a previous test hasn't completely shut down yet:
 {code}
 java.lang.RuntimeException: java.net.BindException: Address already in use: bind
     at sun.nio.ch.Net.bind0(Native Method)
     at sun.nio.ch.Net.bind(Net.java:433)
     at sun.nio.ch.Net.bind(Net.java:425)
     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
     at org.xnio.nio.NioXnioWorker.createTcpConnectionServer(NioXnioWorker.java:181)
     at org.xnio.XnioWorker.createStreamConnectionServer(XnioWorker.java:282)
     at org.jboss.remoting3.remote.RemoteConnectionProvider$ProviderInterface.createServer(RemoteConnectionProvider.java:372)
     at org.jboss.as.controller.support.ChannelServer.create(ChannelServer.java:92)
     at org.jboss.as.controller.support.RemoteChannelPairSetup.setupRemoting(RemoteChannelPairSetup.java:88)
     at org.jboss.as.controller.ModelControllerClientTestCase.setupTestClient(ModelControllerClientTestCase.java:94)
     at org.jboss.as.controller.ModelControllerClientTestCase.testCloseInputStreamEntry(ModelControllerClientTestCase.java:346)
 {code}
 These failures have been mildly annoying on ci.wildfly.org, but now that the same code is
being run on other test machines, e.g. brontes used for EAP testing, they are completely
intolerable, affecting a high percentage of CI runs for pull requests.
 I believe the issue arises from changes to these fixtures that came in as part of the
Remoting 5 upgrade such that a remoting Endpoint is not being created and shut down for each
test method. This causes a problem because the AcceptingChannel<StreamConnection>
created by Endpoint.getConnectionProviderInterface(...).createServer(...) *does not*
synchronously close down the underlying socket as part of a call to its close() method.
 The socket is not closed synchronously because the ServerSocketChannel impl of close()
does not close the socket if there are any registered keys.  Debugging shows the socket is
not closed until this stack happens:
 {code}
 "XNIO-1 Accept@1562" daemon prio=5 tid=0xf nid=NA runnable
   java.lang.Thread.State: RUNNABLE
 	  at sun.nio.ch.ServerSocketChannelImpl.kill(ServerSocketChannelImpl.java:307)
 	  - locked <0xc0d> (a java.lang.Object)
 	  at sun.nio.ch.KQueueSelectorImpl.implDereg(KQueueSelectorImpl.java:229)
 	  at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
 	  - locked <0xc38> (a java.util.HashSet)
 	  at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:107)
 	  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
 	  - locked <0xc2b> (a sun.nio.ch.KQueueSelectorImpl)
 	  - locked <0xc39> (a java.util.Collections$UnmodifiableSet)
 	  - locked <0xc3a> (a sun.nio.ch.Util$2)
 	  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
 	  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
 	  at org.xnio.nio.WorkerThread.run(WorkerThread.java:519)
 {code}
 That thread is not under the control of the test fixture, which means there's a race
between it closing the socket and the test moving on to the next setup, where it tries to
open the socket.
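
 The deferred close described above can be sketched with plain java.nio, without remoting or XNIO. This is an illustration under assumptions, not the fixture's code: the class and method names are made up, and the main thread's selectNow() call stands in for the XNIO accept thread's select loop. Whether the first bind attempt fails depends on the platform and JDK, so the sketch only reports both results.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class DeferredServerSocketClose {

    /** Tries to bind a fresh server channel to the port; true if the bind succeeds. */
    static boolean canBind(int port) {
        try (ServerSocketChannel probe = ServerSocketChannel.open()) {
            probe.bind(new InetSocketAddress("127.0.0.1", port));
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Returns { bindable right after close(), bindable after the selector has run }. */
    static boolean[] demonstrate() {
        try (Selector selector = Selector.open()) {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
            int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            // close() cancels the key, but while the key is still registered the
            // underlying socket may stay open until the selector deregisters it.
            server.close();
            boolean rightAfterClose = canBind(port);

            // A select pass processes the cancelled key and releases the socket.
            selector.selectNow();
            boolean afterSelect = canBind(port);

            return new boolean[] { rightAfterClose, afterSelect };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demonstrate();
        System.out.println("bindable right after close(): " + result[0]);
        System.out.println("bindable after selectNow():   " + result[1]);
    }
}
```

 In the failing tests the select pass runs on a thread the fixture does not own, so nothing guarantees it has happened before the next bind.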
 I think the only solution for this is to bring the endpoint lifecycle back under the
control of the test fixture such that the fixture knows all is shut down. I don't see
anything else the test can block on to ensure the server socket is closed.
 I think this would be a bug for any use of remoting where a server may quickly be
shut down and then recreated.

--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
