[jboss-jira] [JBoss JIRA] (WFCORE-3302) Intermittent protocol and controller module unit test failures since move to JBoss Remoting 5

Fri Mar 16 18:26:00 EDT 2018

    [ https://issues.jboss.org/browse/WFCORE-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547469#comment-13547469 ] 

David Lloyd commented on WFCORE-3302:
-------------------------------------

Here's the update.

(Very) long story made (very) short, this may be due to a known NIO bug.  If a server is created and asynchronously closed in very rapid succession, NIO may inadvertently call bind() on a completely different socket, potentially leaving a zombie occupying the port.  This is the first plausible explanation that I've been able to discover for this behavior, or for why the workaround in XNIO (where we cancel the selection key from every selector before returning from close) makes the race go away.

I think it's safe to call this "fixed" since we haven't seen it again, and our observations and the known facts seem to fit together.  But it's not really "fixed", just "worked around" (until Java 11 where a proper fix is present).
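
For readers unfamiliar with the deferred close, here is a minimal single-threaded sketch in plain NIO. It is not the XNIO code: in XNIO the selectors belong to worker threads, so the real workaround is necessarily more involved. The class and method names are hypothetical. It shows the general idea only: a registered ServerSocketChannel may not release its socket on close() until the selector has processed the cancelled key, so the sketch forces that step before rebinding the same port.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical demo class; an illustration only, not XNIO or JBoss Remoting code.
class DeferredCloseSketch {

    /**
     * Binds a server channel to an ephemeral loopback port, registers it with a
     * selector, closes it, flushes the selector's cancelled keys, and then binds
     * a second channel to the same port. Returns the port that was rebound.
     */
    static int closeFlushAndRebind() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (Selector selector = Selector.open()) {
            ServerSocketChannel first = ServerSocketChannel.open();
            first.configureBlocking(false);
            first.bind(new InetSocketAddress(loopback, 0));
            int port = ((InetSocketAddress) first.getLocalAddress()).getPort();
            SelectionKey key = first.register(selector, SelectionKey.OP_ACCEPT);

            // close() cancels the channel's keys, but depending on platform and
            // JDK version the socket itself may stay open until the selector
            // deregisters the key.
            first.close();
            key.cancel();          // already cancelled by close(); harmless to repeat
            selector.selectNow();  // a selection operation processes the cancelled-key set

            // With the key deregistered, the port should be free to bind again.
            try (ServerSocketChannel second = ServerSocketChannel.open()) {
                second.bind(new InetSocketAddress(loopback, port));
                return port;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("rebound port " + closeFlushAndRebind());
    }
}
```

In the failing tests the selection step runs on an XNIO accept thread that the fixture does not control, which is where the race comes from; the sketch sidesteps that by driving the selector from the calling thread.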

> Intermittent protocol and controller module unit test failures since move to JBoss Remoting 5
> ---------------------------------------------------------------------------------------------
>
>                 Key: WFCORE-3302
>                 URL: https://issues.jboss.org/browse/WFCORE-3302
>             Project: WildFly Core
>          Issue Type: Bug
>          Components: Domain Management, Test Suite
>            Reporter: Brian Stansberry
>
> This bug is about problems in WF Core management tests. I believe it exposes a flaw in how remoting handles server sockets, but AFAIK there is no impact on WF Core remoting server sockets.
> Since the move to JBoss Remoting 5 we've seen intermittent failures in the protocol and controller module testsuites involving the tests that use their respective copies of the ChannelServer + RemoteChannelPairSetup test fixture. These tests all do a setup and teardown of the fixture for each test method (i.e. @Before and @After). The symptom is that a test fails while creating a remoting server, with an exception indicating that the server from a previous test hasn't completely shut down yet:
> {code}
> java.lang.RuntimeException: java.net.BindException: Address already in use: bind
>     at sun.nio.ch.Net.bind0(Native Method)
>     at sun.nio.ch.Net.bind(Net.java:433)
>     at sun.nio.ch.Net.bind(Net.java:425)
>     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
>     at org.xnio.nio.NioXnioWorker.createTcpConnectionServer(NioXnioWorker.java:181)
>     at org.xnio.XnioWorker.createStreamConnectionServer(XnioWorker.java:282)
>     at org.jboss.remoting3.remote.RemoteConnectionProvider$ProviderInterface.createServer(RemoteConnectionProvider.java:372)
>     at org.jboss.as.controller.support.ChannelServer.create(ChannelServer.java:92)
>     at org.jboss.as.controller.support.RemoteChannelPairSetup.setupRemoting(RemoteChannelPairSetup.java:88)
>     at org.jboss.as.controller.ModelControllerClientTestCase.setupTestClient(ModelControllerClientTestCase.java:94)
>     at org.jboss.as.controller.ModelControllerClientTestCase.testCloseInputStreamEntry(ModelControllerClientTestCase.java:346)
> {code}
> These failures have been mildly annoying on ci.wildfly.org, but now that the same code is being run on other test machines, e.g. brontes used for EAP testing, they are completely intolerable, affecting a high percentage of CI runs for pull requests.
> I believe the issue arises from changes to these fixtures that came in as part of the Remoting 5 upgrade such that a remoting Endpoint is not being created and shut down for each test method. This causes a problem because the AcceptingChannel<StreamConnection> created by Endpoint.getConnectionProviderInterface(...).createServer(...) *does not* synchronously close down the underlying socket as part of a call to its close() method.
> The socket is not closed synchronously because the ServerSocketChannel impl of close() does not close the socket if there are any registered keys.  Debugging shows the socket is not closed until this stack happens:
> {code}
> "XNIO-1 Accept at 1562" daemon prio=5 tid=0xf nid=NA runnable
>   java.lang.Thread.State: RUNNABLE
> 	  at sun.nio.ch.ServerSocketChannelImpl.kill(ServerSocketChannelImpl.java:307)
> 	  - locked <0xc0d> (a java.lang.Object)
> 	  at sun.nio.ch.KQueueSelectorImpl.implDereg(KQueueSelectorImpl.java:229)
> 	  at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
> 	  - locked <0xc38> (a java.util.HashSet)
> 	  at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:107)
> 	  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
> 	  - locked <0xc2b> (a sun.nio.ch.KQueueSelectorImpl)
> 	  - locked <0xc39> (a java.util.Collections$UnmodifiableSet)
> 	  - locked <0xc3a> (a sun.nio.ch.Util$2)
> 	  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
> 	  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
> 	  at org.xnio.nio.WorkerThread.run(WorkerThread.java:519)
> {code}
> That thread is not under the control of the test fixture, which means there's a race between it closing the socket and the test moving on to the next setup, where it tries to open the socket.
> I think the only solution for this is to bring the endpoint lifecycle back under the control of the test fixture such that the fixture knows everything is shut down. I don't see anything else the test can block on to ensure the server socket is closed.
> I think this would be a bug for any use of remoting where a server may quickly be shut down and then recreated.
