Tomasz Adamski commented on WFLY-13956:
---------------------------------------
[~cfang] I think I was able to reproduce the TwoConnectorsEJBFailoverTestCase hang locally,
though it takes many repetitions before the test finally hangs. What is more, I wasn't
able to reproduce it with the EJBCLIENT-356 commit reverted, but was able to reproduce it
with EJBCLIENT-376 reverted, which would confirm our initial suspicion that EJBCLIENT-356
is the culprit.
I was able to attach a debugger to the Maven thread and saw that it is stuck waiting on
awaitResponse, but I haven't yet found out what happened to the two nodes that
led to this situation. I suspect there is a discovery bug in the handling of
starting/stopping nodes after the introduction of the eagerNodes list
to RemotingEJBDiscoveryProvider, but I don't yet know the exact failure
scenario.
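As a possible complement to attaching a debugger, the check for a thread parked in awaitResponse can be sketched in plain Java. This is only a sketch under assumptions: the class name StuckThreadFinder and the method threadsIn are hypothetical, the match is on the bare method name (the comment above does not name the declaring class), and the code only sees threads of the JVM it runs in, so it would have to be called from inside the hanging test JVM, for example from a watchdog thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WildFly or jboss-ejb-client.
final class StuckThreadFinder {

    // Returns the names of live threads in this JVM whose current stack
    // contains at least one frame with the given method name.
    static List<String> threadsIn(String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getMethodName().equals(methodName)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }
}
```

Calling StuckThreadFinder.threadsIn("awaitResponse") periodically and logging a non-empty result would show which thread is blocked, though not why the response never arrives.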
TwoConnectorsEJBFailoverTestCase sometimes hangs when running with
jboss-ejb-client 4.0.35
------------------------------------------------------------------------------------------
Key: WFLY-13956
URL: https://issues.redhat.com/browse/WFLY-13956
Project: WildFly
Issue Type: Bug
Components: EJB
Affects Versions: 21.0.0.Beta1
Reporter: Cheng Fang
Assignee: Cheng Fang
Priority: Major
Attachments: Screen Shot 2020-10-12 at 9.57.36 AM.png
When upgrading jboss-ejb-client to version 4.0.35.Final, we noticed some WildFly CI jobs
sometimes hang when running {{TwoConnectorsEJBFailoverTestCase}}, causing 100+ (e.g., 466)
tests to fail:
{code}
RemoteElytronSingleSignOnTestCase.testSessionTimeoutDestroysSSO org.jboss.as.test.clustering.cluster.sso.remote
RemoteElytronSingleSignOnTestCase.testFormAuthSingleSignOn org.jboss.as.test.clustering.cluster.sso.remote
RemoteElytronSingleSignOnTestCase.testNoAuthSingleSignOn org.jboss.as.test.clustering.cluster.sso.remote
RemoteSingleSignOnTestCase.testSessionTimeoutDestroysSSO org.jboss.as.test.clustering.cluster.sso.remote
RemoteSingleSignOnTestCase.testFormAuthSingleSignOn org.jboss.as.test.clustering.cluster.sso.remote
RemoteSingleSignOnTestCase.testNoAuthSingleSignOn org.jboss.as.test.clustering.cluster.sso.remote
CoarseHotRodPersistenceWebFailoverTestCase.testGracefulUndeployFailover org.jboss.as.test.clustering.cluster.web.remote
CoarseHotRodPersistenceWebFailoverTestCase.testNonPrimaryOwner org.jboss.as.test.clustering.cluster.web.remote
CoarseHotRodPersistenceWebFailoverTestCase.testGracefulSimpleFailover org.jboss.as.test.clustering.cluster.web.remote
CoarseHotRodSessionActivationTestCase.test org.jboss.as.test.clustering.cluster.web.remote
CoarseHotRodWebFailoverTestCase.testGracefulUndeployFailover org.jboss.as.test.clustering.cluster.web.remote
CoarseHotRodWebFailoverTestCase.testNonPrimaryOwner org.jboss.as.test.clustering.cluster.web.remote
CoarseHotRodWebFailoverTestCase.testGracefulSimpleFailover org.jboss.as.test.clustering.cluster.web.remote
CoarseTransactionalHotRodWebFailoverTestCase.testGracefulUndeployFailover org.jboss.as.test.clustering.cluster.web.remote
CoarseTransactionalHotRodWebFailoverTestCase.testNonPrimaryOwner org.jboss.as.test.clustering.cluster.web.remote
CoarseTransactionalHotRodWebFailoverTestCase.testGracefulSimpleFailover org.jboss.as.test.clustering.cluster.web.remote
FineHotRodPersistenceWebFailoverTestCase.testGracefulUndeployFailover org.jboss.as.test.clustering.cluster.web.remote
{code}
The error messages look like this:
{code}
org.jboss.arquillian.container.spi.client.container.LifecycleException: The port 9990 is
already in use. It means that either the server might be already running or there is
another process using port 9990.
Managed containers do not support connecting to running server instances due to the
possible harmful effect of connecting to the wrong server.
Please stop server (or another process) before running, change to another type of
container (e.g. remote) or use jboss.socket.binding.port-offset variable to change the
default port.
To disable this check and allow Arquillian to connect to a running server, set
allowConnectingToRunningServer to true in the container configuration
{code}
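The message above is reported when port 9990 is still bound at the time a managed container tries to start. One plausible reading, not confirmed in this report, is that a server left running by the hung test still holds the port. A minimal Java sketch for checking the port before a run follows; the class name PortCheck and the method isFree are hypothetical, and the check simply tries to bind a server socket, so it cannot tell which process owns the port.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper, not part of Arquillian or WildFly.
final class PortCheck {

    // Returns true if a server socket can be bound to the port right now.
    // The socket is closed again immediately by try-with-resources.
    static boolean isFree(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            // Bind failed, most likely because another process holds the port.
            return false;
        }
    }
}
```

PortCheck.isFree(9990) returning false before the clustering tests start would indicate that an earlier server instance was not shut down.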
In a [successful CI
job|https://ci.wildfly.org/buildConfiguration/WFPR/226438?],
{{TwoConnectorsEJBFailoverTestCase}} runs 2 tests:
{{testEJBClientUsingLegacyRemotingProtocol}} and
{{testEJBClientUsingHttpUpgradeProtocol}}, each taking ~25 seconds to complete. See
attached screenshot.
But in a [failed CI
job|https://ci.wildfly.org/buildConfiguration/WF_PullRequest_LinuxJdk11/2...],
filtering by {{TwoConnectorsEJBFailoverTestCase}} returns no results, which means the test
hangs at {{testEJBClientUsingLegacyRemotingProtocol}}, so the test runner was never
able to collect a result from either test.