[jboss-jira] [JBoss JIRA] (WFLY-10474) AMQ119012: Timed out waiting to receive initial broadcast from cluster

Erich Duda (JIRA) issues at jboss.org
Mon Jul 16 08:36:01 EDT 2018


     [ https://issues.jboss.org/browse/WFLY-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erich Duda updated WFLY-10474:
------------------------------
    Priority: Blocker  (was: Critical)


> AMQ119012: Timed out waiting to receive initial broadcast from cluster
> ----------------------------------------------------------------------
>
>                 Key: WFLY-10474
>                 URL: https://issues.jboss.org/browse/WFLY-10474
>             Project: WildFly
>          Issue Type: Bug
>          Components: JMS
>    Affects Versions: 13.0.0.Beta1
>            Reporter: Erich Duda
>            Assignee: Jeff Mesnil
>            Priority: Blocker
>              Labels: blocker-WF14
>
> In messaging HA scenarios, tests sometimes fail because of exception \[1\]. The exception is thrown from {{ServerLocatorImpl::createSessionFactory}}, which waits for the initial broadcast from the cluster; the broadcast is not received within the 10-second timeout. In the following scenario the issue causes split brain: both live and backup are active at the same time.
> *Scenario*
> * There are two WildFly/Artemis servers configured as a Live-Backup pair
> * The Live server is killed
> * The Backup server becomes the new Live server
> * The original Live server is restarted and failback is performed
> *Expectation:* Failback completes successfully. The original Live becomes Live again and the original Backup becomes Backup again.
> *Reality:* After the Live is restarted, it does not detect that another Live with the same nodeId is already active, so it activates (becomes Live) and both Live and Backup are active at the same time.
> *Technical details*
> I found out that the original Live server does not detect that its Backup is active, because {{SharedNothingLiveActivation::isNodeIdUsed}} returns {{false}}. I added more logging to this method and found that {{false}} is returned from line \[2\] after exception \[1\] is thrown.
> *Investigation report*
> I tried downgrading Artemis to version {{1.5.5.jbossorg-010}} but hit the same issue, so the issue is not caused by the recent Artemis upgrade.
> It is a regression against EAP 7.1.0.GA but not a regression against WildFly 12.
> We see the issue only with JGroups. I ran the test with Netty discovery and it passed 50 times in a row.
> \[1\]
> {code}
> ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ119012: Timed out waiting to receive initial broadcast from cluster]
>         at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:749) [artemis-core-client-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:627) [artemis-core-client-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connectNoWarnings(ServerLocatorImpl.java:633) [artemis-core-client-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation.isNodeIdUsed(SharedNothingLiveActivation.java:280) [artemis-server-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation.run(SharedNothingLiveActivation.java:90) [artemis-server-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.internalStart(ActiveMQServerImpl.java:539) [artemis-server-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.start(ActiveMQServerImpl.java:485) [artemis-server-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.apache.activemq.artemis.jms.server.impl.JMSServerManagerImpl.start(JMSServerManagerImpl.java:413) [artemis-jms-server-1.5.5.jbossorg-012.jar:1.5.5.jbossorg-SNAPSHOT]
>         at org.wildfly.extension.messaging.activemq.jms.JMSService.doStart(JMSService.java:205) [wildfly-messaging-activemq-13.0.0.Beta2-SNAPSHOT.jar:13.0.0.Beta2-SNAPSHOT]
>         at org.wildfly.extension.messaging.activemq.jms.JMSService.access$000(JMSService.java:64) [wildfly-messaging-activemq-13.0.0.Beta2-SNAPSHOT.jar:13.0.0.Beta2-SNAPSHOT]
>         at org.wildfly.extension.messaging.activemq.jms.JMSService$1.run(JMSService.java:99) [wildfly-messaging-activemq-13.0.0.Beta2-SNAPSHOT.jar:13.0.0.Beta2-SNAPSHOT]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_171]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_171]
>         at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35) [jboss-threads-2.3.2.Final.jar:2.3.2.Final]
>         at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985) [jboss-threads-2.3.2.Final.jar:2.3.2.Final]
>         at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487) [jboss-threads-2.3.2.Final.jar:2.3.2.Final]
>         at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1378) [jboss-threads-2.3.2.Final.jar:2.3.2.Final]
>         at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_171]
>         at org.jboss.threads.JBossThread.run(JBossThread.java:485) [jboss-threads-2.3.2.Final.jar:2.3.2.Final]
> {code}
> \[2\] https://github.com/rh-messaging/jboss-activemq-artemis/blob/465eb8733e465a949482cf618ad475b0a9e8afcd/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/SharedNothingLiveActivation.java#L281
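
The failure mode described under *Technical details* can be sketched as follows. This is a minimal illustration and not the Artemis source: {{NodeIdCheckSketch}}, {{ClusterProbe}} and {{restartedLiveActivates}} are hypothetical names, and the only behaviour taken from the report is that the node ID check returns {{false}} once the discovery timeout exception has been thrown.

```java
import java.util.concurrent.TimeoutException;

public class NodeIdCheckSketch {

    /** Hypothetical stand-in for cluster discovery; not an Artemis interface. */
    public interface ClusterProbe {
        // Returns true if another active server with the same node ID answers.
        boolean findActiveNodeWithSameId() throws TimeoutException;
    }

    /**
     * Illustrative check that treats "no answer in time" the same as
     * "no other server is using this node ID".
     */
    public static boolean isNodeIdUsed(ClusterProbe probe) {
        try {
            return probe.findActiveNodeWithSameId();
        } catch (TimeoutException e) {
            // No initial broadcast arrived before the timeout,
            // so the check reports the node ID as unused.
            return false;
        }
    }

    /** The restarted live server activates when the node ID is reported unused. */
    public static boolean restartedLiveActivates(ClusterProbe probe) {
        return !isNodeIdUsed(probe);
    }

    public static void main(String[] args) {
        // Case 1: the active backup is discovered in time.
        ClusterProbe backupSeen = () -> true;

        // Case 2: discovery times out although the backup is still active.
        ClusterProbe discoveryTimedOut = () -> {
            throw new TimeoutException(
                "Timed out waiting to receive initial broadcast from cluster");
        };

        System.out.println("backup seen, live activates: "
            + restartedLiveActivates(backupSeen));
        System.out.println("discovery timed out, live activates: "
            + restartedLiveActivates(discoveryTimedOut));
    }
}
```

In the second case the restarted server activates while the former backup is still active, which corresponds to the split brain reported above.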



--
This message was sent by Atlassian JIRA
(v7.5.0#75005)

