[jboss-jira] [JBoss JIRA] (WFWIP-28) [Artemis 2.x upgrade] Unexpected crash of server in SOAK test

Mon Aug 6 11:08:00 EDT 2018

    [ https://issues.jboss.org/browse/WFWIP-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615279#comment-13615279 ] 

Francesco Nigro edited comment on WFWIP-28 at 8/6/18 11:07 AM:
---------------------------------------------------------------

Looking at https://mw-messaging-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP/job/early-testing-messaging-weekly-tests-soak/43/
most of the collected metrics are simply wrong (e.g. negative CPU utilisation and negative open file descriptor counts), probably because the JVM process was missing or killed: my suspicion (to be verified) is that the OOM killer acted.
I suppose that collecting an sosreport and using a static node, so that nothing related to the process gets cleaned up along with the VM, will help to find where the issue is: at the moment the collected logs alone aren't enough.
FYI, the mentioned job has this log, which shows the problem and when it happened:

{code:java}
05:20:49,073 pool-2-thread-2 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned 0.01946859311864822
05:20:49,085 pool-2-thread-2 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute SystemCpuLoad returned 0.1457247132429614
05:20:49,092 pool-1-thread-4 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned -1
05:20:49,100 pool-1-thread-3 WARN  [org.jboss.qa.resourcemonitor.ThreadMeasurement:43] Error reseting peak counter
05:20:49,102 pool-1-thread-4 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute SystemCpuLoad returned -1
05:20:49,145 Thread-4321 (ActiveMQ-client-global-threads) TRACE [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl:525] AMQ214026: Failure captured on connectionID=e7fb5df2, performing failover or reconnection now
ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119006: Channel disconnected]
	at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:353)
	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1050)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
05:20:49,146 Thread-4327 (ActiveMQ-client-global-threads) TRACE [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl:525] AMQ214026: Failure captured on connectionID=ca7b4f32, performing failover or reconnection now
ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119006: Channel disconnected]
	at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:353)
	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1050)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
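
To pin down when the monitored process disappeared in a long SOAK log, the excerpt above can be scanned for the first metric that comes back negative. This is only a minimal sketch: the function name `first_negative_metric` and the regular expression are hypothetical, written against the line layout shown in the excerpt, and it assumes that a negative value means the resource monitor could not read the attribute from the JVM.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches resource-monitor lines shaped like the ones in the excerpt, e.g.
# "05:20:49,092 pool-1-thread-4 DEBUG [...CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned -1"
# The layout is inferred from the log above; other log formats would need another pattern.
METRIC_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+\S+\s+DEBUG\s+\[[^\]]+\]\s+"
    r"getCpu attribute (?P<attr>\w+) returned (?P<value>\S+)\s*$"
)


def first_negative_metric(lines: Iterable[str]) -> Optional[Tuple[str, str, float]]:
    """Return (timestamp, attribute, value) for the first negative metric, or None."""
    for line in lines:
        match = METRIC_LINE.match(line)
        if match is None:
            continue  # stack traces, WARN/TRACE lines, etc.
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # value was not numeric; skip rather than guess
        if value < 0:
            return match.group("time"), match.group("attr"), value
    return None
```

Running it over the whole job log (for example `first_negative_metric(open(path))`, with `path` pointing at a downloaded copy) gives a timestamp that can then be compared against kernel messages on the node, if those are still available.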

was (Author: fnigro):
Looking at https://mw-messaging-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP/job/early-testing-messaging-weekly-tests-soak/43/
most of the collected metrics are simply wrong (CPU utilisation = -100, file descriptors = -1), probably because the JVM process was missing or killed: my suspicion (to be verified) is that the OOM killer acted.
I suppose that collecting an sosreport and using a static node, so that nothing related to the process gets cleaned up along with the VM, will help to find where the issue is: at the moment the collected logs alone aren't enough.
FYI, the mentioned job has this log, which shows the problem and when it happened:

{code:java}
05:20:49,073 pool-2-thread-2 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned 0.01946859311864822
05:20:49,085 pool-2-thread-2 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute SystemCpuLoad returned 0.1457247132429614
05:20:49,092 pool-1-thread-4 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned -1
05:20:49,100 pool-1-thread-3 WARN  [org.jboss.qa.resourcemonitor.ThreadMeasurement:43] Error reseting peak counter
05:20:49,102 pool-1-thread-4 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute SystemCpuLoad returned -1
05:20:49,145 Thread-4321 (ActiveMQ-client-global-threads) TRACE [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl:525] AMQ214026: Failure captured on connectionID=e7fb5df2, performing failover or reconnection now
ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119006: Channel disconnected]
	at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:353)
	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1050)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
05:20:49,146 Thread-4327 (ActiveMQ-client-global-threads) TRACE [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl:525] AMQ214026: Failure captured on connectionID=ca7b4f32, performing failover or reconnection now
ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119006: Channel disconnected]
	at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:353)
	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1050)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

> [Artemis 2.x upgrade] Unexpected crash of server in SOAK test
> -------------------------------------------------------------
>
>                 Key: WFWIP-28
>                 URL: https://issues.jboss.org/browse/WFWIP-28
>             Project: WildFly WIP
>          Issue Type: Bug
>          Components: Artemis
>            Reporter: Miroslav Novak
>            Assignee: Martyn Taylor
>            Priority: Blocker
>              Labels: feature-branch-blocker
>
> After ~13 hours there is an unexpected crash of one server in the SOAK test. There is no error/warning in the logs.
> Test Scenario:
> * Start 2 servers
> * Client sends messages to the input queue. Messages then go through:
> * One server to another through an MDB that reads and sends them from a remote container through the resource adapter
> * Messages are forwarded from one server to another over a JMS bridge and back over a Core bridge
> * Messages have JMSReplyTo defined with a temporary queue, which is filled with responses for the client
> * Messages are read from the destination with a stateless EJB and sent back to clients
> * Client reads the messages after they pass through all the soak modules.
> Pass Criteria: In the last step the receiver consumes all messages sent by the producer.
> Actual Result:
> After ~13 hours the 1st server suddenly crashes. There is no error/warning in the server logs.
> The issue was hit with Artemis 2.5.0 with https://github.com/jmesnil/wildfly/tree/WFLY-9407_upgrade_artemis_2.4.0_with_prefix (commit 51dd8102f103ccb0470a3cfc8713d3f9bdb1b65d)
