[jboss-jira] [JBoss JIRA] (WFWIP-28) [Artemis 2.x upgrade] Unexpected crash of server in SOAK test
Francesco Nigro (JIRA)
issues at jboss.org
Mon Aug 6 11:08:00 EDT 2018
[ https://issues.jboss.org/browse/WFWIP-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615279#comment-13615279 ]
Francesco Nigro edited comment on WFWIP-28 at 8/6/18 11:07 AM:
---------------------------------------------------------------
Looking at https://mw-messaging-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP/job/early-testing-messaging-weekly-tests-soak/43/
most of the collected metrics are simply wrong (e.g. negative CPU utilisation and negative open file descriptor counts), probably due to a missing/killed JVM process: my suspicion (to be verified) is that the OOM killer took action.
I suppose that collecting the sosreport and using a static node, so that nothing related to the process gets cleaned up with the VM, will help to find where the issue is: ATM the collected logs alone aren't enough.
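To make the OOM killer check concrete, here is a minimal sketch of how the kernel log (e.g. the one captured in a sosreport) could be scanned for killed processes. The class name `OomKillerScan`, the `pid:command` output format and the regex are my own assumptions: the kernel wording differs between versions, so the pattern would likely need adjusting against the real log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OomKillerScan {
    // Assumed wording: lines containing "Killed process <pid> (<command>)".
    // The exact kernel message varies by version, so treat this as a starting point.
    private static final Pattern KILLED =
            Pattern.compile("Killed process (\\d+) \\(([^)]+)\\)");

    /** Returns "pid:command" for every line that matches the assumed pattern. */
    public static List<String> findVictims(List<String> kernelLogLines) {
        List<String> victims = new ArrayList<>();
        for (String line : kernelLogLines) {
            Matcher m = KILLED.matcher(line);
            if (m.find()) {
                victims.add(m.group(1) + ":" + m.group(2));
            }
        }
        return victims;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OomKillerScan <kernel-log-file>");
            return;
        }
        // Read the saved kernel log and print one victim per line.
        for (String victim : findVictims(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(victim);
        }
    }
}
```

If the server's `java` PID shows up in the output at around 05:20, that would support the OOM killer theory; an empty result would not rule it out, since the pattern may simply not match this kernel's wording.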
FYI the mentioned job has this log, which shows the problem and when it happened:
{code:java}
05:20:49,073 pool-2-thread-2 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned 0.01946859311864822
05:20:49,085 pool-2-thread-2 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute SystemCpuLoad returned 0.1457247132429614
05:20:49,092 pool-1-thread-4 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute ProcessCpuLoad returned -1
05:20:49,100 pool-1-thread-3 WARN [org.jboss.qa.resourcemonitor.ThreadMeasurement:43] Error reseting peak counter
05:20:49,102 pool-1-thread-4 DEBUG [org.jboss.qa.resourcemonitor.CpuLoadMeasurement:70] getCpu attribute SystemCpuLoad returned -1
05:20:49,145 Thread-4321 (ActiveMQ-client-global-threads) TRACE [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl:525] AMQ214026: Failure captured on connectionID=e7fb5df2, performing failover or reconnection now
ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119006: Channel disconnected]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:353)
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1050)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
05:20:49,146 Thread-4327 (ActiveMQ-client-global-threads) TRACE [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl:525] AMQ214026: Failure captured on connectionID=ca7b4f32, performing failover or reconnection now
ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119006: Channel disconnected]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:353)
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1050)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
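As a side note on why the `-1` readings point at a missing process rather than real load: the JDK's `com.sun.management.OperatingSystemMXBean.getProcessCpuLoad()` is documented to return a value in [0.0, 1.0], or a negative value when the figure is not available. Below is a minimal sketch of a guard the monitor could apply before recording a sample. The class `CpuLoadCheck` and the method `isUsable` are hypothetical names, not part of the resource monitor, and the sketch reads the local JVM rather than a remote one over JMX.

```java
import java.lang.management.ManagementFactory;

public class CpuLoadCheck {
    /**
     * A CPU load sample is usable only if it lies in [0.0, 1.0].
     * Negative values (such as -1) mean "not available"; NaN is rejected too.
     */
    public static boolean isUsable(double load) {
        return load >= 0.0 && load <= 1.0;
    }

    public static void main(String[] args) {
        // Assumption: a HotSpot-style JVM where the platform bean implements
        // com.sun.management.OperatingSystemMXBean.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        double processLoad = os.getProcessCpuLoad();
        if (isUsable(processLoad)) {
            System.out.println("ProcessCpuLoad = " + processLoad);
        } else {
            // Record the gap instead of feeding a bogus value into the statistics.
            System.out.println("ProcessCpuLoad not available (raw value " + processLoad + ")");
        }
    }
}
```

With a guard like this, the first unavailable sample (05:20:49,092 in the log above) would mark the moment the process went away instead of skewing the reported CPU utilisation.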
was (Author: fnigro):
Looking at https://mw-messaging-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP/job/early-testing-messaging-weekly-tests-soak/43/
most of the collected metrics are simply wrong (CPU utilisation = -100, file descriptors = -1), probably due to a missing/killed JVM process: my suspicion (to be verified) is that the OOM killer took action.
I suppose that collecting the sosreport and using a static node, so that nothing related to the process gets cleaned up with the VM, will help to find where the issue is: ATM the collected logs alone aren't enough.
> [Artemis 2.x upgrade] Unexpected crash of server in SOAK test
> --------------------------------------------------------------
>
> Key: WFWIP-28
> URL: https://issues.jboss.org/browse/WFWIP-28
> Project: WildFly WIP
> Issue Type: Bug
> Components: Artemis
> Reporter: Miroslav Novak
> Assignee: Martyn Taylor
> Priority: Blocker
> Labels: feature-branch-blocker
>
> After ~13 hours there is an unexpected crash of one server in the SOAK test. There is no error/warning in the logs.
> Test Scenario:
> * Start 2 servers
> * Client sends messages to the input queue. Messages then go through:
> ** One server to another, through an MDB that reads them and sends them from a remote container through the resource adapter
> ** Messages are forwarded from one server to another over a JMS bridge and back over a Core bridge
> ** Messages have JMSReplyTo defined with a temporary queue, which is filled with responses for the client
> ** Messages are read from the destination with a stateless EJB and sent back to clients
> * Client reads the messages after they pass through all the soak modules.
> Pass Criteria: In the last step the receiver consumes all messages sent by the producer.
> Actual Result:
> After ~13 hours the 1st server suddenly crashes. There is no error/warning in the server logs.
> Issue was hit with Artemis 2.5.0 with https://github.com/jmesnil/wildfly/tree/WFLY-9407_upgrade_artemis_2.4.0_with_prefix (commit 51dd8102f103ccb0470a3cfc8713d3f9bdb1b65d)