Dan Berindei commented on ISPN-9198:
------------------------------------
I'm still not sure what the modules have to do with anything, but it looks more and
more like an I/O issue on the edg-perfxx machines.
This is what the logs on slave0 (edg-perf01) say:
{noformat}
12:34:38,314 TRACE [org.jgroups.protocols.UDP] (jgroups-46,slave0) slave0: received [dst:
<null>, src: slave6 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL:
heartbeat, TP: [cluster_name=cluster]
12:34:48,497 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave0) haven't
received a heartbeat from slave6 for 10512 ms, adding it to suspect list
12:34:49,501 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave0) haven't
received a heartbeat from slave6 for 11512 ms, adding it to suspect list
12:34:52,789 TRACE [org.jgroups.protocols.UDP] (jgroups-35,slave0) slave0: received [dst:
<null>, src: slave6 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL:
heartbeat, TP: [cluster_name=cluster]
{noformat}
The logs on slave6 match: it doesn't send any heartbeat from second 38 to second 52,
and in fact it logs nothing at all between 12:34:39 and 12:34:52:
{noformat}
12:34:38,313 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg
to null, src=slave6, headers are FD_ALL: heartbeat, TP: [cluster_name=cluster]
12:34:39,770 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg
to slave7, src=slave6, headers are UNICAST3: ACK, seqno=4110, ts=1436, TP:
[cluster_name=cluster]
12:34:52,491 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg
to slave4, src=slave6, headers are UNICAST3: ACK, seqno=4061, ts=1437, TP:
[cluster_name=cluster]
12:34:52,760 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave7 for 13299 ms, adding it to suspect list
12:34:52,764 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave4 for 13299 ms, adding it to suspect list
12:34:52,765 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave2 for 13299 ms, adding it to suspect list
12:34:52,773 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave3 for 13299 ms, adding it to suspect list
12:34:52,773 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave1 for 13299 ms, adding it to suspect list
12:34:52,773 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave5 for 13299 ms, adding it to suspect list
12:34:52,773 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't
received a heartbeat from slave0 for 13299 ms, adding it to suspect list
12:34:52,774 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) slave6:
suspecting [slave7, slave4, slave2, slave3, slave1, slave5, slave0]
12:34:52,788 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg
to null, src=slave6, headers are FD_ALL: heartbeat, TP: [cluster_name=cluster]
{noformat}
This makes it look like the JVM was paused by a GC, but according to the GC log there was
no GC during that interval:
{noformat}
2018-05-29T12:34:39.865-0400: 80.271: Total time for which application threads were
stopped: 0.0117790 seconds, Stopping threads took: 0.0002650 seconds
2018-05-29T12:34:40.868-0400: 81.275: Total time for which application threads were
stopped: 0.0032410 seconds, Stopping threads took: 0.0002841 seconds
2018-05-29T12:34:41.872-0400: 82.279: Total time for which application threads were
stopped: 0.0038977 seconds, Stopping threads took: 0.0003439 seconds
2018-05-29T12:34:42.630-0400: 83.036: Total time for which application threads were
stopped: 0.0081630 seconds, Stopping threads took: 0.0035540 seconds
{noformat}
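To check the GC log against the stall, the safepoint entries can be totalled mechanically. This is only a minimal sketch: it assumes the "Total time for which application threads were stopped" format quoted above, and the helper name stopped_seconds is hypothetical.

```python
import re

# Matches the safepoint entries quoted above; \s+ tolerates the line wrap
# that the mail archive inserted before "stopped:".
STOPPED = re.compile(
    r"Total time for which application threads were\s+stopped: ([0-9.]+) seconds"
)

def stopped_seconds(gc_log):
    """Return every 'application threads were stopped' duration, in seconds."""
    return [float(m.group(1)) for m in STOPPED.finditer(gc_log)]

# The four entries from the slave6 GC log excerpt (12:34:39 to 12:34:42).
sample = """
2018-05-29T12:34:39.865-0400: 80.271: Total time for which application threads were stopped: 0.0117790 seconds, Stopping threads took: 0.0002650 seconds
2018-05-29T12:34:40.868-0400: 81.275: Total time for which application threads were stopped: 0.0032410 seconds, Stopping threads took: 0.0002841 seconds
2018-05-29T12:34:41.872-0400: 82.279: Total time for which application threads were stopped: 0.0038977 seconds, Stopping threads took: 0.0003439 seconds
2018-05-29T12:34:42.630-0400: 83.036: Total time for which application threads were stopped: 0.0081630 seconds, Stopping threads took: 0.0035540 seconds
"""

pauses = stopped_seconds(sample)
print(f"{len(pauses)} pauses, total {sum(pauses):.4f}s, longest {max(pauses):.4f}s")
```

For the excerpt above the pauses add up to a few hundredths of a second, nowhere near the multi-second silence seen in the JGroups logs.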
However, jmc shows one log write (I assume to the console, because there's no file
name) that takes almost 12s.
!Screenshot from 2018-05-30 13-21-32.png|thumbnail!
And it also looks like all the JGroups threads are blocked for the entire interval,
waiting for that one write to finish (see attached screenshots).
!Screenshot from 2018-05-30 13-25-09.png|thumbnail!
At the same time, it shows a lot of RejectedExecutionExceptions, so we know that the JVM
is still reading UDP packets; it just doesn't have any available threads to process
them.
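The silent interval in the slave6 log can be located by scanning for the largest jump between consecutive timestamps. A sketch, assuming the HH:MM:SS,mmm prefix used in the excerpts above; largest_gap is a hypothetical helper and the sample lines are the first four slave6 entries, abridged.

```python
import re
from datetime import datetime

# Log entries start with a timestamp such as "12:34:38,313 ".
TS = re.compile(r"^(\d{2}:\d{2}:\d{2},\d{3}) ")

def largest_gap(lines):
    """Return (gap_seconds, start, end) for the longest silence between entries.

    Continuation lines without a timestamp are skipped. Raises ValueError
    if fewer than two timestamped lines are present.
    """
    stamps = [
        datetime.strptime(m.group(1), "%H:%M:%S,%f")
        for m in map(TS.match, lines)
        if m
    ]
    start, end = max(zip(stamps, stamps[1:]), key=lambda p: p[1] - p[0])
    return (end - start).total_seconds(), start.time(), end.time()

slave6 = [
    "12:34:38,313 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg",
    "to null, src=slave6, headers are FD_ALL: heartbeat, TP: [cluster_name=cluster]",
    "12:34:39,770 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg",
    "12:34:52,491 TRACE [org.jgroups.protocols.UDP] (Timer runner-1,slave6) slave6: sending msg",
    "12:34:52,760 DEBUG [org.jgroups.protocols.FD_ALL] (Timer runner-1,slave6) haven't",
]

gap, start, end = largest_gap(slave6)
print(f"longest silence: {gap:.3f}s, from {start} to {end}")
```

Running the same scan over each node's full log would show whether the other edg-perfxx machines stall in the same way.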
Node X left the cluster - SuspectException: ISPN000400: Node X was suspected
----------------------------------------------------------------------------
Key: ISPN-9198
URL: https://issues.jboss.org/browse/ISPN-9198
Project: Infinispan
Issue Type: Bug
Reporter: Diego Lovison
Assignee: Dan Berindei
Attachments: Screenshot from 2018-05-30 13-21-32.png, Screenshot from 2018-05-30 13-25-09.png
After commit df9ffb5ba46752d2509aa3a08c59519469cc929a in Infinispan, the tests
regression-cs-hotrod-dist-reads and regression-cs-hotrod-repl-reads are failing.
I ran the same test 3 times with commit df9ffb5ba46752d2509aa3a08c59519469cc929a and
all 3 runs passed.
If we run "regression-cs-hotrod-dist-reads" with master, it fails because of:
{noformat}
11:18:04,874 INFO [org.radargun.RemoteMasterConnection] (sc-main) Message successfully
sent to the master
11:22:06,470 INFO [org.radargun.Slave] (sc-main) Stage 'BasicOperationsTest'
should not be executed
11:22:06,472 INFO [org.radargun.RemoteMasterConnection] (sc-main) Message successfully
sent to the master
11:22:34,182 INFO [org.infinispan.CLUSTER] (jgroups-112,slave0) ISPN000094:
Received new cluster view for channel cluster: [slave3|8] (7) [slave3, slave6, slave0,
slave5, slave4, slave2, slave1]
11:22:34,190 INFO [org.infinispan.CLUSTER] (jgroups-112,slave0) ISPN100001: Node
slave7 left the cluster
11:22:48,182 INFO [org.infinispan.CLUSTER] (jgroups-115,slave0) ISPN000094:
Received new cluster view for channel cluster: [slave3|9] (6) [slave3, slave6, slave0,
slave5, slave4, slave1]
11:22:48,191 INFO [org.infinispan.CLUSTER] (jgroups-115,slave0) ISPN100001: Node
slave2 left the cluster
11:23:09,176 INFO [org.infinispan.CLUSTER] (jgroups-111,slave0) ISPN000094:
Received new cluster view for channel cluster: [slave3|10] (5) [slave3, slave6, slave0,
slave5, slave1]
11:23:09,179 INFO [org.infinispan.CLUSTER] (jgroups-111,slave0) ISPN100001: Node
slave4 left the cluster
11:23:20,173 INFO [org.infinispan.CLUSTER] (jgroups-121,slave0) ISPN000094:
Received new cluster view for channel cluster: [slave6|11] (4) [slave6, slave0, slave5,
slave1]
11:23:20,178 INFO [org.infinispan.CLUSTER] (jgroups-121,slave0) ISPN100001: Node
slave3 left the cluster
11:23:20,199 WARN [org.infinispan.statetransfer.InboundTransferTask]
(stateTransferExecutor-thread--p5-t60) ISPN000210: Failed to request state of cache
memcachedCache from node slave3, segments {114 184 190-191}:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave3 was
suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave3 was
suspected
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:92)
at org.infinispan.remoting.rpc.RpcManagerImpl.blocking(RpcManagerImpl.java:261)
at
org.infinispan.statetransfer.InboundTransferTask.startTransfer(InboundTransferTask.java:134)
at
org.infinispan.statetransfer.InboundTransferTask.requestSegments(InboundTransferTask.java:113)
at
org.infinispan.statetransfer.StateConsumerImpl.lambda$addTransfer$7(StateConsumerImpl.java:1073)
at
org.infinispan.executors.LimitedExecutor.lambda$executeAsync$1(LimitedExecutor.java:130)
at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
... 3 more
Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node
slave3 was suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
... 3 more
[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave3 was
suspected]
11:23:23,916 WARN [org.jgroups.protocols.pbcast.NAKACK2] (jgroups-117,slave0)
JGRP000011: slave0: dropped message 319143 from non-member slave3 (view=[slave6|11] (4)
[slave6, slave0, slave5, slave1])
11:23:34,142 INFO [org.infinispan.CLUSTER] (jgroups-111,slave0) ISPN000094:
Received new cluster view for channel cluster: [slave6|12] (3) [slave6, slave0, slave5]
11:23:34,145 INFO [org.infinispan.CLUSTER] (jgroups-111,slave0) ISPN100001: Node
slave1 left the cluster
11:23:34,154 WARN [org.infinispan.statetransfer.InboundTransferTask]
(stateTransferExecutor-thread--p5-t61) ISPN000210: Failed to request state of cache
hotrodDist from node slave1, segments {59-60 71-74 78 81 146 180-181 185 192 217}:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:92)
at org.infinispan.remoting.rpc.RpcManagerImpl.blocking(RpcManagerImpl.java:261)
at
org.infinispan.statetransfer.InboundTransferTask.startTransfer(InboundTransferTask.java:134)
at
org.infinispan.statetransfer.InboundTransferTask.requestSegments(InboundTransferTask.java:113)
at
org.infinispan.statetransfer.StateConsumerImpl.lambda$addTransfer$7(StateConsumerImpl.java:1073)
at
org.infinispan.executors.LimitedExecutor.lambda$executeAsync$1(LimitedExecutor.java:130)
at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
... 3 more
Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node
slave1 was suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
... 3 more
[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected]
11:23:34,183 WARN [org.infinispan.statetransfer.InboundTransferTask]
(stateTransferExecutor-thread--p5-t58) ISPN000210: Failed to request state of cache rest
from node slave1, segments {59-60 71-74 78 81 146 180-181 185 192 217}:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:92)
at org.infinispan.remoting.rpc.RpcManagerImpl.blocking(RpcManagerImpl.java:261)
at
org.infinispan.statetransfer.InboundTransferTask.startTransfer(InboundTransferTask.java:134)
at
org.infinispan.statetransfer.InboundTransferTask.requestSegments(InboundTransferTask.java:113)
at
org.infinispan.statetransfer.StateConsumerImpl.lambda$addTransfer$7(StateConsumerImpl.java:1073)
at
org.infinispan.executors.LimitedExecutor.lambda$executeAsync$1(LimitedExecutor.java:130)
at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
... 3 more
Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node
slave1 was suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
... 3 more
[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected]
11:23:34,177 WARN [org.infinispan.statetransfer.InboundTransferTask]
(stateTransferExecutor-thread--p5-t54) ISPN000210: Failed to request state of cache
memcachedCache from node slave1, segments {59-60 71-74 78 81 146 180-181 185 192 217}:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:92)
at org.infinispan.remoting.rpc.RpcManagerImpl.blocking(RpcManagerImpl.java:261)
at
org.infinispan.statetransfer.InboundTransferTask.startTransfer(InboundTransferTask.java:134)
at
org.infinispan.statetransfer.InboundTransferTask.requestSegments(InboundTransferTask.java:113)
at
org.infinispan.statetransfer.StateConsumerImpl.lambda$addTransfer$7(StateConsumerImpl.java:1073)
at
org.infinispan.executors.LimitedExecutor.lambda$executeAsync$1(LimitedExecutor.java:130)
at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
... 3 more
Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node
slave1 was suspected
at
org.infinispan.remoting.transport.ResponseCollectors.remoteNodeSuspected(ResponseCollectors.java:33)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:31)
at
org.infinispan.remoting.transport.impl.SingleResponseCollector.targetNotFound(SingleResponseCollector.java:17)
at
org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:23)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
at
org.infinispan.remoting.transport.impl.SingleTargetRequest.onNewView(SingleTargetRequest.java:42)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$null$3(JGroupsTransport.java:672)
at
org.infinispan.remoting.transport.impl.RequestRepository.lambda$forEach$0(RequestRepository.java:60)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.infinispan.remoting.transport.impl.RequestRepository.forEach(RequestRepository.java:60)
at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$receiveClusterView$4(JGroupsTransport.java:672)
at
org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)
... 3 more
[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node slave1 was
suspected]
11:23:34,203 ERROR [org.infinispan.statetransfer.StateConsumerImpl]
(transport-thread--p4-t5) ISPN000208: No live owners found for segments {59-60 71 73-74 78
81 146 180-181 185 192 217} of cache rest. Excluded owners: []
11:23:34,318 ERROR [org.infinispan.statetransfer.StateConsumerImpl]
(transport-thread--p4-t11) ISPN000208: No live owners found for segments {59-60 71 73-74
78 81 146 180-181 185 192 217} of cache memcachedCache. Excluded owners: []
11:23:34,332 WARN [org.infinispan.statetransfer.StateConsumerImpl]
(stateTransferExecutor-thread--p5-t61) Discarding received cache entries for segment 72 of
cache memcachedCache because they do not belong to this node.
11:23:34,396 WARN [org.infinispan.statetransfer.StateConsumerImpl]
(stateTransferExecutor-thread--p5-t57) Discarding received cache entries for segment 72 of
cache memcachedCache because they do not belong to this node.
11:23:34,398 ERROR [org.infinispan.statetransfer.StateConsumerImpl]
(transport-thread--p4-t1) ISPN000208: No live owners found for segments {71 74 146} of
cache hotrodDist. Excluded owners: []
11:23:34,521 WARN [org.jgroups.protocols.pbcast.NAKACK2] (jgroups-123,slave0)
JGRP000011: slave0: dropped message 338000 from non-member slave1 (view=[slave6|12] (3)
[slave6, slave0, slave5])
11:23:51,210 WARN [org.jgroups.protocols.pbcast.GMS] (jgroups-124,slave0)
slave0: not member of view [slave6|13]; discarding it
11:24:02,223 WARN [org.jgroups.protocols.pbcast.GMS] (jgroups-119,slave0)
slave0: failed to create view from delta-view; dropping view:
java.lang.IllegalStateException: the view-id of the delta view ([slave6|13]) doesn't
match the current view-id ([slave6|12]); discarding delta view [slave6|14],
ref-view=[slave6|13], left=[slave5]
11:24:02,231 WARN [org.jgroups.protocols.pbcast.GMS] (jgroups-119,slave0)
slave0: not member of view [slave6|14]; discarding it
11:24:11,932 WARN [org.jgroups.protocols.pbcast.GMS] (jgroups-119,slave0)
slave0: not member of view [slave5|15]; discarding it
11:24:12,485 INFO [org.infinispan.CLUSTER]
(VERIFY_SUSPECT.TimerThread-129,slave0) ISPN000094: Received new cluster view for channel
cluster: [slave0|16] (2) [slave0, slave5]
11:24:12,488 INFO [org.infinispan.CLUSTER]
(VERIFY_SUSPECT.TimerThread-129,slave0) ISPN100001: Node slave6 left the cluster
11:24:14,492 WARN [org.jgroups.protocols.pbcast.GMS]
(VERIFY_SUSPECT.TimerThread-129,slave0) slave0: failed to collect all ACKs (expected=1)
for view [slave0|16] after 2000ms, missing 1 ACKs from (1) slave5
11:24:34,209 WARN [org.jgroups.protocols.pbcast.NAKACK2] (jgroups-128,slave0)
JGRP000011: slave0: dropped message 319152 from non-member slave3 (view=[slave0|16] (2)
[slave0, slave5]) (received 11 identical messages from slave3 in the last 70294 ms)
11:25:10,121 INFO [org.infinispan.CLUSTER] (jgroups-128,slave0) ISPN000093:
Received new, MERGED cluster view for channel cluster: MergeView::[slave3|17] (7) [slave3,
slave5, slave2, slave0, slave4, slave1, slave7], 6 subgroups: [slave5|15] (1) [slave5],
[slave3|15] (2) [slave3, slave0], [slave0|16] (2) [slave0, slave5], [slave3|7] (8)
[slave3, slave6, slave0, slave5, slave4, slave2, slave7, slave1], [slave3|8] (7) [slave3,
slave6, slave0, slave5, slave4, slave2, slave1], [slave3|9] (6) [slave3, slave6, slave0,
slave5, slave4, slave1]
11:25:10,124 INFO [org.infinispan.CLUSTER] (jgroups-128,slave0) ISPN100000: Node
slave3 joined the cluster
11:25:10,127 INFO [org.infinispan.CLUSTER] (jgroups-128,slave0) ISPN100000: Node
slave2 joined the cluster
11:25:10,128 INFO [org.infinispan.CLUSTER] (jgroups-128,slave0) ISPN100000: Node
slave4 joined the cluster
11:25:10,129 INFO [org.infinispan.CLUSTER] (jgroups-128,slave0) ISPN100000: Node
slave1 joined the cluster
11:25:10,130 INFO [org.infinispan.CLUSTER] (jgroups-128,slave0) ISPN100000: Node
slave7 joined the cluster
11:25:10,362 WARN
[org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy]
(stateTransferExecutor-thread--p5-t58) ISPN000517: Ignoring cache topology from [slave0]
during merge: CacheTopology{id=49, phase=NO_REBALANCE, rebalanceId=14,
currentCH=DefaultConsistentHash{ns=256, owners = (3)[slave6: 86+75, slave5: 82+89, slave0:
88+92]}, pendingCH=null, unionCH=null, actualMembers=[slave6, slave5, slave0],
persistentUUIDs=[c1c2227d-2656-431e-a5b5-721459759a7f,
30123e75-2b33-46bc-a2e2-15f882782719, 2d550811-e842-4382-b8ae-3f36973f49f9]}
11:25:10,374 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t58) [Context=rest]ISPN100007: After merge (or
coordinator change), recovered members [slave5] with topology id 57
11:25:10,374 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t73) [Context=__global_tx_table__]ISPN100007: After
merge (or coordinator change), recovered members [slave5] with topology id 43
11:25:10,374 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t70) [Context=___hotRodTopologyCache]ISPN100007: After
merge (or coordinator change), recovered members [slave5] with topology id 43
11:25:10,374 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t71) [Context=___protobuf_metadata]ISPN100007: After
merge (or coordinator change), recovered members [slave5] with topology id 43
11:25:10,374 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t72) [Context=org.infinispan.CONFIG]ISPN100007: After
merge (or coordinator change), recovered members [slave5] with topology id 43
11:25:10,394 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t58) [Context=rest]ISPN100008: Updating cache members
list [slave5], topology id 58
11:25:10,415 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t70) [Context=___hotRodTopologyCache]ISPN100002:
Starting rebalance with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id
44
11:25:10,415 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t71) [Context=___protobuf_metadata]ISPN100002: Starting
rebalance with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 44
11:25:10,416 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t73) [Context=__global_tx_table__]ISPN100002: Starting
rebalance with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 44
11:25:10,419 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t71) [Context=hotrodRepl]ISPN100007: After merge (or
coordinator change), recovered members [slave5] with topology id 43
11:25:10,420 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t58) [Context=rest]ISPN100002: Starting rebalance with
members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 59
11:25:10,420 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t73) [Context=___script_cache]ISPN100007: After merge
(or coordinator change), recovered members [slave5] with topology id 42
11:25:10,419 WARN
[org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy]
(stateTransferExecutor-thread--p5-t70) ISPN000517: Ignoring cache topology from [slave0]
during merge: CacheTopology{id=47, phase=READ_ALL_WRITE_ALL, rebalanceId=14,
currentCH=DefaultConsistentHash{ns=256, owners = (3)[slave6: 83+30, slave5: 88+37, slave0:
85+35]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[slave6: 86+75, slave5:
82+89, slave0: 88+92]}, unionCH=null, actualMembers=[slave6, slave5, slave0],
persistentUUIDs=[c1c2227d-2656-431e-a5b5-721459759a7f,
30123e75-2b33-46bc-a2e2-15f882782719, 2d550811-e842-4382-b8ae-3f36973f49f9]}
11:25:10,422 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t70) [Context=memcachedCache]ISPN100007: After merge (or
coordinator change), recovered members [slave5] with topology id 52
11:25:10,424 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t72) [Context=org.infinispan.CONFIG]ISPN100002: Starting
rebalance with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 44
11:25:10,423 WARN
[org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy]
(stateTransferExecutor-thread--p5-t58) ISPN000517: Ignoring cache topology from [slave0]
during merge: CacheTopology{id=44, phase=READ_OLD_WRITE_ALL, rebalanceId=13,
currentCH=DefaultConsistentHash{ns=256, owners = (3)[slave6: 83+30, slave5: 88+37, slave0:
85+35]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[slave6: 86+75, slave5:
82+89, slave0: 88+92]}, unionCH=null, actualMembers=[slave6, slave5, slave0],
persistentUUIDs=[c1c2227d-2656-431e-a5b5-721459759a7f,
30123e75-2b33-46bc-a2e2-15f882782719, 2d550811-e842-4382-b8ae-3f36973f49f9]}
11:25:10,426 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t70) [Context=memcachedCache]ISPN100008: Updating cache
members list [slave5], topology id 53
11:25:10,426 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t58) [Context=hotrodDist]ISPN100007: After merge (or
coordinator change), recovered members [slave5] with topology id 49
11:25:10,430 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t58) [Context=hotrodDist]ISPN100008: Updating cache
members list [slave5], topology id 50
11:25:10,431 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t71) [Context=hotrodRepl]ISPN100002: Starting rebalance
with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 44
11:25:10,431 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t73) [Context=___script_cache]ISPN100002: Starting
rebalance with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 43
11:25:10,435 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t70) [Context=memcachedCache]ISPN100002: Starting
rebalance with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 54
11:25:10,438 INFO [org.infinispan.CLUSTER]
(stateTransferExecutor-thread--p5-t58) [Context=hotrodDist]ISPN100002: Starting rebalance
with members [slave5, slave0], phase READ_OLD_WRITE_ALL, topology id 51
11:25:12,134 WARN [org.jgroups.protocols.pbcast.GMS] (jgroups-128,slave0)
slave0: failed to collect all ACKs (expected=1) for view [slave3|17] after 2000ms, missing
1 ACKs from (1) slave5
11:26:11,452 INFO [org.radargun.Slave] (sc-main) Starting stage ClusterSplitVerify
11:26:11,453 ERROR [org.radargun.stages.monitor.ClusterSplitVerifyStage] (sc-main)
Cluster size at the beginning of the test was 8 but changed to 7 during the test! Perhaps
a split occured, or a new node joined?
11:26:11,454 INFO [org.radargun.Slave] (sc-main) Finished stage ClusterSplitVerify
11:26:11,455 INFO [org.radargun.RemoteMasterConnection] (sc-main) Message successfully
sent to the master
11:26:11,571 INFO [org.radargun.Slave] (sc-main) Starting stage ScenarioDestroy
11:26:11,573 INFO [org.radargun.stages.ScenarioDestroyStage] (sc-main) Scenario
finished, destroying...
11:26:11,575 INFO [org.radargun.stages.ScenarioDestroyStage] (sc-main) Memory before
cleanup:
Runtime free: 1,484,671 kb
Runtime max:27,960,320 kb
Runtime total:1,974,784 kb
{noformat}
If we run with master for "regression-cs-hotrod-dist-writes", it will fail because
of:
{noformat}
21:39:52,651 INFO [org.radargun.RemoteSlaveConnection] (main) Master started and
listening for connection on: /172.18.1.18:2103
21:39:52,651 INFO [org.radargun.RemoteSlaveConnection] (main) Waiting 5 seconds for
server socket to open completely
21:39:57,655 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
16 slaves.
21:39:57,666 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
15 slaves.
21:39:57,667 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
14 slaves.
21:39:57,668 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
13 slaves.
21:39:57,669 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
12 slaves.
21:39:57,670 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
11 slaves.
21:39:57,671 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
10 slaves.
21:39:57,672 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
9 slaves.
21:39:57,674 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
8 slaves.
21:39:57,675 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
7 slaves.
21:39:57,676 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
6 slaves.
21:39:57,677 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
5 slaves.
21:39:57,678 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
4 slaves.
21:39:57,708 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
3 slaves.
21:39:57,710 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
2 slaves.
21:39:57,711 INFO [org.radargun.RemoteSlaveConnection] (main) Awaiting registration from
1 slaves.
21:44:57,668 ERROR [org.radargun.Master] (main) Exception in Master.run:
java.io.IOException: 1 slaves haven't connected within timeout!
at org.radargun.RemoteSlaveConnection.establish(RemoteSlaveConnection.java:112)
~[radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.Master.run(Master.java:59) [radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.LaunchMaster.main(LaunchMaster.java:34)
[radargun-core-3.0.0-SNAPSHOT.jar:?]
21:44:57,697 WARN [org.radargun.RemoteSlaveConnection] (main) Failed to send termination
to slaves.
java.lang.NullPointerException: null
at
org.radargun.RemoteSlaveConnection$SlaveRecord.access$100(RemoteSlaveConnection.java:63)
~[radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.RemoteSlaveConnection.mcastBuffer(RemoteSlaveConnection.java:201)
~[radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.RemoteSlaveConnection.mcastObject(RemoteSlaveConnection.java:211)
~[radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.RemoteSlaveConnection.release(RemoteSlaveConnection.java:357)
[radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.Master.run(Master.java:155) [radargun-core-3.0.0-SNAPSHOT.jar:?]
at org.radargun.LaunchMaster.main(LaunchMaster.java:34)
[radargun-core-3.0.0-SNAPSHOT.jar:?]
21:44:57,703 INFO [org.radargun.ShutDownHook] (Thread-1) Master process is being
shutdown
Master 16071 finished with value 127
kill: sending signal to 16071 failed: No such process
kill: sending signal to 16071 failed: No such process
{noformat}
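To read the number of slaves that never registered off a master log like the one above, a small script is enough. This is only a sketch: the `missing_slaves` helper and its regex are my own assumptions based on the message format shown in this log, not part of RadarGun.

```python
import re

# Hypothetical helper, not part of RadarGun. It relies only on the
# "Awaiting registration from N slaves" message format seen in the log above.
# \s+ is used because the pasted log wraps the line before the number.
AWAITING = re.compile(r"Awaiting registration from\s+(\d+) slaves")


def missing_slaves(log_text):
    """Return the last 'awaiting registration' count in the log, or None if absent."""
    counts = [int(m.group(1)) for m in AWAITING.finditer(log_text)]
    return counts[-1] if counts else None
```

Run against the master log above, it should report 1, which matches the "1 slaves haven't connected within timeout!" exception.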
The tests exercise HotRod read operations against a replicated and a distributed
cache.
The scenario is:
8 - Servers
8 - Slaves
Commits:
13/May - df9ffb5ba46752d2509aa3a08c59519469cc929a - Passed - Executed 3 times
14/May - df9ffb5ba46752d2509aa3a08c59519469cc929a - Passed - Executed 3 times (I
didn't execute this one because it is the same commit as above; listed just to keep the history)
15/May - 26ba1aeb1d66cf65fb5c410ec98629093c29ab0b - Need to double check (It passed)
16/May - 92a5e4f62c39d63221aed2ed5763081b626874e6 - Need to double check (It starts
failing here)