[JBoss JIRA] (ISPN-6275) Double invalidate of invalid Hot Rod connections
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-6275?page=com.atlassian.jira.plugin.... ]
Work on ISPN-6275 started by Galder Zamarreño.
----------------------------------------------
> Double invalidate of invalid Hot Rod connections
> ------------------------------------------------
>
> Key: ISPN-6275
> URL: https://issues.jboss.org/browse/ISPN-6275
> Project: Infinispan
> Issue Type: Bug
> Components: Remote Protocols
> Affects Versions: 6.0.2.Final
> Reporter: Dennis Reed
> Assignee: Galder Zamarreño
>
> When there's a problem with a Hot Rod operation, RetryOnFailureOperation invalidates the connection twice (once in a catch block, and once in a finally block).
> This causes the GenericKeyedObjectPool counts to get off, and anything relying on that count (such as the maxTotal configuration for the pool) to break.
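A minimal, hypothetical sketch of the anti-pattern described above (not the actual RetryOnFailureOperation code): invalidating the same connection in both the catch and the finally block decrements the pool's active-object count twice for a single failure, which is what throws off GenericKeyedObjectPool's accounting. The class and field names here are illustrative stand-ins.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DoubleInvalidateSketch {
    // Stand-in for the pool's active-object accounting.
    static final AtomicInteger activeCount = new AtomicInteger(1);

    static void invalidate() {
        // Each call decrements the count, whether or not the
        // connection was already invalidated.
        activeCount.decrementAndGet();
    }

    static void buggyRetry() {
        boolean failed = false;
        try {
            throw new RuntimeException("simulated Hot Rod failure");
        } catch (RuntimeException e) {
            invalidate();     // first invalidation, in the catch block
            failed = true;
        } finally {
            if (failed) {
                invalidate(); // second invalidation of the same connection
            }
        }
    }

    public static void main(String[] args) {
        buggyRetry();
        // One checked-out connection, one failure, but the count dropped by 2.
        System.out.println("active count after failure: " + activeCount.get());
    }
}
```

The usual fix is to make invalidation happen exactly once per failure, e.g. invalidate only in the catch block and return the connection in the finally block only when it was not invalidated.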
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6388) Spark integration - TimeoutException: Replication timeout on application execution
by Matej Čimbora (JIRA)
[ https://issues.jboss.org/browse/ISPN-6388?page=com.atlassian.jira.plugin.... ]
Matej Čimbora commented on ISPN-6388:
-------------------------------------
I looked into the issue some time ago, but couldn't finish it due to a context switch. DistributedCacheStream.rehashAwareIteration shows multiple stayLocal=false evaluations.
> Spark integration - TimeoutException: Replication timeout on application execution
> -----------------------------------------------------------------------------------
>
> Key: ISPN-6388
> URL: https://issues.jboss.org/browse/ISPN-6388
> Project: Infinispan
> Issue Type: Bug
> Components: Spark
> Affects Versions: 8.2.0.Final
> Reporter: Matej Čimbora
> Attachments: app_0.txt, driver.txt, server.txt
>
>
> The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).
> Using an 8-node cluster (1 worker/1 ISPN server per physical node), connector v. 0.2.
> Attached: sample driver, server, and application logs.
[JBoss JIRA] (ISPN-6387) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: Replication timeout for X
by Radoslav Husar (JIRA)
[ https://issues.jboss.org/browse/ISPN-6387?page=com.atlassian.jira.plugin.... ]
Radoslav Husar commented on ISPN-6387:
--------------------------------------
Tried to backport https://github.com/infinispan/infinispan/pull/4133 but that did not help.
> ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: Replication timeout for X
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: ISPN-6387
> URL: https://issues.jboss.org/browse/ISPN-6387
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.1.2.Final
> Reporter: Radoslav Husar
> Assignee: Radoslav Husar
>
> Booting WildFly with caches starting yields the following after 1 minute:
> The problematic call originates in Infinispan's org.infinispan.topology.ClusterTopologyManagerImpl#confirmMembersAvailable heartbeat command.
> {noformat}
> 00:20:51,646 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (timeout-thread--p10-t1) Response: sender=node2, received=false, suspected=false
> 00:20:51,647 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p13-t2) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: Replication timeout for node2
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:765)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$0(JGroupsTransport.java:599)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46)
> at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
[JBoss JIRA] (ISPN-6388) Spark integration - TimeoutException: Replication timeout on application execution
by Matej Čimbora (JIRA)
[ https://issues.jboss.org/browse/ISPN-6388?page=com.atlassian.jira.plugin.... ]
Matej Čimbora updated ISPN-6388:
--------------------------------
Description:
The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).
Using an 8-node cluster (1 worker/1 ISPN server per physical node), connector v. 0.2.
Attached: sample driver, server, and application logs.
was:
The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).
Using an 8-node cluster (1 worker/1 ISPN server per physical node).
Attached: sample driver, server, and application logs.
> Spark integration - TimeoutException: Replication timeout on application execution
> -----------------------------------------------------------------------------------
>
> Key: ISPN-6388
> URL: https://issues.jboss.org/browse/ISPN-6388
> Project: Infinispan
> Issue Type: Bug
> Components: Spark
> Affects Versions: 8.2.0.Final
> Reporter: Matej Čimbora
> Attachments: app_0.txt, driver.txt, server.txt
>
>
> The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).
> Using an 8-node cluster (1 worker/1 ISPN server per physical node), connector v. 0.2.
> Attached: sample driver, server, and application logs.
[JBoss JIRA] (ISPN-6388) Spark integration - TimeoutException: Replication timeout on application execution
by Matej Čimbora (JIRA)
Matej Čimbora created ISPN-6388:
-----------------------------------
Summary: Spark integration - TimeoutException: Replication timeout on application execution
Key: ISPN-6388
URL: https://issues.jboss.org/browse/ISPN-6388
Project: Infinispan
Issue Type: Bug
Components: Spark
Affects Versions: 8.2.0.Final
Reporter: Matej Čimbora
Attachments: app_0.txt, driver.txt, server.txt
The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).
Using an 8-node cluster (1 worker/1 ISPN server per physical node).
Attached: sample driver, server, and application logs.
[JBoss JIRA] (ISPN-6239) InitialClusterSizeTest.testInitialClusterSizeFail random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-6239?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-6239:
------------------------------------
While trying to reproduce the failure on my machine, I found another failure caused by a concurrency issue in {{TEST_PING}}:
{noformat}
12:44:36,043 TRACE (ForkThread-4,InitialClusterSizeTest:) [TEST_PING] Discoveries for DiscoveryKey{clusterName='ISPN', testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are : {}
12:44:36,043 TRACE (ForkThread-1,InitialClusterSizeTest:) [TEST_PING] Discoveries for DiscoveryKey{clusterName='ISPN', testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are : {}
12:44:36,043 TRACE (ForkThread-1,InitialClusterSizeTest:) [TEST_PING] Add discovery for NodeA-45697 to cache. The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697}
12:44:36,043 TRACE (ForkThread-4,InitialClusterSizeTest:) [TEST_PING] Add discovery for NodeD-30921 to cache. The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697}
12:44:36,043 TRACE (ForkThread-3,InitialClusterSizeTest:) [TEST_PING] Discoveries for DiscoveryKey{clusterName='ISPN', testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are : {NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697}
12:44:36,043 TRACE (ForkThread-3,InitialClusterSizeTest:) [TEST_PING] Add discovery for NodeC-59583 to cache. The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697, NodeC-59583=TEST_PING@NodeC-59583}
12:44:36,043 TRACE (ForkThread-2,InitialClusterSizeTest:) [TEST_PING] Discoveries for DiscoveryKey{clusterName='ISPN', testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are : {NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697, NodeC-59583=TEST_PING@NodeC-59583}
12:44:36,044 TRACE (ForkThread-2,InitialClusterSizeTest:) [TEST_PING] Add discovery for NodeB-6005 to cache. The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697, NodeB-6005=TEST_PING@NodeB-6005, NodeC-59583=TEST_PING@NodeC-59583}
12:44:36,044 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: discovery took 2 ms, members: 1 rsps (0 coords) [done]
12:44:36,044 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: could not determine coordinator from rsps 1 rsps (0 coords) [done]
12:44:36,045 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: nodes to choose new coord from are: [NodeD-30921, NodeA-45697]
12:44:36,045 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: I (NodeD-30921) am the first of the nodes, will become coordinator
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: discovery took 3 ms, members: 3 rsps (0 coords) [done]
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: could not determine coordinator from rsps 3 rsps (0 coords) [done]
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: nodes to choose new coord from are: [NodeC-59583, NodeD-30921, NodeB-6005, NodeA-45697]
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: I (NodeB-6005) am not the first of the nodes, waiting for another client to become coordinator
{noformat}
The cluster starts as 2 partitions with NodeB and NodeD as coordinators, and because the test doesn't use {{TransportFlags.withMerge()}}, the partitions will never merge.
> InitialClusterSizeTest.testInitialClusterSizeFail random failures
> -----------------------------------------------------------------
>
> Key: ISPN-6239
> URL: https://issues.jboss.org/browse/ISPN-6239
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 8.2.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_failure
> Fix For: 8.2.0.CR1, 8.2.0.Final
>
>
> The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects the nodes to fail to start within {{initialClusterTimeout}} + 1 second.
> However, because of a bug in {{TEST_PING}}, the first 2 nodes see each other as coordinator and send a {{JOIN}} request to each other, and it takes 3 seconds to recover and start the cluster properly.
> The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the protocol stack, the partitions would never merge and the test would fail with a timeout.
> I fixed this in {{TEST_PING}} by assuming that the sender of the first discovery response is a coordinator when there is a single response. This worked because all but a few tests start their managers sequentially; however, it sometimes introduces this 3-second delay when nodes start in parallel.
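To illustrate the discovery race described above, here is a deliberately simplified, sequentialized sketch (not the real TEST_PING implementation, and the node names are illustrative): two nodes snapshot the shared discovery map while it is still empty, each sees no existing members, and each elects itself coordinator, yielding two one-node partitions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DiscoveryRaceSketch {
    public static void main(String[] args) {
        // Shared TEST_PING-style discovery cache.
        Map<String, String> discoveries = new LinkedHashMap<>();
        List<String> coordinators = new ArrayList<>();

        // The interleaving from the trace above: both nodes read the map
        // while it is still empty ("Discoveries ... are : {}") ...
        int responsesSeenByA = discoveries.size(); // 0
        int responsesSeenByD = discoveries.size(); // 0

        // ... and only then register themselves ("Add discovery for ... to cache").
        discoveries.put("NodeA", "TEST_PING@NodeA");
        discoveries.put("NodeD", "TEST_PING@NodeD");

        // With no responses seen, each node falls back to electing itself.
        if (responsesSeenByA == 0) coordinators.add("NodeA");
        if (responsesSeenByD == 0) coordinators.add("NodeD");

        // Two coordinators means two partitions; without MERGE3 in the stack
        // (or TransportFlags.withMerge()) they never merge.
        System.out.println("coordinators: " + coordinators);
    }
}
```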