[JBoss JIRA] (ISPN-4766) Cache can't start if coordinator leaves during join and joiner becomes the new coordinator
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4766?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-4766:
------------------------------------
This caused a test failure in CI because FD_SOCK timed out opening the socket (the timeout is 1s) and the 2nd node split into a separate cluster immediately after joining (ISPN-4787).
> Cache can't start if coordinator leaves during join and joiner becomes the new coordinator
> ------------------------------------------------------------------------------------------
>
> Key: ISPN-4766
> URL: https://issues.jboss.org/browse/ISPN-4766
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: testsuite_stability
> Fix For: 7.0.0.CR1
>
>
> When the joiner becomes the coordinator, it tries to recover the current cache topologies, but it receives just one expected member and no current topology. This causes an NPE in ClusterCacheStatus:
> {noformat}
> 22:51:49,547 ERROR (transport-thread-NodeB-p21124-t1:) [ClusterCacheStatus] ISPN000228: Failed to recover cache dist state after the current node became the coordinator
> java.lang.NullPointerException
> at org.infinispan.partionhandling.impl.PreferAvailabilityStrategy.onPartitionMerge(PreferAvailabilityStrategy.java:104)
> at org.infinispan.topology.ClusterCacheStatus.doMergePartitions(ClusterCacheStatus.java:452)
> at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:260)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:180)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:427)
> {noformat}
> The LocalTopologyManagerImpl waits a bit after receiving the SuspectException and tries again, but this time it receives a {{null}} initial topology, causing another NPE:
> {noformat}
> 22:51:51,319 DEBUG (testng-GlobalKeySetTaskTest:) [LocalTopologyManagerImpl] Error sending join request for cache dist to coordinator
> java.lang.NullPointerException
> at org.infinispan.topology.LocalTopologyManagerImpl.resetLocalTopologyBeforeRebalance(LocalTopologyManagerImpl.java:222)
> at org.infinispan.topology.LocalTopologyManagerImpl.handleTopologyUpdate(LocalTopologyManagerImpl.java:191)
> at org.infinispan.topology.LocalTopologyManagerImpl.join(LocalTopologyManagerImpl.java:105)
> at org.infinispan.statetransfer.StateTransferManagerImpl.start(StateTransferManagerImpl.java:108)
> {noformat}
> This keeps going on until the state transfer timeout expires.
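The second trace comes from dereferencing a {{null}} initial topology during join. A minimal sketch of the defensive alternative follows, assuming a {{null}} reply can be treated as "coordinator not ready yet" and retried. {{JoinRetrySketch}}, {{Topology}} and the attempt limit are hypothetical names for illustration; this is neither Infinispan's API nor the fix that was applied for this issue.

```java
import java.util.function.Supplier;

// Hypothetical sketch: none of these types exist in Infinispan.
final class JoinRetrySketch {

    /** Stand-in for a cache topology; only the id matters here. */
    static final class Topology {
        final int id;

        Topology(int id) {
            this.id = id;
        }
    }

    /**
     * Asks the coordinator for the initial topology up to maxAttempts times.
     * A null reply is retried instead of being dereferenced.
     */
    static Topology join(Supplier<Topology> coordinator, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Topology initial = coordinator.get();
            if (initial != null) {
                return initial;
            }
        }
        throw new IllegalStateException(
                "No initial topology after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The simulated coordinator answers null twice, then returns topology 7.
        Topology t = join(() -> ++calls[0] < 3 ? null : new Topology(7), 5);
        System.out.println("joined with topology " + t.id + " after " + calls[0] + " calls");
    }
}
```

Failing with an explicit exception after a bounded number of attempts at least names the cause, instead of repeating the NPE until the state transfer timeout expires.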
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4787) FD_SOCK timeout causing random test failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4787?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-4787:
------------------------------------
CI failure here (DEBUG only): http://ci.infinispan.org/viewLog.html?buildId=12388&buildTypeId=bt9&tab=b...
> FD_SOCK timeout causing random test failures
> --------------------------------------------
>
> Key: ISPN-4787
> URL: https://issues.jboss.org/browse/ISPN-4787
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 7.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.CR1
>
>
> When a test doesn't require failure detection, we remove the FD protocol from the JGroups stack, but we keep FD_SOCK. Normally this isn't a problem, but on rare occasions it can fail to open the ping socket and the cluster doesn't form:
> {noformat}
> 22:51:45,978 DEBUG (testng-GlobalKeySetTaskTest:) [FD_SOCK] NodeA-60950: VIEW_CHANGE received: [NodeA-60950]
> 22:51:46,401 DEBUG (Incoming-1,NodeA-60950:) [FD_SOCK] NodeA-60950: VIEW_CHANGE received: [NodeA-60950, NodeB-24360]
> 22:51:46,675 DEBUG (FD_SOCK pinger,NodeA-60950:) [FD_SOCK] NodeA-60950: ping_dest is NodeB-24360, pingable_mbrs=[NodeA-60950, NodeB-24360]
> 22:51:46,803 DEBUG (testng-GlobalKeySetTaskTest:) [FD_SOCK] NodeB-24360: VIEW_CHANGE received: [NodeA-60950, NodeB-24360]
> 22:51:47,149 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: ping_dest is NodeA-60950, pingable_mbrs=[NodeA-60950, NodeB-24360]
> 22:51:49,113 WARN (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: creating the client socket failed: java.net.SocketTimeoutException
> 22:51:49,116 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: could not create socket to NodeA-60950 (pinger thread is running)
> 22:51:49,116 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: suspecting NodeA-60950
> 22:51:49,117 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: ping_dest is null, pingable_mbrs=[NodeB-24360]
> 22:51:49,117 DEBUG (INT-2,NodeB-24360:) [FD_SOCK] NodeB-24360: suspecting [NodeA-60950]
> 22:51:49,262 DEBUG (Incoming-1,NodeB-24360:) [FD_SOCK] NodeB-24360: VIEW_CHANGE received: [NodeB-24360]
> 22:55:49,387 DEBUG (FD_SOCK pinger,NodeA-60950:) [FD_SOCK] 89fe2d3e-0b0a-dae8-a63a-6272ea5b7372: socket to NodeB-24360 was closed gracefully
> {noformat}
> We should increase {{FD_SOCK.sock_conn_timeout}} and remove FD_SOCK from the stack unless the test uses {{TransportFlags.withMerge()}}.
[JBoss JIRA] (ISPN-4789) Avoid having two OSGi containers running at the same time.
by Ion Savin (JIRA)
Ion Savin created ISPN-4789:
-------------------------------
Summary: Avoid having two OSGi containers running at the same time.
Key: ISPN-4789
URL: https://issues.jboss.org/browse/ISPN-4789
Project: Infinispan
Issue Type: Task
Affects Versions: 7.0.0.Beta2
Reporter: Ion Savin
Assignee: Ion Savin
Combining PerSuite and PerClass/PerMethod PAX EXAM reactors results in two containers running at the same time, which could trigger resource conflicts similar to the one below:
{noformat}
karaf@root> Exception in thread "JMX Connector Thread [service:jmx:rmi://0.0.0.0:44444/jndi/rmi://0.0.0.0:1099/karaf-root]" java.lang.RuntimeException:
Port already in use: 44444;
You may have started two containers. If you need to start a second container or the default ports are already in use update the config file etc/org.apache.karaf.management.cfg and change the Registry Port and Server Port to unused ports
at org.apache.karaf.management.ConnectorServerFactory$1.run(ConnectorServerFactory.java:244)
{noformat}
To avoid this issue, split the OSGi tests into two groups with separate Surefire executions: one with the container reused and one with the container started for each class/method.
Related to an earlier workaround: ISPN-4487
[JBoss JIRA] (ISPN-4788) Remove pax-url-mvn dependency
by Ion Savin (JIRA)
Ion Savin created ISPN-4788:
-------------------------------
Summary: Remove pax-url-mvn dependency
Key: ISPN-4788
URL: https://issues.jboss.org/browse/ISPN-4788
Project: Infinispan
Issue Type: Task
Affects Versions: 7.0.0.Beta2
Reporter: Ion Savin
Assignee: Ion Savin
Remove the dependency on pax-url-mvn as it's been replaced by pax-url-aether.
[JBoss JIRA] (ISPN-4787) FD_SOCK timeout causing random test failures
by Dan Berindei (JIRA)
Dan Berindei created ISPN-4787:
----------------------------------
Summary: FD_SOCK timeout causing random test failures
Key: ISPN-4787
URL: https://issues.jboss.org/browse/ISPN-4787
Project: Infinispan
Issue Type: Bug
Components: Test Suite - Core
Affects Versions: 7.0.0.Beta2
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Blocker
Fix For: 7.0.0.CR1
When a test doesn't require failure detection, we remove the FD protocol from the JGroups stack, but we keep FD_SOCK. Normally this isn't a problem, but on rare occasions it can fail to open the ping socket and the cluster doesn't form:
{noformat}
22:51:45,978 DEBUG (testng-GlobalKeySetTaskTest:) [FD_SOCK] NodeA-60950: VIEW_CHANGE received: [NodeA-60950]
22:51:46,401 DEBUG (Incoming-1,NodeA-60950:) [FD_SOCK] NodeA-60950: VIEW_CHANGE received: [NodeA-60950, NodeB-24360]
22:51:46,675 DEBUG (FD_SOCK pinger,NodeA-60950:) [FD_SOCK] NodeA-60950: ping_dest is NodeB-24360, pingable_mbrs=[NodeA-60950, NodeB-24360]
22:51:46,803 DEBUG (testng-GlobalKeySetTaskTest:) [FD_SOCK] NodeB-24360: VIEW_CHANGE received: [NodeA-60950, NodeB-24360]
22:51:47,149 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: ping_dest is NodeA-60950, pingable_mbrs=[NodeA-60950, NodeB-24360]
22:51:49,113 WARN (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: creating the client socket failed: java.net.SocketTimeoutException
22:51:49,116 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: could not create socket to NodeA-60950 (pinger thread is running)
22:51:49,116 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: suspecting NodeA-60950
22:51:49,117 DEBUG (FD_SOCK pinger,NodeB-24360:) [FD_SOCK] NodeB-24360: ping_dest is null, pingable_mbrs=[NodeB-24360]
22:51:49,117 DEBUG (INT-2,NodeB-24360:) [FD_SOCK] NodeB-24360: suspecting [NodeA-60950]
22:51:49,262 DEBUG (Incoming-1,NodeB-24360:) [FD_SOCK] NodeB-24360: VIEW_CHANGE received: [NodeB-24360]
22:55:49,387 DEBUG (FD_SOCK pinger,NodeA-60950:) [FD_SOCK] 89fe2d3e-0b0a-dae8-a63a-6272ea5b7372: socket to NodeB-24360 was closed gracefully
{noformat}
We should increase {{FD_SOCK.sock_conn_timeout}} and remove FD_SOCK from the stack unless the test uses {{TransportFlags.withMerge()}}.
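The stack-trimming half of the proposal can be sketched as follows, reading it literally: FD stays only when the test needs failure detection, and FD_SOCK stays only when the test uses merge. This is a hypothetical model in which the JGroups stack is reduced to an ordered list of protocol names; {{TestStackSketch}} and {{filter}} are illustrative names, not the test suite's API. The {{sock_conn_timeout}} increase is a plain property change and is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a protocol stack is just an ordered list of names.
final class TestStackSketch {

    /** Returns the protocols to keep for a test with the given requirements. */
    static List<String> filter(List<String> stack,
                               boolean withFailureDetection,
                               boolean withMerge) {
        List<String> kept = new ArrayList<>();
        for (String protocol : stack) {
            // FD is only useful when the test exercises failure detection.
            if (protocol.equals("FD") && !withFailureDetection) {
                continue;
            }
            // Proposal: drop FD_SOCK unless the test uses merge.
            if (protocol.equals("FD_SOCK") && !withMerge) {
                continue;
            }
            kept.add(protocol);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> stack = Arrays.asList("UDP", "PING", "FD_SOCK", "FD", "GMS");
        System.out.println(filter(stack, false, false));
        System.out.println(filter(stack, false, true));
    }
}
```

With both flags off, neither failure detector remains, so a slow socket connect can no longer make one node suspect the other right after the join.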
[JBoss JIRA] (ISPN-4784) TestResourceTracker not working properly for OSGi tests
by Ion Savin (JIRA)
Ion Savin created ISPN-4784:
-------------------------------
Summary: TestResourceTracker not working properly for OSGi tests
Key: ISPN-4784
URL: https://issues.jboss.org/browse/ISPN-4784
Project: Infinispan
Issue Type: Enhancement
Components: Test Suite - Core
Affects Versions: 7.0.0.Beta2
Reporter: Ion Savin
Assignee: Dan Berindei
The OSGi tests run in a different process from the test driver and are executed through RMI. The assumptions that the test name is contained in the thread name and that there's a one-to-one mapping from thread to test are no longer valid (a ThreadLocal is used for the test name).
Executing the tests in integrationtests/osgi will result in many log messages similar to this one:
{noformat}
Test name not set in unknown thread RMI TCP Connection(3)-127.0.0.1
{noformat}
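The effect can be reproduced outside OSGi with a small sketch: a name stored in a ThreadLocal on the test thread is not visible on a worker thread that serves the call, which is the situation an RMI connection thread is in. {{TestNameSketch}} and its methods are hypothetical and only mimic the idea; they are not the real TestResourceTracker API, and the worker thread name is made up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a per-thread test name lookup.
final class TestNameSketch {

    private static final ThreadLocal<String> TEST_NAME = new ThreadLocal<>();

    static void setTestName(String name) {
        TEST_NAME.set(name);
    }

    /** Returns the name set on the calling thread, or a fallback message. */
    static String currentTestName() {
        String name = TEST_NAME.get();
        if (name != null) {
            return name;
        }
        return "Test name not set in unknown thread " + Thread.currentThread().getName();
    }

    /** Runs the lookup on a separate worker thread, as a remote call would. */
    static String lookupOnOtherThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "RMI TCP Connection(sketch)"));
        try {
            Callable<String> task = TestNameSketch::currentTestName;
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        setTestName("GlobalKeySetTaskTest");
        System.out.println(currentTestName());
        System.out.println(lookupOnOtherThread());
    }
}
```

The worker thread never had the name set, so the lookup falls through to the fallback message even though the test thread set it moments earlier.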