[JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-4949:
------------------------------------
I have talked to Bela and he's considering installing the view in two phases. In the first phase, the coordinator would check that each of the view members is available, so it wouldn't be possible for B to install view BCD.
However, we can do the same thing in Infinispan. Right now, the cache topology installation is done with a single asynchronous {{CacheTopologyControlCommand(CH_UPDATE)}} command. We can add a prepare phase that checks that all the topology members are available (i.e. responding to coordinator messages), and that would also prevent node B from installing cache topology BCD.
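The prepare-then-install idea can be sketched in plain Java. This is only an illustration under stated assumptions: `TwoPhaseTopologyInstall`, `install`, and the `respondsToCoordinator` predicate are hypothetical names, not Infinispan or JGroups API, and the real check would be an RPC round rather than a local predicate.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical two-phase topology install; none of these names exist in Infinispan. */
class TwoPhaseTopologyInstall {
    /**
     * Phase 1 (prepare): every proposed member must answer the coordinator.
     * Phase 2 (install): runs only if phase 1 succeeded for all members.
     * Returns true if the topology was installed.
     */
    static boolean install(List<String> proposedMembers,
                           Predicate<String> respondsToCoordinator,
                           Runnable installTopology) {
        for (String member : proposedMembers) {
            if (!respondsToCoordinator.test(member)) {
                // Abort: a member is unreachable, so the old topology stays in place.
                return false;
            }
        }
        installTopology.run();
        return true;
    }
}
```

With a node that can only reach some of the proposed members, `install` returns false and the topology is never installed; which members are unreachable in the real scenario depends on how the partition happens.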
We could also say Infinispan shouldn't keep a partition as available if it's possible that writes from another partition will succeed. With {{numOwners == 2}}, if both partitions eliminate one key owner from the CH, they can both update the key. But if we required each partition to have a majority of owners, and {{numOwners == 3}}, it wouldn't be possible to update the key in both partitions. The main problem with that is that we've always pushed {{numOwners = 2}} as the default, and with this change Infinispan would enter degraded mode after a single node crash.
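A minimal sketch of the majority-of-owners rule, assuming owners and partition members are identified by plain node names. `OwnerMajority` and `hasOwnerMajority` are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating the "majority of owners" availability rule. */
class OwnerMajority {
    /** True if the partition contains a strict majority of the key's owners. */
    static boolean hasOwnerMajority(List<String> owners, Set<String> partition) {
        long present = owners.stream().filter(partition::contains).count();
        // Strict majority: two disjoint partitions can never both satisfy this.
        return present * 2 > owners.size();
    }
}
```

With three owners A, B, C and partitions {A, B} and {C, D}, only {A, B} holds a majority, so only one side could accept the write. With two owners, losing either one leaves 1 of 2, which is not a strict majority, matching the concern that a single crash would put the cache into degraded mode.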
> Split brain: inconsistent data after merge
> ------------------------------------------
>
> Key: ISPN-4949
> URL: https://issues.jboss.org/browse/ISPN-4949
> Project: Infinispan
> Issue Type: Bug
> Components: State Transfer
> Affects Versions: 7.0.0.Final
> Reporter: Radim Vansa
> Priority: Critical
>
> 1) cluster A, B, C, D splits into 2 parts:
> A, B (coord A) finds this out immediately and enters degraded mode with CH [A, B, C, D]
> C, D (coord D) first detects that B is lost, gets view A, C, D and starts rebalance with CH [A, C, D]. Segment X is primary owned by C (it had backup on B but this got lost)
> 2) D detects that A was lost as well, therefore enters degraded mode with CH [A, C, D]
> 3) C inserts an entry into X: all owners (only C) are present, therefore the modification is allowed
> 4) cluster is merged and coordinator finds out that the max stable topology has CH [A, B, C, D] (it is the older of the two partitions' topologies, got from A, B) - logs 'No active or unavailable partitions, so all the partitions must be in degraded mode' (yes, all partitions are in degraded mode, but a write has happened in the meantime)
> 5) The old CH is broadcast in newest topology, no rebalance happens
> 6) Inconsistency: read in X may miss the update
--
This message was sent by Atlassian JIRA
(v6.3.8#6338)
[JBoss JIRA] (ISPN-4631) NodeAuthentication*PassIT.testReadItemOnJoiningNode fails on RHEL6
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4631?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4631:
-----------------------------------------------
Vojtech Juranek <vjuranek(a)redhat.com> changed the Status of [bug 1148738|https://bugzilla.redhat.com/show_bug.cgi?id=1148738] from ON_QA to ASSIGNED
> NodeAuthentication*PassIT.testReadItemOnJoiningNode fails on RHEL6
> ------------------------------------------------------------------
>
> Key: ISPN-4631
> URL: https://issues.jboss.org/browse/ISPN-4631
> Project: Infinispan
> Issue Type: Bug
> Components: Integration, Security
> Affects Versions: 7.0.0.Beta1
> Reporter: Dan Berindei
> Assignee: Vojtech Juranek
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.CR1
>
>
> Failures appear only on the RHEL agents in CI, both in NodeAuthenticationKrbPassIT and NodeAuthenticationMD5PassIT:
> {noformat}
> java.lang.AssertionError: expected:<test_value> but was:<null>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:144)
> at org.infinispan.test.integration.security.embedded.AbstractNodeAuthentication.testReadItemOnJoiningNode(AbstractNodeAuthentication.java:94)
> at org.infinispan.test.integration.security.embedded.NodeAuthenticationKrbPassIT.testReadItemOnJoiningNode(NodeAuthenticationKrbPassIT.java:71)
> {noformat}
> The failure in {{NodeAuthentication*FailIT.testReadItemOnJoiningNode}} is almost certainly related:
> {noformat}
> java.lang.Exception: Unexpected exception, expected<org.infinispan.manager.EmbeddedCacheManagerStartupException> but was<java.lang.Exception>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:144)
> at org.infinispan.test.integration.security.embedded.AbstractNodeAuthentication.testReadItemOnJoiningNode(AbstractNodeAuthentication.java:94)
> at org.infinispan.test.integration.security.embedded.NodeAuthenticationMD5FailIT.testReadItemOnJoiningNode(NodeAuthenticationMD5FailIT.java:55)
> {noformat}
> http://ci.infinispan.org/viewLog.html?buildId=10776&tab=buildResultsDiv&b...
[JBoss JIRA] (ISPN-4631) NodeAuthentication*PassIT.testReadItemOnJoiningNode fails on RHEL6
by Vojtech Juranek (JIRA)
[ https://issues.jboss.org/browse/ISPN-4631?page=com.atlassian.jira.plugin.... ]
Vojtech Juranek reopened ISPN-4631:
-----------------------------------
Some tests now fail with:
{noformat}
org.jboss.arquillian.container.spi.client.container.LifecycleException: The server is already running! Managed containers do not support connecting to running server instances due to the possible harmful effect of connecting to the wrong server. Please stop server before running or change to another type of container.
To disable this check and allow Arquillian to connect to a running server, set allowConnectingToRunningServer to true in the container configuration
{noformat}
Tests need a {{@BeforeClass}} hook that ensures no server from a previous test is still running (and an {{@AfterClass}} check that no server survives).
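One possible shape for such a hook is a small guard that probes a TCP port before and after the test class. This is a sketch under assumptions: `ServerGuard` and its methods are hypothetical names, and the host and port to probe (for example the server's management port) are not given in this issue and must be supplied by the test.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical guard; call it from JUnit @BeforeClass / @AfterClass methods. */
class ServerGuard {
    /** Returns true if something accepts TCP connections on host:port. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused or timed out: nothing is listening
        }
    }

    /** Fails fast if a server left over from another test is still reachable. */
    static void assertNoServerRunning(String host, int port) {
        if (isListening(host, port, 500)) {
            throw new IllegalStateException(
                    "A server is still listening on " + host + ":" + port);
        }
    }
}
```

A port probe only detects a server that is still accepting connections; a process that is shutting down but has already closed its sockets would not be caught this way.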
[JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Dan Berindei edited comment on ISPN-4949 at 11/10/14 4:46 AM:
--------------------------------------------------------------
As commented on IRC by [~dan.berindei], the problem is deeper. When the cluster ABCD can break into views ABC and BCD, it's possible that both (active) parts will modify an entry.
It seems the nodes need to achieve consensus about view membership - C must not be part of two views at any moment. That requires a modification in JGroups, not in Infinispan, and may prove troublesome even from a theoretical perspective.
was (Author: rvansa):
As commented on IRC by [~dan.berindei], the problem is deeper. When the cluster ABCD can break into views ABC, CDB, it's possible that both (active) parts will modify an entry.
It seems the nodes need to achieve consensus about view membership - C must not be part of two views at any moment. That requires a modification in JGroups, not in Infinispan, and may prove troublesome even from theoretical perspective.
[JBoss JIRA] (ISPN-4828) Increasing default internal thread pool size
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4828?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration updated ISPN-4828:
------------------------------------------
Bugzilla Update: Perform
Bugzilla References: https://bugzilla.redhat.com/show_bug.cgi?id=1160635
> Increasing default internal thread pool size
> --------------------------------------------
>
> Key: ISPN-4828
> URL: https://issues.jboss.org/browse/ISPN-4828
> Project: Infinispan
> Issue Type: Enhancement
> Components: Configuration, Core
> Affects Versions: 7.0.0.CR1
> Reporter: Matej Čimbora
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 7.0.0.Final
>
>
> Using synchronous replication with a high number of concurrent clients doing put() operations over a shared set of keys, lock-acquisition timeouts occur when various thread pools (internal, JGroups OOB) are not sized appropriately.
> org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [3 seconds] on key [key_00000000000003B4] for requestor [Thread[OOB-66,default,node03-12795,5,main]]! Lock held by [Thread[OOB-314,default,node03-12795,5,main]]
> [org.infinispan.interceptors.InvocationContextInterceptor] (Stressor-1) ISPN000136: Execution error
> org.infinispan.util.concurrent.TimeoutException: org.infinispan.util.concurrent.TimeoutException: Node node04-24454 timed out
> This applies to both transactional and non-transactional configurations. The problem can be mitigated by increasing Infinispan's internal thread pool size (defined for remoteCommandsExecutor, blockingBoundedQueueThreadPool). To improve the user experience, one of the following should be done:
> a) When needed, the size of the thread pool should be increased as the load increases
> b) The default values should be high enough to handle even significant load (in terms of number of concurrent clients per node)
> c) The documentation should describe how the end user should size the thread pools based on expected load on the system
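Option (a) can be illustrated with plain `java.util.concurrent`. This is only a sketch of a pool that grows with load; it is not Infinispan's executor configuration, and `GrowingPoolDemo` and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of a thread pool that adds threads as load increases. */
class GrowingPoolDemo {
    /** Pool that starts with coreSize threads and grows under load, up to maxSize. */
    static ThreadPoolExecutor newGrowingPool(int coreSize, int maxSize) {
        // SynchronousQueue has no capacity, so a task that finds no idle
        // thread triggers creation of a new thread (until maxSize is reached).
        return new ThreadPoolExecutor(coreSize, maxSize, 60, TimeUnit.SECONDS,
                new SynchronousQueue<>());
    }

    /**
     * Submits `tasks` blocking tasks and reports how many threads exist.
     * `tasks` must not exceed maxSize, otherwise the default policy rejects the task.
     */
    static int poolSizeUnderLoad(int coreSize, int maxSize, int tasks) {
        ThreadPoolExecutor pool = newGrowingPool(coreSize, maxSize);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the thread busy, like a blocked command
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }
}
```

The trade-off is that once `maxSize` is reached further tasks are rejected rather than queued, so the upper bound still has to be chosen with the expected number of concurrent clients in mind.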
[JBoss JIRA] (ISPN-4828) Increasing default internal thread pool size
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4828?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4828:
-----------------------------------------------
Matej Čimbora <mcimbora(a)redhat.com> changed the Status of [bug 1160635|https://bugzilla.redhat.com/show_bug.cgi?id=1160635] from ON_QA to VERIFIED
[JBoss JIRA] (ISPN-4848) Offer way to implement includeCurrentState based on indexed query
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4848?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4848:
-----------------------------------------------
Matej Čimbora <mcimbora(a)redhat.com> changed the Status of [bug 1160635|https://bugzilla.redhat.com/show_bug.cgi?id=1160635] from ON_QA to VERIFIED
> Offer way to implement includeCurrentState based on indexed query
> -----------------------------------------------------------------
>
> Key: ISPN-4848
> URL: https://issues.jboss.org/browse/ISPN-4848
> Project: Infinispan
> Issue Type: Feature Request
> Reporter: Emmanuel Bernard
> Assignee: Mircea Markus
>
> Based on the infinispan-dev mailing list discussion from September 2014 titled
> 'Feedback and requests on clustered and remote listeners'.
> Loading the whole state of the data grid in order to filter and convert it can be costly, especially if data has been passivated.
> An alternative could be offered that would delegate the includeCurrentState filtering to a global indexed query. That query would need to be provided by the user. While the query is running, the matching change events should be queued up, and released once the query has been run and the converted events have been sent to the clustered listener.
> Maybe that should be only done for "continuous" queries as we could use the query for both the initial and continuous side of the query transparently.
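The queue-then-release behaviour described above could look roughly like the following. It is a sketch under assumptions: `BufferingListener`, `onChange`, and `completeInitialState` are hypothetical names, not part of the Infinispan listener API, and the query itself is represented only by its result list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: buffer change events while the initial-state query runs. */
class BufferingListener<E> {
    private final Consumer<E> delegate;
    private final List<E> pending = new ArrayList<>();
    private boolean initialStateDone = false;

    BufferingListener(Consumer<E> delegate) {
        this.delegate = delegate;
    }

    /** Called for every change event that matches the listener's filter. */
    synchronized void onChange(E event) {
        if (initialStateDone) {
            delegate.accept(event); // query finished: deliver immediately
        } else {
            pending.add(event);     // query still running: hold the event back
        }
    }

    /** Delivers the query results first, then releases the events queued meanwhile. */
    synchronized void completeInitialState(List<E> queryResults) {
        queryResults.forEach(delegate);
        pending.forEach(delegate);
        pending.clear();
        initialStateDone = true;
    }
}
```

This sketch does not deduplicate: an entry that changes while the query runs may be delivered both as a query result and as a queued event, which a real implementation would need to reconcile.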