[JBoss JIRA] (ISPN-4958) DefaultCacheManager startCaches do not restart stopped caches
by Mathieu Lachance (JIRA)
[ https://issues.jboss.org/browse/ISPN-4958?page=com.atlassian.jira.plugin.... ]
Mathieu Lachance updated ISPN-4958:
-----------------------------------
Description:
Using DefaultCacheManager#startCaches does not restart a previously stopped cache:
{code}
EmbeddedCacheManager cacheManager = new DefaultCacheManager();
cacheManager.startCaches("abc");
Cache<String, String> cache = cacheManager.getCache("abc");
cache.stop();
cacheManager.startCaches("abc");
cache = cacheManager.getCache("abc");
cache.get("def"); // throws IllegalStateException
{code}
{quote}
java.lang.IllegalStateException: Cache 'abc' is in 'TERMINATED' state and so it does not accept new invocations. Either restart it or recreate the cache container.
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:110)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:92)
at org.infinispan.commands.AbstractVisitor.visitGetKeyValueCommand(AbstractVisitor.java:104)
at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:58)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:343)
at org.infinispan.CacheImpl.get(CacheImpl.java:289)
at org.infinispan.CacheImpl.get(CacheImpl.java:281)
{quote}
I think the issue is in the thread that calls {{createCache(cacheName)}}. Looking at the 7.0.0.Final source code:
{code}
String threadName = "CacheStartThread," + globalConfiguration.transport().nodeName() + "," + cacheName;
Thread thread = new Thread(threadName) {
   @Override
   public void run() {
      try {
         createCache(cacheName);
      } catch (RuntimeException e) {
         exception.set(e);
      } catch (Throwable t) {
         exception.set(new RuntimeException(t));
      }
   }
};
{code}
I think we should instead do the following:
{code}
String threadName = "CacheStartThread," + globalConfiguration.transport().nodeName() + "," + cacheName;
Thread thread = new Thread(threadName) {
   @Override
   public void run() {
      try {
         Cache cache = getCache(cacheName, false);
         if (cache == null) {
            createCache(cacheName);
         } else if (!ComponentStatus.RUNNING.equals(cache.getStatus())) {
            cache.start();
         }
      } catch (RuntimeException e) {
         exception.set(e);
      } catch (Throwable t) {
         exception.set(new RuntimeException(t));
      }
   }
};
{code}
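For illustration, here is a self-contained sketch of the proposed "create or restart" behaviour, using a toy cache model in place of Infinispan's real classes (the names only mirror the real API; nothing here depends on Infinispan):

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for DefaultCacheManager, used only to illustrate the
// proposed branch in startCaches(): create unknown caches, restart
// stopped ones instead of leaving them TERMINATED.
class ToyCacheManager {
    enum Status { RUNNING, TERMINATED }

    static class ToyCache {
        Status status = Status.RUNNING;
        void stop()  { status = Status.TERMINATED; }
        void start() { status = Status.RUNNING; }
    }

    private final Map<String, ToyCache> caches = new HashMap<>();

    // Proposed behaviour: create the cache if unknown, restart it if stopped.
    void startCaches(String... names) {
        for (String name : names) {
            ToyCache cache = caches.get(name);
            if (cache == null) {
                caches.put(name, new ToyCache()); // stands in for createCache(cacheName)
            } else if (cache.status != Status.RUNNING) {
                cache.start();                    // restart a previously stopped cache
            }
        }
    }

    ToyCache getCache(String name) {
        return caches.get(name);
    }
}
```

With this model, stop() followed by startCaches() brings the cache back to RUNNING, which is the behaviour the report asks for.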
> DefaultCacheManager startCaches do not restart stopped caches
> -------------------------------------------------------------
>
> Key: ISPN-4958
> URL: https://issues.jboss.org/browse/ISPN-4958
> Project: Infinispan
> Issue Type: Feature Request
> Components: Core
> Affects Versions: 5.2.7.Final, 7.0.0.Final
> Reporter: Mathieu Lachance
>
--
This message was sent by Atlassian JIRA
(v6.3.8#6338)
[JBoss JIRA] (ISPN-4958) DefaultCacheManager startCaches do not restart stopped caches
by Mathieu Lachance (JIRA)
Mathieu Lachance created ISPN-4958:
--------------------------------------
Summary: DefaultCacheManager startCaches do not restart stopped caches
Key: ISPN-4958
URL: https://issues.jboss.org/browse/ISPN-4958
Project: Infinispan
Issue Type: Feature Request
Components: Core
Affects Versions: 7.0.0.Final, 5.2.7.Final
Reporter: Mathieu Lachance
[JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-4949:
-----------------------------------
[~belaban] You're right that it limits scalability - I'm not saying this has to be the default, but it is still the responsibility of JGroups (the transport layer) to manage group membership.
I don't think we can work around the need for some consensus on 'who's the coordinator' with the current Infinispan architecture - at best we could somehow limit the need for updates to 'who's in the view' and keep that info only on the coordinator.
Still, I think it could work well even with limited scalability. In production there would be tens of thousands of messages per second per node, so a few more shouldn't matter. And when the cluster's network goes wild, degraded performance is to be expected.
> Split brain: inconsistent data after merge
> ------------------------------------------
>
> Key: ISPN-4949
> URL: https://issues.jboss.org/browse/ISPN-4949
> Project: Infinispan
> Issue Type: Bug
> Components: State Transfer
> Affects Versions: 7.0.0.Final
> Reporter: Radim Vansa
> Priority: Critical
>
> 1) Cluster A, B, C, D splits into 2 parts:
> A, B (coord A) finds this out immediately and enters degraded mode with CH [A, B, C, D]
> C, D (coord D) first detects that B is lost, gets view A, C, D and starts a rebalance with CH [A, C, D]. Segment X is primary-owned by C (it had a backup on B but this got lost)
> 2) D detects that A was lost as well, and therefore enters degraded mode with CH [A, C, D]
> 3) C inserts an entry into X: all owners (only C) are present, therefore the modification is allowed
> 4) The cluster is merged and the coordinator finds out that the max stable topology has CH [A, B, C, D] (it is the older of the two partitions' topologies, received from A, B) - it logs 'No active or unavailable partitions, so all the partitions must be in degraded mode' (yes, all partitions are in degraded mode, but a write has happened in the meantime)
> 5) The old CH is broadcast in the newest topology, no rebalance happens
> 6) Inconsistency: a read in X may miss the update
[JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on ISPN-4949:
--------------------------------
Re consensus-based view installation: JGroups doesn't use consensus for view installation because it's simply faster to do without consensus, which involves round trips, with their latency, and a timeout. For large clusters this won't be scalable. Imagine you need to install a new view in 500 nodes. This would mean invoking ~500 RPCs, waiting for all responses (or a timeout), and then doing a second RPC committing the proposed view. Additional logic would be needed to handle a dangling prepare or commit, i.e. when the leader crashes after the PREPARE or COMMIT phase, plus possible vote collection when a new coord takes over.
I'll look into this, though, and created [1] as a result.
[1] JGRP-1901
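To make the cost argument concrete, here is a back-of-the-envelope sketch (plain Java, nothing JGroups-specific) counting the messages a two-phase PREPARE/COMMIT view installation would need, under the simplifying assumption that failure handling and vote collection are ignored:

```java
// Rough message count for a two-phase (PREPARE + COMMIT) view
// installation as described above: one PREPARE per member, one ack
// per member (or a timeout), then one COMMIT per member.
final class ViewInstallCost {
    static long messages(int members) {
        long prepare = members;  // coord -> each member: proposed view
        long acks    = members;  // each member -> coord: ack (or timeout)
        long commit  = members;  // coord -> each member: commit the view
        return prepare + acks + commit;
    }

    public static void main(String[] args) {
        // For the 500-node cluster from the comment: ~1500 messages,
        // with two phases gated on the slowest member (or a timeout).
        System.out.println(ViewInstallCost.messages(500));
    }
}
```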
[JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-4949:
-----------------------------------
One more note about ease of configuration: it would be great if the FD* + VERIFY_SUSPECT suite had an option to compute and provide guarantees about how soon it should report a node as non-responsive; any timeouts in upper layers (such as the timeout for acking the view, as I've outlined above) could then be computed automatically.
Though, this is rather a nice-to-have feature; for now we need split brain handling to work, and we can decide on sensible timeout defaults (and document them).
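As a sketch of the kind of guarantee meant here: with FD_ALL-style heartbeats followed by VERIFY_SUSPECT, the worst-case time to report a dead node is roughly the heartbeat timeout plus one check interval plus the verification timeout. The parameter names below mirror the JGroups properties, but the formula is an assumption about how the protocols compose, not something JGroups computes today:

```java
// Back-of-the-envelope upper bound on failure detection time for an
// FD_ALL + VERIFY_SUSPECT stack. Assumption: a node is suspected at
// most fdTimeout + fdInterval ms after its last heartbeat, and
// VERIFY_SUSPECT then double-checks for verifyTimeout ms.
final class SuspectTimeBound {
    static long worstCaseMillis(long fdTimeout, long fdInterval, long verifyTimeout) {
        return fdTimeout + fdInterval + verifyTimeout;
    }

    public static void main(String[] args) {
        // Values in the same ballpark as a stock stack: 40s heartbeat
        // timeout, 8s check interval, 2s verification.
        System.out.println(worstCaseMillis(40_000, 8_000, 2_000));
    }
}
```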
[JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-4949:
-----------------------------------
We cannot suddenly start forcing users into an odd numOwners when they want a consistent cluster. I believe that having each node in one and only one view is the way to go.
However, acking the update is not enough, IMO. If the network partitioning changes in rapid succession, or if it is not transitive, a node could still ack being in two views. We need a means of both registering with and deregistering from a view:
ABCD breaks into AB, CDB (not sure why you corrected my example above, I wanted B not to be the coord):
1) A broadcasts view ABC, all of them ack the new view to A
2) C broadcasts view CDB, C and D ack but B is already in a view of a different coord
3) B replies 'I am in another view, wait' and sends a 'leave view' request to A:
4a) A responds 'removing from view...', sends a new view A (or any residual members) and lets it be acked, and after that sends 'you were removed' to B - then B can proceed with acking the view to C
4b) The request to A times out and B can proceed with acking the view to C
Generally, any RPC should be responded to immediately in order to detect node responsiveness, but the caller should assume that the action itself can take a while.
I think it's JGroups' responsibility to implement any group membership algorithm (the current one is 'unreliable', so let's have a 'reliable' one as an alternative, required for split brain) - although you could do that in Infinispan, let's keep the layers separate. Infinispan does not implement RPC either, just 'because it's possible with the JGroups API'.
[~belaban] Comments?
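The steps above can be sketched as a toy model (plain Java, single process; the "RPCs" are just method calls, and the leave-view handshake is the hypothetical protocol proposed here, not anything JGroups ships):

```java
// Toy model of the proposed rule: a member acks a view only if it is
// not already registered in another coordinator's view; otherwise it
// must first complete a leave-view handshake with its current coord.
final class ViewMembership {
    static final class Member {
        final String name;
        String currentCoord;          // coordinator whose view we last acked
        Member(String name) { this.name = name; }

        // Returns true (ack) only when free to join the proposed view.
        boolean proposeView(String coord) {
            if (currentCoord == null || currentCoord.equals(coord)) {
                currentCoord = coord;
                return true;          // ack the new view
            }
            return false;             // "I am in another view, wait"
        }

        // Leave-view handshake with the old coordinator (steps 3-4 above);
        // in the real protocol this would be an RPC that can also time out.
        void leaveView() { currentCoord = null; }
    }
}
```

In the non-transitive split above, B acks A's view, refuses C's view until the leave-view handshake with A completes (or times out), and only then acks C - so B is never a member of two views at once.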