[JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-3878:
------------------------------------
I think the cancel command can't be sent asynchronously, because we want to know that nobody is sending state by the time the new rebalance starts. (The cancelling of the transfer tasks should happen during the handling of the CH_UPDATE that's sent by the new coordinator, not during the REBALANCE_START that follows.)
On the other hand, perhaps we don't need the CANCEL_STATE_TRANSFER commands at all, and we could just cancel all outbound transfer tasks when we install a new cache topology without a pending CH in StateProviderImpl.
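For illustration, a minimal sketch of that second idea against a much-simplified model of the state provider (OutboundTask, addTransfer, onTopologyUpdate and the other names below are hypothetical, not the actual StateProviderImpl API):
{code}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the idea above: instead of relying on explicit
// CANCEL_STATE_TRANSFER commands from the receivers, the state provider drops
// all of its outbound transfer tasks as soon as it installs a cache topology
// that has no pending CH (i.e. the previous rebalance is over or superseded).
public class OutboundTransferCancellation {

   static final class OutboundTask {
      final int topologyId;
      volatile boolean cancelled;
      OutboundTask(int topologyId) { this.topologyId = topologyId; }
      void cancel() { cancelled = true; }
   }

   private final Map<String, OutboundTask> tasksByDestination = new ConcurrentHashMap<>();

   void addTransfer(String destination, int topologyId) {
      tasksByDestination.put(destination, new OutboundTask(topologyId));
   }

   // Called whenever a new cache topology is installed.
   void onTopologyUpdate(int newTopologyId, boolean hasPendingCh) {
      if (hasPendingCh) {
         return; // a rebalance is (still) in progress, keep sending state
      }
      // No pending CH: nobody should be receiving state for older topologies.
      for (Iterator<OutboundTask> it = tasksByDestination.values().iterator(); it.hasNext(); ) {
         OutboundTask task = it.next();
         if (task.topologyId < newTopologyId) {
            task.cancel();
            it.remove();
         }
      }
   }
}
{code}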
[JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration updated ISPN-3878:
------------------------------------------
Bugzilla Update: Perform
Bugzilla References: https://bugzilla.redhat.com/show_bug.cgi?id=1049846
[JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-3878:
-----------------------------------
Could the CANCEL command be sent asynchronously?
[JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.... ]
Radim Vansa edited comment on ISPN-3878 at 1/8/14 5:57 AM:
-----------------------------------------------------------
Could the StateRequestCommand.Type.CANCEL_STATE_TRANSFER command be sent asynchronously? Is it just an optimization, or is it required?
was (Author: rvansa):
Could the CANCEL command be sent asynchronously?
[JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock
by Radim Vansa (JIRA)
Radim Vansa created ISPN-3878:
---------------------------------
Summary: Unhandled failing ST cancel leads to deadlock
Key: ISPN-3878
URL: https://issues.jboss.org/browse/ISPN-3878
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 6.0.1.Final
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
Two concurrent rebalances can lead to a deadlock. An example of a situation where two rebalances run in parallel is when the coordinator is leaving the cluster: it sends REBALANCE_START and then dies, and the new coordinator recovers the cluster status and sends REBALANCE_START as well.
1. A node requests segments for the old topology; StateConsumerImpl.isTransferThreadRunning is set to true
2. The node waits for a StateResponseCommand in StateConsumerImpl (SCI): InboundTransferTask.awaitCompletion()
3. A new rebalance starts and changes the CH; the requested segment is not in the new CH
4. Some state transfers are cancelled; the cancel command is sent and takes a long time
5. A StateResponseCommand is received, but SCI.applyState finds that the segment is no longer owned, so the task is neither completed nor cancelled
6. Later, InboundTransferTask.sendCancelCommand throws a TimeoutException and no further cancellations are executed (see the sketch below)
Result: the inbound transfer thread is stuck and the rebalance never completes.
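Step 6 is where the cancellation pass breaks down: a single failing cancel RPC aborts the remaining cancellations and leaves awaitCompletion() blocked. Below is a minimal sketch of a more defensive loop, assuming hypothetical InboundTransfer/terminateLocally names rather than the real InboundTransferTask API:
{code}
import java.util.List;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: cancel every obsolete inbound transfer even if some
// individual cancel RPCs time out. Types and method names are illustrative.
public class CancelAllTransfers {

   interface InboundTransfer {
      void sendCancelCommand() throws TimeoutException; // remote call, may block and time out
      void terminateLocally();                          // local only, always safe
   }

   void cancelObsoleteTransfers(List<InboundTransfer> obsolete) {
      for (InboundTransfer transfer : obsolete) {
         try {
            transfer.sendCancelCommand();
         } catch (TimeoutException e) {
            // Don't let one slow or failed cancel block the rest; log and move on.
            System.err.println("Cancel command failed, terminating transfer locally: " + e);
         } finally {
            // Unblock awaitCompletion() regardless of whether the remote cancel
            // succeeded, so the inbound transfer thread can finish.
            transfer.terminateLocally();
         }
      }
   }
}
{code}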
[JBoss JIRA] (ISPN-3829) Null value read with RR can be replaced by cache loader value
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-3829?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-3829:
-----------------------------------------------
Vojtech Juranek <vjuranek@redhat.com> changed the Status of [bug 1045579|https://bugzilla.redhat.com/show_bug.cgi?id=1045579] from ON_QA to VERIFIED
> Null value read with RR can be replaced by cache loader value
> -------------------------------------------------------------
>
> Key: ISPN-3829
> URL: https://issues.jboss.org/browse/ISPN-3829
> Project: Infinispan
> Issue Type: Bug
> Components: Loaders and Stores
> Affects Versions: 6.0.0.Final
> Reporter: William Burns
> Assignee: William Burns
> Labels: 620
> Fix For: 7.0.0.Final
>
>
> Currently the CacheLoaderInterceptor uses the following check to determine whether it should consult the loader for a value:
> {code}
> if (e == null || e.isNull() || e.getValue() == null) {
> {code}
> Unfortunately, this means it consults the loader even when a null value is already present in the context entry under repeatable read (RR). This can cause an issue if another transaction commits that key with a value that ends up in the loader: the null value read earlier is then replaced by the loader value.
> This is also a performance issue for RR, since the loader is checked over and over for a given key even if it was found to be null the first time.
> The initial thought is to do something like setSkipRemoteGet, which could possibly serve a dual purpose.
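As an illustration of that flag-based idea, here is a minimal sketch against a much-simplified entry and store model (ContextEntry, skipLookup and Store are hypothetical names, not the real CacheEntry/CacheLoaderInterceptor API):
{code}
// Hypothetical sketch: once an entry has been looked up in a transaction
// (even if the result was null) under repeatable read, mark it so that later
// reads in the same transaction skip the loader.
public class SkipLoaderExample {

   static final class ContextEntry {
      Object value;
      boolean skipLookup; // set after the first load attempt in this transaction
   }

   interface Store {
      Object load(Object key);
   }

   Object get(Object key, ContextEntry e, Store store) {
      if (e.value == null && !e.skipLookup) {
         e.value = store.load(key);   // first miss: consult the loader once
         e.skipLookup = true;         // remember the outcome for this transaction
      }
      return e.value;                 // repeatable: later reads see the same value
   }
}
{code}
The point of the flag is that, under repeatable read, the first lookup (even a miss) fixes what the transaction sees, so subsequent reads must not consult the loader again.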
[JBoss JIRA] (ISPN-3737) L1 requestor registered after value read
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-3737?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-3737:
-----------------------------------------------
Radim Vansa <rvansa@redhat.com> changed the Status of [bug 1032545|https://bugzilla.redhat.com/show_bug.cgi?id=1032545] from ON_QA to VERIFIED
> L1 requestor registered after value read
> ----------------------------------------
>
> Key: ISPN-3737
> URL: https://issues.jboss.org/browse/ISPN-3737
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Cache
> Affects Versions: 6.0.0.Final
> Reporter: Radim Vansa
> Assignee: William Burns
> Priority: Critical
> Labels: 620
> Fix For: 6.0.1.Final, 7.0.0.Alpha1, 7.0.0.Final
>
>
> Because the L1 requestor is registered only after the value is retrieved from the data container, a (transactional) update of the value may not invalidate the entry after the write, leaving the cache inconsistent (a sketch of the reversed registration order follows the interleaving below).
> Consider this interleaving of operations (R = get request from another node, C = committing transaction):
> R: read value -> old value
> C: update old -> new
> C: notify requestors for key
> R: add requestor for key
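A minimal sketch of the reversed order, against a much-simplified requestor map and data container (handleRemoteGet, commitWrite and the other names are hypothetical, not the real L1Manager API):
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the fix direction: register the remote node as an L1
// requestor *before* reading the value, so a concurrent commit's invalidation
// cannot slip in between the read and the registration.
public class L1RegistrationOrder {

   private final Map<Object, Object> dataContainer = new ConcurrentHashMap<>();
   private final Map<Object, Set<String>> requestors = new ConcurrentHashMap<>();

   Object handleRemoteGet(Object key, String requestingNode) {
      // 1. Register first: any commit from now on will see this node in the
      //    requestor set and send it an invalidation.
      requestors.computeIfAbsent(key, k -> ConcurrentHashMap.<String>newKeySet())
                .add(requestingNode);
      // 2. Only then read the value to return.
      return dataContainer.get(key);
   }

   void commitWrite(Object key, Object newValue) {
      dataContainer.put(key, newValue);
      Set<String> toInvalidate = requestors.remove(key);
      if (toInvalidate != null) {
         toInvalidate.forEach(node -> System.out.println("invalidate " + key + " on " + node));
      }
   }
}
{code}
With this ordering the worst case is a spurious invalidation sent to a node that does not end up caching the value, which is harmless compared to a missed one.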
[JBoss JIRA] (ISPN-3738) Entry version gets lost during topology change -> NPE
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-3738?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-3738:
-----------------------------------------------
Radim Vansa <rvansa@redhat.com> changed the Status of [bug 1032693|https://bugzilla.redhat.com/show_bug.cgi?id=1032693] from ON_QA to VERIFIED
> Entry version gets lost during topology change -> NPE
> -----------------------------------------------------
>
> Key: ISPN-3738
> URL: https://issues.jboss.org/browse/ISPN-3738
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Cache
> Affects Versions: 6.0.0.Final
> Reporter: Radim Vansa
> Assignee: Pedro Ruivo
> Priority: Critical
> Labels: 620
> Fix For: 6.0.1.Final, 7.0.0.Alpha1, 7.0.0.Final
>
>
> Replicated TX cache with write skew check (WSC); A and B are in the cluster, C is joining
> 0. The current CH already contains A and B as owners, C is joining (is not primary owner of anything yet). B is primary owner of K=V.
> 1. A sends PrepareCommand to B and C with put(K, V) (V is null on all nodes)
> 2. C receives PrepareCommand and responds with no versions (it is not primary owner)
> 3. topology changes on B - primary ownership of K is transfered to C
> 4. B receives PrepareCommand, responds without K's version (it is not primary)
> 5. B forwards the Prepare to C as it sees that the command has lower topology ID
> 6. C responds to B's prepare with version of K
> 7. K version is *not* added to B's response, B responds to A
> 8. A finds out that topology has changed, forwards prepare to C
> 9. C responds to A's forwarded prepare with the version of K
> 10. A receives C's response, but the versions are not added to the transaction (see the sketch below)
> 11. A sends out CommitCommand missing version of K
> 12. all nodes record K=V without version as usual ImmortalCacheEntry
> 13. the next time we try to increment the version of K=V, we fail with an NPE in SimpleClusteredVersionGenerator (actually while it tries to throw an IllegalArgumentException, because the null version is an unexpected version class)
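Steps 7 and 10 are where the version is dropped: the responses to the forwarded prepares are never merged back into the transaction, so the CommitCommand goes out without K's version. Here is a minimal sketch of what that merge step could look like, with hypothetical Tx/PrepareResponse types rather than the real Infinispan classes:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the missing step in 7/10: when a prepare is forwarded
// because of a topology change, the versions returned by the new primary owner
// must be merged into the transaction before the commit is sent; otherwise the
// entry is committed without a version and later version increments fail.
public class MergeForwardedVersions {

   static final class Tx {
      final Map<Object, Long> updatedVersions = new HashMap<>();
   }

   // Response to a (possibly forwarded) PrepareCommand.
   static final class PrepareResponse {
      final Map<Object, Long> versions = new HashMap<>();
   }

   void onPrepareResponse(Tx tx, PrepareResponse response) {
      // Merge instead of ignoring: versions seen by the new primary owner
      // must survive into the CommitCommand.
      tx.updatedVersions.putAll(response.versions);
   }
}
{code}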