[JBoss JIRA] (ISPN-5309) Model data consistency for get()/put() operations in Infinispan
by Richard Achmatowicz (JIRA)
[ https://issues.jboss.org/browse/ISPN-5309?page=com.atlassian.jira.plugin.... ]
Richard Achmatowicz commented on ISPN-5309:
-------------------------------------------
Dan, Bela and I started discussions on this a few days ago, believing it's a good time to get started. This issue will collect comments and discussion.
> Model data consistency for get()/put() operations in Infinispan
> ---------------------------------------------------------------
>
> Key: ISPN-5309
> URL: https://issues.jboss.org/browse/ISPN-5309
> Project: Infinispan
> Issue Type: Task
> Reporter: Richard Achmatowicz
> Assignee: Richard Achmatowicz
>
> This will be the first in a series of modelling/validation exercises for some of the more critical Infinispan protocols.
> We shall use TLA+ / PlusCal / TLA+ Tools to do the following:
> - model the design of processing of get()/put() operations in an Infinispan cluster
> - model client interactions with that cluster
> - describe the data consistency requirements of get()/put() operations
> - verify that the data consistency semantics of Infinispan are preserved in the face of concurrent client interactions
> TLA+ / PlusCal can be thought of as a pseudo-code language with well-defined semantics, testable using the tools in the TLA+ Toolkit.
> The benefits of such an exercise are that we end up with:
> - a specification of the data consistency guarantees that Infinispan provides
> - a semantically precise pseudo-code description of the design of get()/put() processing
> - a verification that the protocol design is correct
> We start here with the simple case of modelling data consistency in the absence of failures. In later exercises, we aim to tackle rebalancing and non-blocking state transfer in the face of membership changes and partitions.
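To make the idea concrete, here is a toy sketch in Python (not the actual TLA+ specification, and far simpler than what TLC does) of the kind of exhaustive interleaving check such a model enables: every schedule of two concurrent put()s and a get() against a single replicated key is explored, and a simple consistency property is asserted for each.

```python
from itertools import permutations

def run(schedule):
    """Apply a schedule of (client, op, value) actions to one register."""
    register = None
    reads = []
    for client, op, value in schedule:
        if op == "put":
            register = value
        else:  # get
            reads.append(register)
    return register, reads

# Two clients write, one reads; explore every interleaving, the way a
# model checker explores every reachable state.
actions = [("c1", "put", "v1"), ("c2", "put", "v2"), ("c1", "get", None)]
written = {"v1", "v2"}

for schedule in permutations(actions):
    final, reads = run(schedule)
    # Property: any value ever read was actually written (or nothing yet).
    assert all(r is None or r in written for r in reads)
    assert final in written
```

A real TLA+ model would express the same property as an invariant and let TLC explore the state space, including message loss and node failures, which this toy omits.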
--
This message was sent by Atlassian JIRA
(v6.3.11#6341)
[JBoss JIRA] (ISPN-5309) Model data consistency for get()/put() operations in Infinispan
by Richard Achmatowicz (JIRA)
Richard Achmatowicz created ISPN-5309:
-----------------------------------------
Summary: Model data consistency for get()/put() operations in Infinispan
Key: ISPN-5309
URL: https://issues.jboss.org/browse/ISPN-5309
Project: Infinispan
Issue Type: Task
Reporter: Richard Achmatowicz
Assignee: Richard Achmatowicz
This will be the first in a series of modelling/validation exercises for some of the more critical Infinispan protocols.
We shall use TLA+ / PlusCal / TLA+ Tools to do the following:
- model the design of processing of get()/put() operations in an Infinispan cluster
- model client interactions with that cluster
- describe the data consistency requirements of get()/put() operations
- verify that the data consistency semantics of Infinispan are preserved in the face of concurrent client interactions
TLA+ / PlusCal can be thought of as a pseudo-code language with well-defined semantics, testable using the tools in the TLA+ Toolkit.
The benefits of such an exercise are that we end up with:
- a specification of the data consistency guarantees that Infinispan provides
- a semantically precise pseudo-code description of the design of get()/put() processing
- a verification that the protocol design is correct
We start here with the simple case of modelling data consistency in the absence of failures. In later exercises, we aim to tackle rebalancing and non-blocking state transfer in the face of membership changes and partitions.
[JBoss JIRA] (ISPN-5261) Transaction is recommitted by joining node during merge
by Matej Čimbora (JIRA)
[ https://issues.jboss.org/browse/ISPN-5261?page=com.atlassian.jira.plugin.... ]
Matej Čimbora commented on ISPN-5261:
-------------------------------------
I came across a variation of the original problem described by this JIRA. The impact is slightly different - the transaction is committed on the TX originator (A) and one of the participating nodes (B), but it gets rolled back on another node (C). When get(K) is invoked on A, both B and C send a response, but each of them sends a different 'version' of the entry, depending on the success of the transaction on the particular node.
Scenario - 4 nodes - edg-perf01-04
1. Transaction is started on edg-perf01, K is updated within transaction
{code}
13:55:14,641 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter] (DefaultStressor-1) start called on tx GlobalTransaction:<edg-perf01-60360>:181323:local
{code}
2. Transaction is successfully prepared on edg-perf02 & edg-perf03
3. edg-perf01 invokes a commit command, which is sent to 02 & 03, check messageId & seqno
{code}
13:55:36,061 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (DefaultStressor-1) real_dests=[edg-perf02-25521, edg-perf03-30958]
13:55:36,063 TRACE [org.jgroups.protocols.TCP] (DefaultStressor-1) edg-perf01-60360: sending msg to null, src=edg-perf01-60360, headers are RequestCorrelator: id=200, type=REQ, id=977458, rsp_expected=true, exclusion_list=[edg-perf01-60360], NAKACK2: [MSG, seqno=82297], 20903: slaveIndex=0, TCP: [cluster_name=default]
{code}
4. On edg-perf02, the message is received & delivered. Commit is performed & entry is stored in DataContainer
{code}
13:55:36,086 TRACE [org.jgroups.protocols.pbcast.NAKACK2] (OOB-243,edg-perf02-25521) edg-perf02-25521: received edg-perf01-60360#82297
13:55:36,086 TRACE [org.jgroups.protocols.pbcast.NAKACK2] (OOB-243,edg-perf02-25521) edg-perf02-25521: delivering edg-perf01-60360#82297
...
13:55:36,148 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-86) About to send back response SuccessfulResponse{responseValue=null} for command CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:remote, cacheName='testCache', topologyId=228}
{code}
5. As edg-perf03 received a new view, which does not contain the TX originator, the transaction gets rolled back & removed from the transaction table
{code}
13:55:36,018 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (OOB-2643,edg-perf03-30958) New view accepted: [edg-perf03-30958|70] (1) [edg-perf03-30958]
...
13:55:36,774 DEBUG [org.infinispan.transaction.TransactionTable] (transport-thread-24) Rolling back transaction GlobalTransaction:<edg-perf01-60360>:181323:remote because originator edg-perf01-60360 left the cluster
{code}
edg-perf03 then receives a new view, requests retransmission & edg-perf01 sends the commit once again
{code}
13:55:38,719 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,edg-perf03-30958) New view accepted: MergeView::[edg-perf02-25521|71] (4) [edg-perf02-25521, edg-perf03-30958, edg-perf01-60360, edg-perf04-55740], 2 subgroups: [edg-perf02-25521|69] (3) [edg-perf02-25521, edg-perf03-30958, edg-perf01-60360], [edg-perf04-55740|68] (1) [edg-perf04-55740]
...
13:55:40,798 TRACE [org.jgroups.protocols.pbcast.NAKACK2] (Timer-3,edg-perf03-30958) edg-perf03-30958: sending XMIT_REQ ((7): {82296-82302}) to edg-perf01-60360
13:55:40,811 TRACE [org.jgroups.protocols.TCP] (Timer-3,edg-perf03-30958) edg-perf03-30958: sending msg to edg-perf01-60360, src=edg-perf03-30958, headers are NAKACK2: [XMIT_REQ, sender=edg-perf01-60360], 20903: slaveIndex=2, TCP: [cluster_name=default]
13:55:41,055 TRACE [org.jgroups.protocols.TCP] (INT-428,edg-perf03-30958) edg-perf03-30958: received [dst: edg-perf03-30958, src: edg-perf01-60360 (4 headers), size=64 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER|INTERNAL], headers are RequestCorrelator: id=200, type=REQ, id=977458, rsp_expected=true, exclusion_list=[edg-perf01-60360], NAKACK2: [XMIT_RSP, seqno=82297], 20903: slaveIndex=0, TCP: [cluster_name=default]
13:55:41,081 TRACE [org.jgroups.protocols.pbcast.NAKACK2] (INT-428,edg-perf03-30958) edg-perf03-30958: received edg-perf01-60360#82297
13:55:41,081 TRACE [org.jgroups.protocols.pbcast.NAKACK2] (INT-428,edg-perf03-30958) edg-perf03-30958: delivering edg-perf01-60360#82297
{code}
The transaction is no longer present in the TX table, yet a successful response is sent to edg-perf01
{code}
13:55:41,177 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (INT-428,edg-perf03-30958) Attempting to execute command: CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:local, cacheName='testCache', topologyId=228} [sender=edg-perf01-60360]
13:55:41,197 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-978) Calling perform() on CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:remote, cacheName='testCache', topologyId=228}
13:55:41,197 TRACE [org.infinispan.commands.tx.AbstractTransactionBoundaryCommand] (remote-thread-978) Did not find a RemoteTransaction for GlobalTransaction:<edg-perf01-60360>:181323:remote
13:55:41,215 TRACE [org.jgroups.blocks.RequestCorrelator] (remote-thread-978) sending rsp for 977458 to edg-perf01-60360
13:55:41,215 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-978) About to send back response SuccessfulResponse{responseValue=null} for command CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:remote, cacheName='testCache', topologyId=228}
{code}
6. edg-perf01 receives successful responses for commit command from both edg-perf02 & edg-perf03
{code}
13:55:41,230 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-1) Response(s) to CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:local, cacheName='testCache', topologyId=228} is {edg-perf03-30958=SuccessfulResponse{responseValue=null} , edg-perf02-25521=SuccessfulResponse{responseValue=null} }
{code}
7. The transaction is committed again in the new topology, without affecting K any further.
{code}
13:55:41,252 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-1) edg-perf01-60360 invoking CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:local, cacheName='testCache', topologyId=233} to recipient list [edg-perf02-25521, edg-perf03-30958, edg-perf04-55740] with options RpcOptions{timeout=60000, unit=MILLISECONDS, fifoOrder=false, totalOrder=false, responseFilter=null, responseMode=SYNCHRONOUS, skipReplicationQueue=false}
...
13:55:54,118 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-1) Response(s) to CommitCommand {gtx=GlobalTransaction:<edg-perf01-60360>:181323:local, cacheName='testCache', topologyId=233} is {edg-perf03-30958=SuccessfulResponse{responseValue=null} , edg-perf04-55740=SuccessfulResponse{responseValue=null} , edg-perf02-25521=SuccessfulResponse{responseValue=null} }
{code}
8. Later, when get(K) is invoked on edg-perf01, both 02 & 03 send their responses. However, each of them holds a different version of the entry, leading to an unexpected return value when the message from 03 is received first.
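The divergence in steps 4-6 can be sketched with a small Python model (class and method names here are illustrative, not Infinispan internals): a commit that finds no transaction in the table is still acknowledged as successful, so the originator cannot distinguish C's rollback from a real commit.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}       # committed entries
        self.tx_table = {}   # prepared transactions

    def prepare(self, tx_id, key, value):
        self.tx_table[tx_id] = (key, value)

    def rollback(self, tx_id):
        self.tx_table.pop(tx_id, None)

    def commit(self, tx_id):
        entry = self.tx_table.pop(tx_id, None)
        if entry is None:
            # Mirrors "Did not find a RemoteTransaction": still a success.
            return "SuccessfulResponse"
        key, value = entry
        self.data[key] = value
        return "SuccessfulResponse"

b, c = Node("edg-perf02"), Node("edg-perf03")
for node in (b, c):
    node.prepare("tx181323", "K", "v2")

c.rollback("tx181323")  # view change: originator considered gone
responses = {n.name: n.commit("tx181323") for n in (b, c)}

assert set(responses.values()) == {"SuccessfulResponse"}
assert b.data.get("K") == "v2" and c.data.get("K") is None  # divergence
```

Both nodes report success to the originator, yet only B applied the write, which is exactly the split observed in step 8.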
> Transaction is recommitted by joining node during merge
> ------------------------------------------------------
>
> Key: ISPN-5261
> URL: https://issues.jboss.org/browse/ISPN-5261
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.1.1.Final
> Reporter: Matej Čimbora
> Assignee: Dan Berindei
>
> Transaction gets recommitted after a node joins a cluster during a merge and requests retransmission of previous messages, overwriting updates which happened up to that point.
> 4 nodes - edg-perf01-04
> Scenario:
> 1. edg-perf01 begins a new transaction & updates value of K
> {code}
> 06:50:50,421 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter] (DefaultStressor-7) start called on tx GlobalTransaction:<edg-perf01-19349>:3516:local
> {code}
> The latest view on edg-perf01 is
> {code}
> 06:50:35,103 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf01-19349) ISPN000093: Received new, MERGED cluster view: MergeView::[edg-perf01-19349|5] (3) [edg-perf01-19349, edg-perf04-20753, edg-perf02-19191], 1 subgroups: [edg-perf01-19349|4] (2) [edg-perf01-19349, edg-perf02-19191]
> {code}
> 2. Prepare command is sent with nodes edg-perf02 & edg-perf04 responding successfully
> {code}
> 06:50:51,186 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-7) edg-perf01-19349 invoking PrepareCommand ... topologyId=12} to recipient list [edg-perf02-19191, edg-perf04-20753, edg-perf01-19349] ...
> {code}
> check message id & seqno
> {code}
> (06:50:51,195 TRACE [org.jgroups.protocols.TCP] (DefaultStressor-7) edg-perf01-19349: sending msg to null, src=edg-perf01-19349, headers are RequestCorrelator: id=200, type=REQ, id=17134, rsp_expected=true, exclusion_list=[edg-perf01-19349], NAKACK2: [MSG, seqno=1020], 20903: slaveIndex=0, TCP: [cluster_name=default])
> {code}
> 3. Transaction gets committed & the value of K gets updated multiple times on edg-perf01 from this point on
> 4. edg-perf03 receives a new view containing all nodes
> {code}
> 06:50:53,963 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf03-14171) ISPN000093: Received new, MERGED cluster view: MergeView::[edg-perf01-19349|6] (4) [edg-perf01-19349, edg-perf03-14171, edg-perf04-20753, edg-perf02-19191], 3 subgroups: [edg-perf03-14171|4] (2) [edg-perf03-14171, edg-perf04-20753], [edg-perf01-19349|5] (3) [edg-perf01-19349, edg-perf04-20753, edg-perf02-19191], [edg-perf01-19349|4] (2) [edg-perf01-19349, edg-perf02-19191]
> {code}
> 5. edg-perf03 apparently requests retransmission of previous messages (containing prepare from step#2) & edg-perf01 sends the response
> Request
> {code}
> 06:51:35,241 TRACE [org.jgroups.protocols.pbcast.NAKACK2] (Timer-3,edg-perf03-14171) edg-perf03-14171: sending XMIT_REQ ((79): {1015-1093}) to edg-perf01-19349
> {code}
> Response
> {code}
> 06:51:35,299 TRACE [org.jgroups.protocols.TCP] (INT-14,edg-perf03-14171) edg-perf03-14171: received [dst: edg-perf03-14171, src: edg-perf01-19349 (4 headers), size=975 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER|INTERNAL], headers are RequestCorrelator: id=200, type=REQ, id=17134, rsp_expected=true, exclusion_list=[edg-perf01-19349], NAKACK2: [XMIT_RSP, seqno=1020], 20903: slaveIndex=0, TCP: [cluster_name=default]
> {code}
> 6. edg-perf03 prepares the old transaction & eventually commits it again
> {code}
> 06:51:35,304 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (INT-14,edg-perf03-14171) Attempting to execute command: PrepareCommand.....gtx=GlobalTransaction:<edg-perf01-19349>:3516:local, cacheName='testCache', topologyId=12} [sender=edg-perf01-19349]
> 06:51:35,304 TRACE [org.infinispan.statetransfer.StateTransferLockImpl] (INT-14,edg-perf03-14171) Checking if transaction data was received for topology 12, current topology is 14
> 06:51:35,305 TRACE [org.infinispan.transaction.TransactionTable] (remote-thread-40) Created and registered remote transaction ... tx=GlobalTransaction:<edg-perf01-19349>:3516:remote, ...
> 06:51:35,308 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (remote-thread-40) edg-perf03-14171 invoking PrepareCommand ... gtx=GlobalTransaction:<edg-perf01-19349>:3516:remote, cacheName='testCache', topologyId=15} to recipient list [edg-perf04-20753, edg-perf02-19191] ...
> {code}
> Successful commit responses
> {code}
> 06:51:36,750 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (remote-thread-44) Response(s) to CommitCommand {gtx=GlobalTransaction:<edg-perf01-19349>:3516:remote, cacheName='testCache', topologyId=15} is {edg-perf02-19191=SuccessfulResponse{responseValue=null} , edg-perf04-20753=SuccessfulResponse{responseValue=null} }
> {code}
> 7. When get(K) is invoked on edg-perf01, stale value is returned (the one from step #6, ignoring updates which have happened since step #3)
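A minimal sketch, under simplified assumptions, of the replay in steps 5-7 (plain Python, not Infinispan/JGroups code): a retransmitted message carrying the old prepare is re-delivered after newer updates have been applied, resurrecting the stale value.

```python
log = []           # retained messages with sequence numbers, as NAKACK2 keeps them
data = {"K": "v1"}

def deliver(msg):
    """Apply a delivered message; prepare+commit collapsed into one step."""
    op, key, value = msg
    if op == "prepare_and_commit":
        data[key] = value

# seqno 1020: the original prepare for K=v2 is delivered and committed
log.append((1020, ("prepare_and_commit", "K", "v2")))
deliver(log[-1][1])

# newer updates happen after the transaction committed (step 3)
data["K"] = "v9"

# the merging node requests retransmission of seqno 1020 and replays it (steps 5-6)
seqno, msg = log[0]
deliver(msg)

assert data["K"] == "v2"   # stale value resurrected, as seen in step 7
```

The essential point is that retransmission is correct at the JGroups level but must be made idempotent at the transaction level, e.g. by refusing to re-prepare a transaction that has already completed.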
[JBoss JIRA] (ISPN-5046) PartitionHandling: split during commit can leave the cache inconsistent after merge
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-5046?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-5046:
-----------------------------------------------
Divya Mehra <dmehra(a)redhat.com> changed the Status of [bug 1176750|https://bugzilla.redhat.com/show_bug.cgi?id=1176750] from ON_QA to CLOSED
> PartitionHandling: split during commit can leave the cache inconsistent after merge
> -----------------------------------------------------------------------------------
>
> Key: ISPN-5046
> URL: https://issues.jboss.org/browse/ISPN-5046
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer
> Affects Versions: 7.0.2.Final, 7.1.0.Alpha1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 7.2.0.Beta2
>
>
> Say we have a cluster ABCD; a transaction T was started on A, with B as the primary owner and C the backup owner. B and C both acknowledge the prepare, and the network splits into AB and CD right before A sends the commit command. Eventually A suspects C and D, but the commit still succeeds on B before C and D are suspected; since SuspectExceptions are ignored for commit commands, the user won't see any error.
> However, C will eventually suspect A and B. When the CD cache topology is installed, it will roll back transaction T. After the merge, both partitions are in degraded mode, so we assume that they both have the latest data and the key is never updated on C.
> From C's point of view, this is very similar to ISPN-3421. The fix should also be similar: we could delay the transaction rollback on C until we get a confirmation from B that T was not committed there. Since B is inaccessible, C will eventually get a SuspectException and the CD cache topology, at which point the cache is in degraded mode and it can wait for a merge. On merge, it should check the status of the transaction on B again, and either commit or roll back based on what B did.
> We also need to suspend the cleanup of completed transactions while the cache is in degraded mode, otherwise C might not find T on B after the merge.
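A hedged sketch of the proposed fix (function and state names here are hypothetical, not Infinispan APIs): C keeps the transaction PREPARED through the split instead of rolling it back, and on merge adopts whatever outcome the primary owner B actually reached.

```python
def resolve_after_merge(local_state, primary_outcome):
    """Decide the fate of a prepared transaction once the partitions merge."""
    if local_state != "PREPARED":
        return local_state                     # already decided locally
    # Copy the decision the primary owner (B) reached during the split.
    return "COMMITTED" if primary_outcome == "COMMITTED" else "ROLLED_BACK"

# During the split, C parks T as PREPARED instead of rolling it back;
# after the merge it learns that B committed and follows suit.
assert resolve_after_merge("PREPARED", "COMMITTED") == "COMMITTED"
assert resolve_after_merge("PREPARED", "ROLLED_BACK") == "ROLLED_BACK"
assert resolve_after_merge("COMMITTED", "ROLLED_BACK") == "COMMITTED"
```

This also shows why completed-transaction cleanup must be suspended in degraded mode: if B forgets T before the merge, C has nothing to query and cannot resolve the PREPARED state.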
[JBoss JIRA] (ISPN-5307) Cache#remove(key) not behaving correctly under high load - while primary owner node is stopped
by Carsten Lohmann (JIRA)
[ https://issues.jboss.org/browse/ISPN-5307?page=com.atlassian.jira.plugin.... ]
Carsten Lohmann updated ISPN-5307:
----------------------------------
Description:
In our application we had the problem that under high load a {{Cache#remove(key)}} invocation returned the old value _although_ the same key was already removed on another node a few milliseconds before.
This occurred while another node was being shut down - that node happened to be the coordinator and also the primary owner of the key.
The cache is a distributed, DB-persisted, transactional (locking mode Pessimistic) cache with numOwners=1.
I could reproduce such an issue with the attached unit test (tested on ISPN 7.1).
Scenario:
- MultipleCacheManagersTest with 3 nodes
- Cache is distributed, persisted, transactional (locking mode Pessimistic) with numOwners 1 or 2 (see below; {{getTestCacheNumOwners()}} in the test)
- a number of test keys get inserted; only keys with the coordinator as primary owner get used for the test
- while the keys get removed one-by-one and concurrently by 2 threads...
- ... the coordinator node gets stopped
At the end, the return values of the {{remove()}} invocations get checked.
{{remove()}} invocations that threw an error (SuspectException) are ignored.
*The following problems occurred on some test runs*:
# the {{remove(key)}} invocations on both nodes returned {{null}} and the entry was still there in the end
#* => I would have expected an exception here as {{remove(key)}} essentially failed
# the {{remove(key)}} invocations on both nodes returned {{null}} but the entry was actually removed
#* => wrong return value of first invocation
# the {{remove(key)}} invocations on both nodes both returned the old value
#* => wrong return value of second invocation
These issues do not occur on every run, but quite often with the test.
When testing with {{numOwners=2}}, I only got the last of the above issues.
Attached are log files from tests with {{numOwners=1}}
("org.infinispan.transaction" log level set to TRACE):
- ~40% of test runs produced:
issue 3 ("2 remove() invocations returned the old value for key 'k4'")
- occurred rarely:
issue 2 ("remove() did actually remove the key 'k4', but the return value was null")
- didn't occur with "org.infinispan.transaction" log level set to TRACE; but otherwise quite often:
issue 1 ("remove() of key 'k4' did not succeed, but we got no exception (entry still there; value: 'v')")
was:
In our application we had the problem that under high load a {{Cache#remove(key)}} invocation returned the old value _although_ the same key was already removed on another node a few milliseconds before.
This occurred while another node was being shut down - that node happened to be the coordinator and also the primary owner of the key.
The cache is a distributed, DB-persisted, transactional (locking mode Pessimistic) cache with numOwners=1.
I could reproduce such an issue with the attached unit test (tested on ISPN 7.1).
Scenario:
- MultipleCacheManagersTest with 3 nodes
- Cache is distributed, persisted, transactional (locking mode Pessimistic) with numOwners 1 or 2 (see below; {{getTestCacheNumOwners()}} in the test)
- a number of test keys get inserted; only keys with the coordinator as primary owner get used for the test
- while the keys get removed one-by-one and concurrently by 2 threads...
- ... the coordinator node gets stopped
At the end, the return values of the {{remove()}} invocations get checked.
{{remove()}} invocations that threw an error (SuspectException) are ignored.
*The following problems occurred on some test runs*:
# the {{remove(key)}} invocations on both nodes returned {{null}} and the entry was still there in the end
#* => I would have expected an exception here as {{remove(key)}} essentially failed
# the {{remove(key)}} invocations on both nodes returned {{null}} but the entry was actually removed
#* => wrong return value of first invocation
# the {{remove(key)}} invocations on both nodes both returned the old value
#* => wrong return value of second invocation
These issues do not occur on every run, but quite often with the test.
When testing with {{numOwners=2}}, I only got the last of the above issues - in the test with {{numOwners=1}}, that one occurred only rarely.
Attached are log files from tests with {{numOwners=1}}
("org.infinispan.transaction" log level set to TRACE):
- ~40% of test runs produced:
issue 3 ("2 remove() invocations returned the old value for key 'k4'")
- occurred rarely:
issue 2 ("remove() did actually remove the key 'k4', but the return value was null")
- didn't occur with "org.infinispan.transaction" log level set to TRACE; but otherwise quite often:
issue 1 ("remove() of key 'k4' did not succeed, but we got no exception (entry still there; value: 'v')")
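A toy sketch (not Infinispan internals; names are illustrative) of one way a retried remove() can report the wrong previous value: if the only owner dies after applying the remove but before replying, the retry runs against the new topology, where the entry was never present, and returns null.

```python
class Store(dict):
    """A node's data container, modeled as a plain dict."""
    pass

def remove_with_retry(old_owner, new_owner, key, crash_after_apply):
    try:
        previous = old_owner.pop(key, None)
        if crash_after_apply:
            raise ConnectionError("owner stopped before replying")
        return previous
    except ConnectionError:
        # Retry on the new topology: the entry (and its old value) are gone.
        return new_owner.pop(key, None)

old, new = Store({"k4": "v"}), Store()
result = remove_with_retry(old, new, "k4", crash_after_apply=True)

# The caller sees None even though its own remove() deleted the entry,
# matching issue 2 above ("remove() did actually remove the key 'k4',
# but the return value was null").
assert result is None
assert "k4" not in old
```

With {{numOwners=2}} a backup owner still holds the value, which is consistent with the report that only the duplicate-old-value issue remained in that configuration.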
> Cache#remove(key) not behaving correctly under high load - while primary owner node is stopped
> ----------------------------------------------------------------------------------------------
>
> Key: ISPN-5307
> URL: https://issues.jboss.org/browse/ISPN-5307
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.1.1.Final
> Reporter: Carsten Lohmann
> Attachments: RemoveEntryWhilePrimaryOwnerStopsTest.java, RemoveEntryWhileSingleOwnerStopsTest_2RemoveOpsReturnedNonNullValue.log, RemoveEntryWhileSingleOwnerStopsTest_RemoveFailButNoException.log, RemoveEntryWhileSingleOwnerStopsTest_RemoveOkButReturnedNull.log
>
>
> In our application we had the problem that under high load a {{Cache#remove(key)}} invocation returned the old value _although_ the same key was already removed on another node a few milliseconds before.
> This occurred while another node was being shut down - that node happened to be the coordinator and also the primary owner of the key.
> The cache is a distributed, DB-persisted, transactional (locking mode Pessimistic) cache with numOwners=1.
>
> I could reproduce such an issue with the attached unit test (tested on ISPN 7.1).
> Scenario:
> - MultipleCacheManagersTest with 3 nodes
> - Cache is distributed, persisted, transactional (locking mode Pessimistic) with numOwners 1 or 2 (see below; {{getTestCacheNumOwners()}} in the test)
> - a number of test keys get inserted; only keys with the coordinator as primary owner get used for the test
> - while the keys get removed one-by-one and concurrently by 2 threads...
> - ... the coordinator node gets stopped
>
> At the end, the return values of the {{remove()}} invocations get checked.
> {{remove()}} invocations that threw an error (SuspectException) are ignored.
>
> *The following problems occurred on some test runs*:
> # the {{remove(key)}} invocations on both nodes returned {{null}} and the entry was still there in the end
> #* => I would have expected an exception here as {{remove(key)}} essentially failed
> # the {{remove(key)}} invocations on both nodes returned {{null}} but the entry was actually removed
> #* => wrong return value of first invocation
> # the {{remove(key)}} invocations on both nodes both returned the old value
> #* => wrong return value of second invocation
>
> These issues do not occur on every run, but quite often with the test.
> When testing with {{numOwners=2}}, I only got the last of the above issues.
> Attached are log files from tests with {{numOwners=1}}
> ("org.infinispan.transaction" log level set to TRACE):
> - ~40% of test runs produced:
> issue 3 ("2 remove() invocations returned the old value for key 'k4'")
> - occurred rarely:
> issue 2 ("remove() did actually remove the key 'k4', but the return value was null")
> - didn't occur with "org.infinispan.transaction" log level set to TRACE; but otherwise quite often:
> issue 1 ("remove() of key 'k4' did not succeed, but we got no exception (entry still there; value: 'v')")