[infinispan-issues] [JBoss JIRA] (ISPN-5166) During ST, previously removed entry was revived
Radim Vansa (JIRA)
issues at jboss.org
Wed Jan 21 03:30:49 EST 2015
[ https://issues.jboss.org/browse/ISPN-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033953#comment-13033953 ]
Radim Vansa edited comment on ISPN-5166 at 1/21/15 3:30 AM:
------------------------------------------------------------
It seems edg-perf02 started a rebalance before shutting down, but edg-perf04 somehow missed those topologies; the first topology it tries to install is:
{noformat}
07:47:48,186 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE, sender=edg-perf02-45949, joinInfo=null, topologyId=87, rebalanceId=23, currentCH=DefaultConsistentHash{ns = 512, owners = (4)[edg-perf03-24234: 128+128, edg-perf01-21525: 128+128, edg-perf02-45949: 128+128, edg-perf04-55497: 128+128]}, pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf02-45949, edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,162 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE, sender=edg-perf02-45949, joinInfo=null, topologyId=89, rebalanceId=23, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56, edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,158 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE, sender=edg-perf02-45949, joinInfo=null, topologyId=88, rebalanceId=23, currentCH=DefaultConsistentHash{ns = 512, owners = (4)[edg-perf03-24234: 128+128, edg-perf01-21525: 128+128, edg-perf02-45949: 128+128, edg-perf04-55497: 128+128]}, pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,175 CacheTopologyControlCommand{cache=testCache, type=REBALANCE_START, sender=edg-perf02-45949, joinInfo=null, topologyId=90, rebalanceId=24, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56, edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 171+170, edg-perf01-21525: 171+171, edg-perf04-55497: 170+171]}, availabilityMode=null, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,263 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE, sender=edg-perf04-55497, joinInfo=null, topologyId=89, rebalanceId=23, currentCH=DefaultConsistentHash{ns = 512, owners = (4)[edg-perf03-24234: 128+128, edg-perf01-21525: 128+128, edg-perf02-45949: 128+128, edg-perf04-55497: 128+128]}, pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=27} [sender=edg-perf04-55497]
07:50:08,268 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE, sender=edg-perf04-55497, joinInfo=null, topologyId=90, rebalanceId=23, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56, edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=27} [sender=edg-perf04-55497]
07:50:08,306 CacheTopologyControlCommand{cache=testCache, type=REBALANCE_START, sender=edg-perf04-55497, joinInfo=null, topologyId=91, rebalanceId=24, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56, edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 171+170, edg-perf01-21525: 171+171, edg-perf04-55497: 170+171]}, availabilityMode=null, actualMembers=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=27} [sender=edg-perf04-55497]
{noformat}
I think the problem is that perf01 receives the REBALANCE_START command while it already has a rebalance in progress. Normally this can only happen if the previous CH_UPDATE was delayed, and this is fixed by executing a fake CH_UPDATE. But in this case the topology id of the REBALANCE_START is just 1 higher than the current topology id, and the fake CH_UPDATE is not executed, so the old state transfer is not cleaned up properly.
In particular, {{StateConsumerImpl.stateTransferTopologyId}} stays 90, and {{StateResponseCommands}} sent for topology 90 are not ignored as they should be.
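To illustrate the suspected gap, here is a minimal sketch (class and method names are hypothetical, not the actual Infinispan source): the receiver only simulates the missing CH_UPDATE when the incoming topology id jumps by more than one, so a REBALANCE_START that is exactly current + 1 slips through while the previous state transfer is still tracked:

```java
// Hypothetical sketch of the suspected guard, not the real Infinispan code.
public class TopologyGuard {
    private final int currentTopologyId = 89;      // last topology installed on perf01

    /** Returns true when a fake CH_UPDATE must be applied before the rebalance. */
    boolean needsFakeChUpdate(int incomingTopologyId) {
        // Suspected bug: a gap of exactly 1 is treated as "no topology missed",
        // so the stale state transfer (topology 90) is never cleaned up.
        return incomingTopologyId > currentTopologyId + 1;
    }

    public static void main(String[] args) {
        TopologyGuard g = new TopologyGuard();
        // REBALANCE_START arrives with topologyId 90 = current + 1:
        System.out.println(g.needsFakeChUpdate(90)); // false -> old ST not cancelled
        System.out.println(g.needsFakeChUpdate(92)); // true  -> fake CH_UPDATE runs
    }
}
```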
Radim, could you also attach the ClusterTopologyManagerImpl|ClusterCacheStatus|PreferConsistencyStrategy logs from perf04 around 07:50:08?
was (Author: dan.berindei):
[previous revision of the comment elided; it differed only in attributing the rebalance to perf04 instead of edg-perf02]
> During ST, previously removed entry was revived
> -----------------------------------------------
>
> Key: ISPN-5166
> URL: https://issues.jboss.org/browse/ISPN-5166
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.1.0.Beta1
> Reporter: Radim Vansa
> Priority: Critical
> Attachments: topo.txt
>
>
> In transactional mode, an entry was removed (the transaction succeeded and the entry was removed from the data container), but later, when a StateResponseCommand arrives, the entry is revived with the old value.
> Here is how the remove is executed on edg-perf01:
> {code}
> 07:50:08,276 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (DefaultStressor-5) Replication task sending PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=89} to single recipient edg-perf03-24234 with response mode GET_ALL
> 07:50:08,368 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-5) Response(s) to PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=89} is {edg-perf03-24234=null}
> 07:50:08,369 TRACE [org.infinispan.statetransfer.StateTransferManagerImpl] (DefaultStressor-5) Forwarding command PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=90} to new targets [edg-perf03-24234]
> 07:50:08,369 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-5) edg-perf01-21525 invoking PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=90} to recipient list [edg-perf03-24234] with options RpcOptions{timeout=60000, unit=MILLISECONDS, fifoOrder=false, totalOrder=false, responseFilter=null, responseMode=SYNCHRONOUS, skipReplicationQueue=false}
> 07:50:08,527 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-5) Response(s) to PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=90} is {edg-perf03-24234=null}
> 07:50:08,529 TRACE [org.infinispan.transaction.LocalTransaction] (DefaultStressor-5) getCommitNodes recipients=[edg-perf01-21525, edg-perf03-24234], currentTopologyId=90, members=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], txTopologyId=88
> 07:50:08,531 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (DefaultStressor-5) dests=[edg-perf01-21525, edg-perf03-24234], command=CommitCommand {gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=90}, mode=SYNCHRONOUS_IGNORE_LEAVERS, timeout=60000
> 07:50:08,531 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (DefaultStressor-5) Replication task sending CommitCommand {gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache', topologyId=90} to addresses [edg-perf03-24234] with response mode GET_ALL
> 07:50:08,555 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (DefaultStressor-5) Responses: [sender=edg-perf03-24234, retval=SuccessfulResponse{responseValue=null} , received=true, suspected=false]
> 07:50:08,555 TRACE [org.infinispan.interceptors.EntryWrappingInterceptor] (DefaultStressor-5) About to commit entry ReadCommittedEntry(49e1fe5d){key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ], oldValue=null, isCreated=false, isChanged=true, isRemoved=true, isValid=false, skipRemoteGet=false, metadata=EmbeddedMetadata{version=null}}
> 07:50:08,561 TRACE [org.infinispan.container.entries.ReadCommittedEntry] (DefaultStressor-5) Updating entry (key=key_000000000000050E removed=true valid=false changed=true created=false loaded=false value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ] metadata=EmbeddedMetadata{version=null}, providedMetadata=null)
> 07:50:08,561 TRACE [org.infinispan.statetransfer.CommitManager] (DefaultStressor-5) Committed key=key_000000000000050E. Old discard policy=null. New discard policy=DiscardPolicy{discardStateTransfer=true, discardXSiteStateTransfer=false}
> {code}
> The commit command on edg-perf03 is executed only after the StateResponseCommand is sent (and received on edg-perf01):
> {code}
> 07:50:10,736 TRACE [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-254) Invoked with command PutKeyValueCommand{key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ], flags=[CACHE_MODE_LOCAL, SKIP_REMOTE_LOOKUP, PUT_FOR_STATE_TRANSFER, SKIP_SHARED_CACHE_STORE, SKIP_OWNERSHIP_CHECK, IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP], putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true} and InvocationContext [org.infinispan.context.impl.LocalTxInvocationContext at 3e54512]
> 07:50:10,747 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter] (remote-thread-254) start called on tx GlobalTransaction:<edg-perf01-21525>:95972:local
> 07:50:10,758 TRACE [org.infinispan.transaction.LocalTransaction] (remote-thread-254) Adding modification PutKeyValueCommand{key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ], flags=[CACHE_MODE_LOCAL, SKIP_REMOTE_LOOKUP, PUT_FOR_STATE_TRANSFER, SKIP_SHARED_CACHE_STORE, SKIP_OWNERSHIP_CHECK, IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP], putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}. Mod list is null
> 07:50:10,763 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter] (remote-thread-254) end called on tx GlobalTransaction:<edg-perf01-21525>:95972:local(testCache)
> 07:50:10,763 TRACE [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-254) Invoked with command PrepareCommand {modifications=null, onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95972:local, cacheName='testCache', topologyId=-1} and InvocationContext [org.infinispan.context.impl.LocalTxInvocationContext at 3df73149]
> 07:50:10,779 TRACE [org.infinispan.interceptors.EntryWrappingInterceptor] (remote-thread-254) About to commit entry ReadCommittedEntry(206acaec){key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ], oldValue=null, isCreated=true, isChanged=true, isRemoved=false, isValid=true, skipRemoteGet=false, metadata=EmbeddedMetadata{version=null}}
> 07:50:10,779 TRACE [org.infinispan.statetransfer.CommitManager] (remote-thread-254) Trying to commit. Key=key_000000000000050E. Operation Flag=PUT_FOR_STATE_TRANSFER, L1 invalidation=false
> 07:50:10,782 TRACE [org.infinispan.container.DefaultDataContainer] (remote-thread-254) Store ImmortalCacheEntry{key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ]} in container
> {code}
> Later, I can see that the value was really written to the container (a later read retrieves it):
> {code}
> 07:53:26,778 TRACE [org.infinispan.container.EntryFactoryImpl] (DefaultStressor-5) Retrieved from container ImmortalCacheEntry{key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ]} (forceFetch=false, isLocal=true)
> {code}
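The revival shown in the logs above can be reduced to an ordering problem. The following is a hypothetical model of per-key discard tracking (inspired by the CommitManager discard policy in the log at 07:50:08,561, but not the real class): the remove is committed while state transfer is tracked, so a discard policy is recorded for the key; if that tracking is cleared before the late StateResponseCommand is applied, the old value is written back:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of per-key state-transfer discard tracking, for illustration only.
public class DiscardModel {
    private final Map<String, Boolean> discardStateTransfer = new HashMap<>();
    private boolean trackingEnabled = true;

    void commitRemove(String key) {
        if (trackingEnabled) {
            // Remember that a newer write already touched this key.
            discardStateTransfer.put(key, true);
        }
    }

    void endStateTransfer() {
        // The suspected flaw: this cleanup runs before the late StateResponseCommand.
        trackingEnabled = false;
        discardStateTransfer.clear();
    }

    /** Returns true if a PUT_FOR_STATE_TRANSFER for this key must be dropped. */
    boolean shouldDiscardStatePut(String key) {
        return discardStateTransfer.getOrDefault(key, false);
    }

    public static void main(String[] args) {
        DiscardModel m = new DiscardModel();
        m.commitRemove("key_000000000000050E");
        System.out.println(m.shouldDiscardStatePut("key_000000000000050E")); // true: put dropped
        m.endStateTransfer();
        System.out.println(m.shouldDiscardStatePut("key_000000000000050E")); // false: removed entry revived
    }
}
```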
--
This message was sent by Atlassian JIRA
(v6.3.11#6341)