[
https://issues.jboss.org/browse/ISPN-5166?page=com.atlassian.jira.plugin....
]
Radim Vansa edited comment on ISPN-5166 at 1/21/15 3:30 AM:
------------------------------------------------------------
It seems edg-perf02 started a rebalance before shutting down, but edg-perf04 somehow
missed those topologies; the first topology it tries to install is:
{noformat}
07:47:48,186 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE,
sender=edg-perf02-45949, joinInfo=null, topologyId=87, rebalanceId=23,
currentCH=DefaultConsistentHash{ns = 512, owners = (4)[edg-perf03-24234: 128+128,
edg-perf01-21525: 128+128, edg-perf02-45949: 128+128, edg-perf04-55497: 128+128]},
pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234,
edg-perf01-21525, edg-perf02-45949, edg-perf04-55497], throwable=null, viewId=26}
[sender=edg-perf02-45949]
07:50:08,162 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE,
sender=edg-perf02-45949, joinInfo=null, topologyId=89, rebalanceId=23,
currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56,
edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=null,
availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525,
edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,158 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE,
sender=edg-perf02-45949, joinInfo=null, topologyId=88, rebalanceId=23,
currentCH=DefaultConsistentHash{ns = 512, owners = (4)[edg-perf03-24234: 128+128,
edg-perf01-21525: 128+128, edg-perf02-45949: 128+128, edg-perf04-55497: 128+128]},
pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234,
edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,175 CacheTopologyControlCommand{cache=testCache, type=REBALANCE_START,
sender=edg-perf02-45949, joinInfo=null, topologyId=90, rebalanceId=24,
currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56,
edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=DefaultConsistentHash{ns
= 512, owners = (3)[edg-perf03-24234: 171+170, edg-perf01-21525: 171+171,
edg-perf04-55497: 170+171]}, availabilityMode=null, actualMembers=[edg-perf03-24234,
edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=26} [sender=edg-perf02-45949]
07:50:08,263 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE,
sender=edg-perf04-55497, joinInfo=null, topologyId=89, rebalanceId=23,
currentCH=DefaultConsistentHash{ns = 512, owners = (4)[edg-perf03-24234: 128+128,
edg-perf01-21525: 128+128, edg-perf02-45949: 128+128, edg-perf04-55497: 128+128]},
pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234,
edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=27} [sender=edg-perf04-55497]
07:50:08,268 CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE,
sender=edg-perf04-55497, joinInfo=null, topologyId=90, rebalanceId=23,
currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56,
edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=null,
availabilityMode=AVAILABLE, actualMembers=[edg-perf03-24234, edg-perf01-21525,
edg-perf04-55497], throwable=null, viewId=27} [sender=edg-perf04-55497]
07:50:08,306 CacheTopologyControlCommand{cache=testCache, type=REBALANCE_START,
sender=edg-perf04-55497, joinInfo=null, topologyId=91, rebalanceId=24,
currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf03-24234: 200+56,
edg-perf01-21525: 153+103, edg-perf04-55497: 159+97]}, pendingCH=DefaultConsistentHash{ns
= 512, owners = (3)[edg-perf03-24234: 171+170, edg-perf01-21525: 171+171,
edg-perf04-55497: 170+171]}, availabilityMode=null, actualMembers=[edg-perf03-24234,
edg-perf01-21525, edg-perf04-55497], throwable=null, viewId=27} [sender=edg-perf04-55497]
{noformat}
I think the problem is that perf01 receives the REBALANCE_START command while it already
has a rebalance in progress. Normally that can only happen if the previous CH_UPDATE was
delayed, and that case is handled by executing a fake CH_UPDATE. But here the topology id
of the REBALANCE_START is only 1 higher than the current topology id, so the fake
CH_UPDATE is not executed and the old state transfer is not cleaned up properly.
In particular, {{StateConsumerImpl.stateTransferTopologyId}} stays 90, and
{{StateResponseCommand}}s sent for topology 90 are not ignored as they should be.
Radim, could you also attach the
ClusterTopologyManagerImpl|ClusterCacheStatus|PreferConsistencyStrategy logs from perf04
around 07:50:08?
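The suspected gap can be sketched as a toy model (hypothetical names and logic, not
actual Infinispan code): a REBALANCE_START whose topology id is exactly current+1 skips
the fake CH_UPDATE, so the previous transfer's topology id is never reset.
{code}
// Toy model of the suspected gap (hypothetical, not actual Infinispan code).
final class StateConsumerModel {
    static final int NO_STATE_TRANSFER = -1;

    int currentTopologyId;
    int stateTransferTopologyId = NO_STATE_TRANSFER;

    StateConsumerModel(int currentTopologyId) {
        this.currentTopologyId = currentTopologyId;
    }

    void onRebalanceStart(int newTopologyId) {
        if (newTopologyId > currentTopologyId + 1) {
            // Missed one or more CH_UPDATEs: a fake CH_UPDATE is applied,
            // which also cancels the in-progress transfer.
            stateTransferTopologyId = NO_STATE_TRANSFER;
        }
        // If the old transfer was not cancelled, its topology id survives
        // here; only a fresh transfer records the new id.
        if (stateTransferTopologyId == NO_STATE_TRANSFER) {
            stateTransferTopologyId = newTopologyId;
        }
        currentTopologyId = newTopologyId;
    }

    // StateResponseCommands for any other topology should be ignored.
    boolean shouldIgnoreStateResponse(int responseTopologyId) {
        return responseTopologyId != stateTransferTopologyId;
    }
}
{code}
With topology 89 installed, a REBALANCE_START for 90 starts a transfer; a second
REBALANCE_START for 91 then arrives, 91 == 90 + 1 skips the reset, and
stateTransferTopologyId stays 90, so responses for topology 90 are still accepted.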
During ST, previously removed entry was revived
-----------------------------------------------
Key: ISPN-5166
URL:
https://issues.jboss.org/browse/ISPN-5166
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 7.1.0.Beta1
Reporter: Radim Vansa
Priority: Critical
Attachments: topo.txt
In transactional mode, an entry was removed (the transaction succeeded and the entry was
removed from the data container), but later, when a StateResponseCommand arrives, the
entry is revived with the old value.
Here is how the remove is executed on edg-perf01:
{code}
07:50:08,276 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(DefaultStressor-5) Replication task sending PrepareCommand
{modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null,
valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false,
gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=89} to single recipient edg-perf03-24234 with response mode GET_ALL
07:50:08,368 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-5)
Response(s) to PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E,
value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false,
gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=89} is {edg-perf03-24234=null}
07:50:08,369 TRACE [org.infinispan.statetransfer.StateTransferManagerImpl]
(DefaultStressor-5) Forwarding command PrepareCommand
{modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null,
valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false,
gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=90} to new targets [edg-perf03-24234]
07:50:08,369 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-5)
edg-perf01-21525 invoking PrepareCommand
{modifications=[RemoveCommand{key=key_000000000000050E, value=null, flags=null,
valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false,
gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=90} to recipient list [edg-perf03-24234] with options RpcOptions{timeout=60000,
unit=MILLISECONDS, fifoOrder=false, totalOrder=false, responseFilter=null,
responseMode=SYNCHRONOUS, skipReplicationQueue=false}
07:50:08,527 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-5)
Response(s) to PrepareCommand {modifications=[RemoveCommand{key=key_000000000000050E,
value=null, flags=null, valueMatcher=MATCH_ALWAYS}], onePhaseCommit=false,
gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=90} is {edg-perf03-24234=null}
07:50:08,529 TRACE [org.infinispan.transaction.LocalTransaction] (DefaultStressor-5)
getCommitNodes recipients=[edg-perf01-21525, edg-perf03-24234], currentTopologyId=90,
members=[edg-perf03-24234, edg-perf01-21525, edg-perf04-55497], txTopologyId=88
07:50:08,531 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(DefaultStressor-5) dests=[edg-perf01-21525, edg-perf03-24234], command=CommitCommand
{gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=90}, mode=SYNCHRONOUS_IGNORE_LEAVERS, timeout=60000
07:50:08,531 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(DefaultStressor-5) Replication task sending CommitCommand
{gtx=GlobalTransaction:<edg-perf01-21525>:95726:local, cacheName='testCache',
topologyId=90} to addresses [edg-perf03-24234] with response mode GET_ALL
07:50:08,555 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(DefaultStressor-5) Responses: [sender=edg-perf03-24234,
retval=SuccessfulResponse{responseValue=null} , received=true, suspected=false]
07:50:08,555 TRACE [org.infinispan.interceptors.EntryWrappingInterceptor]
(DefaultStressor-5) About to commit entry
ReadCommittedEntry(49e1fe5d){key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459,
467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ],
oldValue=null, isCreated=false, isChanged=true, isRemoved=true, isValid=false,
skipRemoteGet=false, metadata=EmbeddedMetadata{version=null}}
07:50:08,561 TRACE [org.infinispan.container.entries.ReadCommittedEntry]
(DefaultStressor-5) Updating entry (key=key_000000000000050E removed=true valid=false
changed=true created=false loaded=false value=[5 #19: 380, 431, 453, 459, 467, 650, 851,
1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ]
metadata=EmbeddedMetadata{version=null}, providedMetadata=null)
07:50:08,561 TRACE [org.infinispan.statetransfer.CommitManager] (DefaultStressor-5)
Committed key=key_000000000000050E. Old discard policy=null. New discard
policy=DiscardPolicy{discardStateTransfer=true, discardXSiteStateTransfer=false}
{code}
The commit command on edg-perf03 is executed only after the StateResponseCommand is sent
(and received on edg-perf01):
{code}
07:50:10,736 TRACE [org.infinispan.interceptors.InvocationContextInterceptor]
(remote-thread-254) Invoked with command PutKeyValueCommand{key=key_000000000000050E,
value=[5 #19: 380, 431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685,
2950, 3702, 3936, 4188, 4973, ], flags=[CACHE_MODE_LOCAL, SKIP_REMOTE_LOOKUP,
PUT_FOR_STATE_TRANSFER, SKIP_SHARED_CACHE_STORE, SKIP_OWNERSHIP_CHECK,
IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP], putIfAbsent=false, valueMatcher=MATCH_ALWAYS,
metadata=EmbeddedMetadata{version=null}, successful=true} and InvocationContext
[org.infinispan.context.impl.LocalTxInvocationContext@3e54512]
07:50:10,747 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter]
(remote-thread-254) start called on tx
GlobalTransaction:<edg-perf01-21525>:95972:local
07:50:10,758 TRACE [org.infinispan.transaction.LocalTransaction] (remote-thread-254)
Adding modification PutKeyValueCommand{key=key_000000000000050E, value=[5 #19: 380, 431,
453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188,
4973, ], flags=[CACHE_MODE_LOCAL, SKIP_REMOTE_LOOKUP, PUT_FOR_STATE_TRANSFER,
SKIP_SHARED_CACHE_STORE, SKIP_OWNERSHIP_CHECK, IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP],
putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null},
successful=true}. Mod list is null
07:50:10,763 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter]
(remote-thread-254) end called on tx
GlobalTransaction:<edg-perf01-21525>:95972:local(testCache)
07:50:10,763 TRACE [org.infinispan.interceptors.InvocationContextInterceptor]
(remote-thread-254) Invoked with command PrepareCommand {modifications=null,
onePhaseCommit=false, gtx=GlobalTransaction:<edg-perf01-21525>:95972:local,
cacheName='testCache', topologyId=-1} and InvocationContext
[org.infinispan.context.impl.LocalTxInvocationContext@3df73149]
07:50:10,779 TRACE [org.infinispan.interceptors.EntryWrappingInterceptor]
(remote-thread-254) About to commit entry
ReadCommittedEntry(206acaec){key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459,
467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ],
oldValue=null, isCreated=true, isChanged=true, isRemoved=false, isValid=true,
skipRemoteGet=false, metadata=EmbeddedMetadata{version=null}}
07:50:10,779 TRACE [org.infinispan.statetransfer.CommitManager] (remote-thread-254)
Trying to commit. Key=key_000000000000050E. Operation Flag=PUT_FOR_STATE_TRANSFER, L1
invalidation=false
07:50:10,782 TRACE [org.infinispan.container.DefaultDataContainer] (remote-thread-254)
Store ImmortalCacheEntry{key=key_000000000000050E, value=[5 #19: 380, 431, 453, 459, 467,
650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936, 4188, 4973, ]} in
container
{code}
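The log above shows a DiscardPolicy with {{discardStateTransfer=true}} being recorded at
commit time, yet the later PUT_FOR_STATE_TRANSFER still stores the old value. A minimal
sketch (hypothetical names, loosely modeled on the CommitManager idea) of what that
tracking is supposed to do, and how clearing it prematurely lets the stale write through:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not actual Infinispan code.
final class CommitTrackerModel {
    // keys committed by transactions while a state transfer is running
    private final Map<String, Boolean> discardStateTransfer = new ConcurrentHashMap<>();
    private volatile boolean trackingEnabled = true;

    // called when a transaction commits a key during state transfer
    void markCommitted(String key) {
        if (trackingEnabled) {
            discardStateTransfer.put(key, Boolean.TRUE);
        }
    }

    // called for each PUT_FOR_STATE_TRANSFER; true means "store it"
    boolean allowStateTransferPut(String key) {
        return !discardStateTransfer.getOrDefault(key, Boolean.FALSE);
    }

    // cleanup when the transfer ends; if this runs before a stale
    // StateResponseCommand arrives, the removed value can be revived
    void stopTracking() {
        trackingEnabled = false;
        discardStateTransfer.clear();
    }
}
{code}
In this model, the put for key_000000000000050E is rejected while the discard entry
exists and accepted once tracking is cleared, which matches the revive seen here.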
Later, I can see that the value was really written to the container (a subsequent read
retrieves it):
{code}
07:53:26,778 TRACE [org.infinispan.container.EntryFactoryImpl] (DefaultStressor-5)
Retrieved from container ImmortalCacheEntry{key=key_000000000000050E, value=[5 #19: 380,
431, 453, 459, 467, 650, 851, 1209, 1405, 1893, 2080, 2353, 2353, 2685, 2950, 3702, 3936,
4188, 4973, ]} (forceFetch=false, isLocal=true)
{code}