[
https://jira.jboss.org/jira/browse/ISPN-425?page=com.atlassian.jira.plugi...
]
Galder Zamarreno commented on ISPN-425:
---------------------------------------
I think I'm pretty close to the root cause of this. First of all, forget about my last
comment, I was looking at the wrong thing.
To recapitulate, the sequence of events leading to this is:
1. start 2 caches C1 and C2.
2. put a key K such that K maps on to C1 and C2
3. add a new node, C3. K should now map to C1 and C3.
4. Modify the value on C3 *before* rehashing completes.
5. See if we see the stale value on C2.
The root cause appears to be in that the put in 4. can happen before any state from C1/C2
has been applied to this. This means that the put is executed first and then state is
applied. So, if the put had v2 and state had, which contained previous value was v1, then
v1 would remain in C3 and v2 would be lost.
In the original log, you see:
2010-04-30 18:02:19,898 6037 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(HotRodServerWorker-10-1:) eq-7969 broadcasting call
PutKeyValueCommand{key=CacheKey{data=ByteArray{size=39, hashCode=172cbae, array=[107, 45,
116, 101, 115, 116, 68, 105, 115, 116, ..]}}, value=CacheValue{data=ByteArray{size=40,
array=[118, 54, 45, 116, 101, 115, 116, 68, 105, 115, ..]}, version=562962838323752},
putIfAbsent=false, lifespanMillis=-1000, maxIdleTimeMillis=-1000} to recipient list
[eq-7969, eq-61332]
...
2010-04-30 18:02:19,901 6040 TRACE
[org.infinispan.interceptors.InvocationContextInterceptor] (Rehasher-eq-7969:) Invoked
with command PutKeyValueCommand{key=CacheKey{data=ByteArray{size=39, hashCode=1b2c86e,
array=[107, 45, 116, 101, 115, 116, 68, 105, 115, 116, ..]}},
value=CacheValue{data=ByteArray{size=40, array=[118, 53, 45, 116, 101, 115, 116, 68, 105,
115, ..]}, version=281483566645751}, putIfAbsent=false, lifespanMillis=-1,
maxIdleTimeMillis=-1} and InvocationContext
[NonTxInvocationContext{flags=[CACHE_MODE_LOCAL, SKIP_REMOTE_LOOKUP]}]
At 6037, "118, 54, 45, 116..." value is put but this is later overriden by
"118, 53, 45, 116," as a result of applying state.
Now, there're two invalidations that should occur in eq-13415. As mentioned above, we
have the invalidation related to the the put call. As indicated before, eq-13415 does not
invalidate cos it considers the key to be local. Afterwards, there's a another
invalidation that happens when rehashing finishes as part of
RehashTask.invalidateInvalidHolders. These results in the following call:
2010-04-30 18:02:19,924 6063 TRACE [org.infinispan.interceptors.CallInterceptor]
(OOB-2,Infinispan-Cluster,eq-13415:) Executing command:
InvalidateL1Command{keys=[CacheKey{data=ByteArray{size=39, hashCode=c27dc4, array=[107,
45, 116, 101, 115, 116, 68, 105, 115, 116, ..]}}]}.
...
2010-04-30 18:02:19,924 6063 TRACE [org.infinispan.container.entries.ReadCommittedEntry]
(OOB-2,Infinispan-Cluster,eq-13415:) Updating entry (key=CacheKey{data=ByteArray{size=39,
hashCode=c27dc4, array=[107, 45, 116, 101, 115, 116, 68, 105, 115, 116, ..]}}
removed=false valid=true changed=true created=false
value=CacheValue{data=ByteArray{size=40, array=[118, 53, 45, 116, 101, 115, 116, 68, 105,
115, ..]}, version=281483566645751}]
The update entry message is basically saying that I'm not removing this entry but
instead moving it to L1. This is because isL1OnRehash is true, which is the default
setting.
I've managed to replicate the issue on a unit test and even though the final result is
the same it's worth noting the following difference in the log:
In the unit test, I had isL1OnRehash set to false, so the end result is that invalidation
as a result of rehashing removes the entry. So, when the get key command is executed, it
doesn't found the entry locally and hence calls a clustered get command (in the
previous case, this would have resolved locally), which still returns v1:
2010-05-10 09:40:58,323 4869 TRACE [org.infinispan.interceptors.CallInterceptor]
(OOB-1,Infinispan-Cluster,eq-12996:) Executing command:
InvalidateL1Command{keys=[org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c]}.
...
2010-05-10 09:40:58,323 4869 TRACE [org.infinispan.container.entries.ReadCommittedEntry]
(OOB-1,Infinispan-Cluster,eq-12996:) Updating entry
(key=org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c removed=true
valid=false changed=true created=false value=v2]
...
2010-05-10 09:40:59,310 5856 TRACE [org.infinispan.interceptors.CallInterceptor] (main:)
Executing command:
GetKeyValueCommand{key=org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c}.
2010-05-10 09:40:59,310 5856 TRACE [org.infinispan.commands.read.GetKeyValueCommand]
(main:) Entry not found
...
2010-05-10 09:40:59,311 5857 TRACE
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main:) dests=[eq-22371,
eq-1418],
command=ClusteredGetCommand{key=org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c,
dataContainer=null}, mode=SYNCHRONOUS, timeout=60000
...
2010-05-10 09:40:59,313 5859 TRACE [org.infinispan.interceptors.CallInterceptor]
(OOB-3,Infinispan-Cluster,eq-22371:) Executing command:
GetKeyValueCommand{key=org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c}.
2010-05-10 09:40:59,313 5859 TRACE [org.infinispan.commands.read.GetKeyValueCommand]
(OOB-3,Infinispan-Cluster,eq-22371:) Found value
ImmortalCacheEntry{cacheValue=ImmortalCacheValue{value=v1}}
ImmortalCacheEntry{key=org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c}x
...
2010-05-10 09:40:59,314 5860 TRACE [org.infinispan.interceptors.CallInterceptor] (main:)
Executing command:
PutKeyValueCommand{key=org.infinispan.distribution.L1InvalidationBugTest$CustomKey@bd14851c,
value=v1, putIfAbsent=false, lifespanMillis=600000, maxIdleTimeMillis=-1}.
I'm attaching both logs.
Stale data read when L1 invalidation happens while
UnionConsistentHash is in use
--------------------------------------------------------------------------------
Key: ISPN-425
URL:
https://jira.jboss.org/jira/browse/ISPN-425
Project: Infinispan
Issue Type: Bug
Components: Distributed Cache
Affects Versions: 4.1.0.BETA1
Reporter: Galder Zamarreno
Assignee: Galder Zamarreno
Priority: Blocker
Fix For: 4.1.0.CR1
See below:
----- "Manik Surtani" <manik(a)jboss.org> wrote:
> On 3 May 2010, at 08:51, Galder Zamarreno wrote:
>
> > Resending without log until the message is approved.
> >
> > --
> > Galder Zamarreño
> > Sr. Software Engineer
> > Infinispan, JBoss Cache
> >
> > ----- Forwarded Message -----
> > From: galder(a)redhat.com
> > To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
> > Sent: Friday, April 30, 2010 6:30:05 PM GMT +01:00 Amsterdam /
> Berlin / Bern / Rome / Stockholm / Vienna
> > Subject: Stale data read when L1 invalidation happens while
> UnionConsistentHash is in use
> >
> > Hi,
> >
> > I've spent all day chasing down a random Hot Rod testsuite failure
> related to distribution. This is the last hurdle to close
>
https://jira.jboss.org/jira/browse/ISPN-411. In
> HotRodDistributionTest, which is still to be committed, I test adding
> a new node, doing a put on this node, and then doing a get in a
> different node and making sure that I get what was put. The test
> randomly fails saying that the get returns the old value. The failure
> is nothing to do with Hot Rod itself but rather a race condition where
> union consistent hash is used. Let me explain:
> >
> > 1. An earlier operation had set
> "k-testDistributedPutWithTopologyChanges" key to
> "v5-testDistributedPutWithTopologyChanges".
> > 2. Start a new hot rod server in eq-7969.
> > 2. eq-7969 node calls a put on that key with
> "v6-testDistributedPutWithTopologyChanges". Recipients for the put
> are: eq-7969 and eq-61332.
> > 3. eq-7969 sends an invalidate L1 to all, including eq-13415
> > 4. eq-13415 should invalidate
> "k-testDistributedPutWithTopologyChanges" but it doesn't, since it
> considers that "k-testDistributedPutWithTopologyChanges" is local to
> eq-13415:
> >
> > 2010-04-30 18:02:19,907 6046 TRACE
> [org.infinispan.distribution.DefaultConsistentHash]
> (OOB-2,Infinispan-Cluster,eq-13415:) Hash code for key
> CacheKey{data=ByteArray{size=39, hashCode=17b1683, array=[107, 45,
> 116, 101, 115, 116, 68, 105, 115, 116, ..]}} is 344897059
> > 2010-04-30 18:02:19,907 6046 TRACE
> [org.infinispan.distribution.DefaultConsistentHash]
> (OOB-2,Infinispan-Cluster,eq-13415:) Candidates for key
> CacheKey{data=ByteArray{size=39, hashCode=17b1683, array=[107, 45,
> 116, 101, 115, 116, 68, 105, 115, 116, ..]}} are {5458=eq-7969,
> 6831=eq-61332}
> > 2010-04-30 18:02:19,907 6046 TRACE
> [org.infinispan.distribution.DistributionManagerImpl]
> (OOB-2,Infinispan-Cluster,eq-13415:) Is local
> CacheKey{data=ByteArray{size=39, hashCode=17b1683, array=[107, 45,
> 116, 101, 115, 116, 68, 105, 115, 116, ..]}} to eq-13415 query returns
> true and consistentHash is
> org.infinispan.distribution.UnionConsistentHash@10747b4
> >
> > This is a log with log messages that I added to debug it. The key
> factor here is that UnionConsistentHash is in use, probably due to
> rehashing not having fully finished.
> >
> > 5. The end result is that a read of
> "k-testDistributedPutWithTopologyChanges" in eq-13415 returns
> "v5-testDistributedPutWithTopologyChanges".
> >
> > I thought that maybe we could be more conservative here and if
> rehashing is in progress (or UnionConsistentHash is in use) invalidate
> regardless. Assuming that a put always follows an invalidation in
> distribution and not viceversa, that would be fine. The only downside
> is that you'd be invalidating too much but put would replace the data
> in the node where invalidation should not have happened but it did, so
> not a problem.
> >
> > Thoughts? Alternatively, maybe I need to shape my test so that I
> wait for rehashing to finish, but the problem would still be there.
>
> Yes, this seems to be a bug with concurrent rehashing and invalidation
> rather than HotRod.
>
> Could you modify your test to so the following:
>
> 1. start 2 caches C1 and C2.
> 2. put a key K such that K maps on to C1 and C2
> 3. add a new node, C3. K should now map to C1 and C3.
> 4. Modify the value on C1 *before* rehashing completes.
> 5. See if we see the stale value on C2.
>
> To do this you would need a custom object for K that hashes the way
> you would expect (this could be hardcoded) and a value which blocks
> when serializing so we can control how long rehashing takes.
Since logical addresses are used underneath and these change from one run to the other,
I'm not sure how I can generate such key programatically. It's even more
complicated to figure out a key that will later, when C3 starts, map to it. Without having
these addresses locked somehow, or their hash codes, I can't see how this is doable.
IOW, to be able to do this, I need to mock these addresses into giving fixed as hash
codes. I'll dig further into this.
>
> I never promised the test would be simple! :)
>
> Cheers
> Manik
> --
> Manik Surtani
> manik(a)jboss.org
> Lead, Infinispan
> Lead, JBoss Cache
>
http://www.infinispan.org
>
http://www.jbosscache.org
>
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira