[infinispan-issues] [JBoss JIRA] Commented: (ISPN-428) Hashing generating recipient lists with same address

Monday, 10 May 2010

    [
https://jira.jboss.org/jira/browse/ISPN-428?page=com.atlassian.jira.plugi...
] 

Mircea Markus commented on ISPN-428:
------------------------------------

Problem:
1.A starts, B starts see view {A,B} , DistributionManagerImpl.start not called yet because
no distributed cache was started
2. a dist cache is started on A. A's consistent hash sees nodes {A,B} now (as
DistributionManagerImpl.start is called)
3. a dist cache is started on B. The JoinTask fetches A's DCH list of nodes, i.e.
{A,B}
4. B creates a hash function which contains {A,B} (as fetched from A) and itself: {A,B,B}
--- aftert this point DCH in B is unreliable, anyway here is how the timeout happens

5. B.put(k,v). B acquires lock on k, then B's DCH indicates that k should be placed on
B (!!!). Tries a remote call on B, but it will timeout as the lock on k is already held by
user thread that waits 

In other words, the problem is caused by the fact that the joiner doesn't expect
itself to be part of the hash function of the remote cache, but it is. I think that the
hash function should check for that, and drop duplicates.

Test: ReplTopologyChangeTest  in core

...
 Hashing generating recipient lists with same address
 ----------------------------------------------------

                 Key: ISPN-428
                 URL: https://jira.jboss.org/jira/browse/ISPN-428
             Project: Infinispan
          Issue Type: Bug
          Components: Distributed Cache
    Affects Versions: 4.1.0.BETA1
            Reporter: Galder Zamarreno
            Assignee: Galder Zamarreno
            Priority: Blocker
             Fix For: 4.1.0.CR1

 Hi all, 
 As indicated on IRC, running
org.infinispan.client.hotrod.TopologyChangeTest.testTwoMembers() fails randomly with
replication timeout. It's very easy to replicate. When it fails, this is what happens:

 1. During rehashing, a new hash is installed: 
 2010-05-06 17:54:11,960 4932  TRACE [org.infinispan.distribution.DistributionManagerImpl]
(Rehasher-eq-985:) Installing new consistent hash DefaultConsistentHash{addresses
={109=eq-35426, 10032=eq-985, 10033=eq-985}, hash space =10240}  
 2. Rehash finishes and the previous hash is still installed: 
 2010-05-06 17:54:11,978 4950  INFO  [org.infinispan.distribution.JoinTask]
(Rehasher-eq-985:) eq-985 completed join in 30 milliseconds!  
 3. A put comes in to eq-985 who decides recipients are [eq-985, eq-985]. Most likely, the
hash falled somewhere between 109 and 10032 and since owners are 2, it took the next 2: 
 2010-05-06 17:54:12,307 5279  TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(HotRodServerWorker-2-1:) eq-985 broadcasting call
PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9, hashCode=d28dfa, array=[-84, -19,
0, 5, 116, 0, 2, 107, 48, ..]}}, value=CacheValue{data=ByteArray{size=9, array=[-84, -19,
0, 5, 116, 0, 2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false,
lifespanMillis=-1000, maxIdleTimeMillis=-1000} to recipient list [eq-985, eq-985]  
 Everything afterwards is a mess:  
 4. JGroups removes the local address from the destination. The reason Infinispan does not
do it it's because the number of recipients is 2 and the number of members in the
cluster 2, so it thinks it's a broadcast: 
 2010-05-06 17:54:12,308 5280  TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(HotRodServerWorker-2-1:) real_dests=[eq-985]  
 5. JGroups still sends it as a broadcast: 
 2010-05-06 17:54:12,308 5280  TRACE [org.jgroups.protocols.TCP] (HotRodServerWorker-2-1:)
sending msg to null, src=eq-985, headers are RequestCorrelator: id=201, type=REQ, id=12,
rsp_expected=true, NAKACK: [MSG, seqno=5], TCP: [channel_name=Infinispan-Cluster]  
 6. Another node deals with this and replies: 
 2010-05-06 17:54:12,310 5282  TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-1,Infinispan-Cluster,eq-35426:) Attempting to execute command:
SingleRpcCommand{cacheName='___defaultcache',
command=PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9, hashCode=43487e,
array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}}, value=CacheValue{data=ByteArray{size=9,
array=[-84, -19, 0, 5, 116, 0, 2, 118, 48, ..]}, version=281483566645249},
putIfAbsent=false, lifespanMillis=-1000, maxIdleTimeMillis=-1000}} [sender=eq-985] ...  
 7. However, no replies yet from eq-985, so u get: 
 2010-05-06 17:54:27,310 20282 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(HotRodServerWorker-2-1:) responses: [sender=eq-985, retval=null, received=false,
suspected=false]  2010-05-06 17:54:27,313 20285 TRACE
[org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:) replication
exception:  org.infinispan.util.concurrent.TimeoutException: Replication timeout for
eq-985  
 Now, I don't understand the reason for creating a hash 10032=eq-985, 10033=eq-985.
Shouldn't keeping 10032=eq-985 be enough? Why add 10033=eq-985?  Assuming there was a
valid case for it, a naive approach would be to discard a second node that points to the
an address already in the recipient list. So, 10032=eq-985 would be accepted for the list
but when encountering 10033=eq-985, this would be skipped.  
 Finally, I thought waiting for rehashing to finish would solve the issue but as u can see
in 2., rehashing finished and the hash is still in the same shape. Also, I've attached
a log file.  Cheers 
-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

[infinispan-issues] [JBoss JIRA] Commented: (ISPN-428) Hashing generating recipient lists with same address