[infinispan-issues] [JBoss JIRA] Updated: (ISPN-428) Hashing generating recipient lists with same address

Galder Zamarreno (JIRA) jira-events at lists.jboss.org
Mon May 10 10:57:08 EDT 2010


     [ https://jira.jboss.org/jira/browse/ISPN-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Galder Zamarreno updated ISPN-428:
----------------------------------

        Fix Version/s: 4.1.0.BETA1
                           (was: 4.1.0.CR1)
    Affects Version/s: 4.1.0.ALPHA3
                           (was: 4.1.0.BETA1)


> Hashing generating recipient lists with same address
> ----------------------------------------------------
>
>                 Key: ISPN-428
>                 URL: https://jira.jboss.org/jira/browse/ISPN-428
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Distributed Cache
>    Affects Versions: 4.1.0.ALPHA3
>            Reporter: Galder Zamarreno
>            Assignee: Manik Surtani
>            Priority: Blocker
>             Fix For: 4.1.0.BETA1
>
>
> Hi all, 
> As indicated on IRC, running org.infinispan.client.hotrod.TopologyChangeTest.testTwoMembers() fails randomly with replication timeout. It's very easy to replicate. When it fails, this is what happens:  
> 1. During rehashing, a new hash is installed: 
> 2010-05-06 17:54:11,960 4932  TRACE [org.infinispan.distribution.DistributionManagerImpl] (Rehasher-eq-985:) Installing new consistent hash DefaultConsistentHash{addresses ={109=eq-35426, 10032=eq-985, 10033=eq-985}, hash space =10240}  
> 2. Rehash finishes and the previous hash is still installed: 
> 2010-05-06 17:54:11,978 4950  INFO  [org.infinispan.distribution.JoinTask] (Rehasher-eq-985:) eq-985 completed join in 30 milliseconds!  
> 3. A put comes in to eq-985 who decides recipients are [eq-985, eq-985]. Most likely, the hash falled somewhere between 109 and 10032 and since owners are 2, it took the next 2: 
> 2010-05-06 17:54:12,307 5279  TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:) eq-985 broadcasting call PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9, hashCode=d28dfa, array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}}, value=CacheValue{data=ByteArray{size=9, array=[-84, -19, 0, 5, 116, 0, 2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false, lifespanMillis=-1000, maxIdleTimeMillis=-1000} to recipient list [eq-985, eq-985]  
> Everything afterwards is a mess:  
> 4. JGroups removes the local address from the destination. The reason Infinispan does not do it it's because the number of recipients is 2 and the number of members in the cluster 2, so it thinks it's a broadcast: 
> 2010-05-06 17:54:12,308 5280  TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (HotRodServerWorker-2-1:) real_dests=[eq-985]  
> 5. JGroups still sends it as a broadcast: 
> 2010-05-06 17:54:12,308 5280  TRACE [org.jgroups.protocols.TCP] (HotRodServerWorker-2-1:) sending msg to null, src=eq-985, headers are RequestCorrelator: id=201, type=REQ, id=12, rsp_expected=true, NAKACK: [MSG, seqno=5], TCP: [channel_name=Infinispan-Cluster]  
> 6. Another node deals with this and replies: 
> 2010-05-06 17:54:12,310 5282  TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-1,Infinispan-Cluster,eq-35426:) Attempting to execute command: SingleRpcCommand{cacheName='___defaultcache', command=PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9, hashCode=43487e, array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}}, value=CacheValue{data=ByteArray{size=9, array=[-84, -19, 0, 5, 116, 0, 2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false, lifespanMillis=-1000, maxIdleTimeMillis=-1000}} [sender=eq-985] ...  
> 7. However, no replies yet from eq-985, so u get: 
> 2010-05-06 17:54:27,310 20282 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (HotRodServerWorker-2-1:) responses: [sender=eq-985, retval=null, received=false, suspected=false]  2010-05-06 17:54:27,313 20285 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:) replication exception:  org.infinispan.util.concurrent.TimeoutException: Replication timeout for eq-985  
> Now, I don't understand the reason for creating a hash 10032=eq-985, 10033=eq-985. Shouldn't keeping 10032=eq-985 be enough? Why add 10033=eq-985?  Assuming there was a valid case for it, a naive approach would be to discard a second node that points to the an address already in the recipient list. So, 10032=eq-985 would be accepted for the list but when encountering 10033=eq-985, this would be skipped.  
> Finally, I thought waiting for rehashing to finish would solve the issue but as u can see in 2., rehashing finished and the hash is still in the same shape. Also, I've attached a log file.  Cheers

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


More information about the infinispan-issues mailing list