I've enhanced the previous scenario by defining a replicated cache (same config as
topology cache with in memory ST enabled) and the test still passes.
I think I'm missing something from the interaction between HR server and and ispn
cluster, galder mind taking a look?
On 7 May 2010, at 14:44, Mircea Markus wrote:
I've tried the the same operation sequence on the caches but it
works without timeout. HR server also defines a cache for it's own purposes, I'll
try to include that cache as well in the setup and check again.
On 7 May 2010, at 14:20, Manik Surtani wrote:
> So TopologyChangeTest is a pretty complex test involving HotRod clients and servers,
etc. Can this be reproduced in a simpler setting - i.e., 2 p2p Infinispan instances, add
a third, etc., without any HotRod components?
>
> On 6 May 2010, at 17:51, galder(a)redhat.com wrote:
>
>> Hi all,
>>
>> As indicated on IRC, running
org.infinispan.client.hotrod.TopologyChangeTest.testTwoMembers() fails randomly with
replication timeout. It's very easy to replicate. When it fails, this is what
happens:
>>
>> 1. During rehashing, a new hash is installed:
>> 2010-05-06 17:54:11,960 4932 TRACE
[org.infinispan.distribution.DistributionManagerImpl] (Rehasher-eq-985:) Installing new
consistent hash DefaultConsistentHash{addresses ={109=eq-35426, 10032=eq-985,
10033=eq-985}, hash space =10240}
>>
>> 2. Rehash finishes and the previous hash is still installed:
>> 2010-05-06 17:54:11,978 4950 INFO [org.infinispan.distribution.JoinTask]
(Rehasher-eq-985:) eq-985 completed join in 30 milliseconds!
>>
>> 3. A put comes in to eq-985 who decides recipients are [eq-985, eq-985]. Most
likely, the hash falled somewhere between 109 and 10032 and since owners are 2, it took
the next 2:
>> 2010-05-06 17:54:12,307 5279 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(HotRodServerWorker-2-1:) eq-985 broadcasting call
PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9, hashCode=d28dfa, array=[-84, -19,
0, 5, 116, 0, 2, 107, 48, ..]}}, value=CacheValue{data=ByteArray{size=9, array=[-84, -19,
0, 5, 116, 0, 2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false,
lifespanMillis=-1000, maxIdleTimeMillis=-1000} to recipient list [eq-985, eq-985]
>>
>> Everything afterwards is a mess:
>>
>> 4. JGroups removes the local address from the destination. The reason Infinispan
does not do it it's because the number of recipients is 2 and the number of members in
the cluster 2, so it thinks it's a broadcast:
>> 2010-05-06 17:54:12,308 5280 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(HotRodServerWorker-2-1:) real_dests=[eq-985]
>>
>> 5. JGroups still sends it as a broadcast:
>> 2010-05-06 17:54:12,308 5280 TRACE [org.jgroups.protocols.TCP]
(HotRodServerWorker-2-1:) sending msg to null, src=eq-985, headers are RequestCorrelator:
id=201, type=REQ, id=12, rsp_expected=true, NAKACK: [MSG, seqno=5], TCP:
[channel_name=Infinispan-Cluster]
>>
>> 6. Another node deals with this and replies:
>> 2010-05-06 17:54:12,310 5282 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-1,Infinispan-Cluster,eq-35426:) Attempting to execute command:
SingleRpcCommand{cacheName='___defaultcache',
command=PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9, hashCode=43487e,
array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}}, value=CacheValue{data=ByteArray{size=9,
array=[-84, -19, 0, 5, 116, 0, 2, 118, 48, ..]}, version=281483566645249},
putIfAbsent=false, lifespanMillis=-1000, maxIdleTimeMillis=-1000}} [sender=eq-985]
>> ...
>>
>> 7. However, no replies yet from eq-985, so u get:
>> 2010-05-06 17:54:27,310 20282 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(HotRodServerWorker-2-1:) responses: [sender=eq-985, retval=null, received=false,
suspected=false]
>>
>> 2010-05-06 17:54:27,313 20285 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(HotRodServerWorker-2-1:) replication exception:
>> org.infinispan.util.concurrent.TimeoutException: Replication timeout for eq-985
>>
>> Now, I don't understand the reason for creating a hash 10032=eq-985,
10033=eq-985. Shouldn't keeping 10032=eq-985 be enough? Why add 10033=eq-985?
>>
>> Assuming there was a valid case for it, a naive approach would be to discard a
second node that points to the an address already in the recipient list. So, 10032=eq-985
would be accepted for the list but when encountering 10033=eq-985, this would be skipped.
>>
>> Finally, I thought waiting for rehashing to finish would solve the issue but as u
can see in 2., rehashing finished and the hash is still in the same shape. Also, I've
attached a log file.
>>
>> Cheers,
>> --
>> Galder ZamarreƱo
>> Sr. Software Engineer
>> Infinispan, JBoss Cache
>>
<bad2_jgroups-infinispan.log.zip>_______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev(a)lists.jboss.org
>>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Manik Surtani
> manik(a)jboss.org
> Lead, Infinispan
> Lead, JBoss Cache
>
http://www.infinispan.org
>
http://www.jbosscache.org
>
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev