I've tried the the same operation sequence on the caches but it
works
without timeout. HR server also defines a cache for it's own purposes,
I'll try to include that cache as well in the setup and check again.
Do you have log for the attempt you did to replicate the issue below with only caches and
not HR servers? I'd like to see them to verify it.
The other cache you mention is a replicated cache, for topology info. I don't think it
has any bearings here.
On 7 May 2010, at 14:20, Manik Surtani wrote:
> So TopologyChangeTest is a pretty complex test involving HotRod
clients and servers, etc. Can this be reproduced in a simpler setting
- i.e., 2 p2p Infinispan instances, add a third, etc., without any
HotRod components?
>
> On 6 May 2010, at 17:51, galder(a)redhat.com wrote:
>
>> Hi all,
>>
>> As indicated on IRC, running
org.infinispan.client.hotrod.TopologyChangeTest.testTwoMembers() fails
randomly with replication timeout. It's very easy to replicate. When
it fails, this is what happens:
>>
>> 1. During rehashing, a new hash is installed:
>> 2010-05-06 17:54:11,960 4932 TRACE
[org.infinispan.distribution.DistributionManagerImpl]
(Rehasher-eq-985:) Installing new consistent hash
DefaultConsistentHash{addresses ={109=eq-35426, 10032=eq-985,
10033=eq-985}, hash space =10240}
>>
>> 2. Rehash finishes and the previous hash is still installed:
>> 2010-05-06 17:54:11,978 4950 INFO
[org.infinispan.distribution.JoinTask] (Rehasher-eq-985:) eq-985
completed join in 30 milliseconds!
>>
>> 3. A put comes in to eq-985 who decides recipients are [eq-985,
eq-985]. Most likely, the hash falled somewhere between 109 and 10032
and since owners are 2, it took the next 2:
>> 2010-05-06 17:54:12,307 5279 TRACE
[org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:)
eq-985 broadcasting call
PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9,
hashCode=d28dfa, array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}},
value=CacheValue{data=ByteArray{size=9, array=[-84, -19, 0, 5, 116, 0,
2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false,
lifespanMillis=-1000, maxIdleTimeMillis=-1000} to recipient list
[eq-985, eq-985]
>>
>> Everything afterwards is a mess:
>>
>> 4. JGroups removes the local address from the destination. The
reason Infinispan does not do it it's because the number of recipients
is 2 and the number of members in the cluster 2, so it thinks it's a
broadcast:
>> 2010-05-06 17:54:12,308 5280 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(HotRodServerWorker-2-1:) real_dests=[eq-985]
>>
>> 5. JGroups still sends it as a broadcast:
>> 2010-05-06 17:54:12,308 5280 TRACE [org.jgroups.protocols.TCP]
(HotRodServerWorker-2-1:) sending msg to null, src=eq-985, headers are
RequestCorrelator: id=201, type=REQ, id=12, rsp_expected=true, NAKACK:
[MSG, seqno=5], TCP: [channel_name=Infinispan-Cluster]
>>
>> 6. Another node deals with this and replies:
>> 2010-05-06 17:54:12,310 5282 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-1,Infinispan-Cluster,eq-35426:) Attempting to execute command:
SingleRpcCommand{cacheName='___defaultcache',
command=PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9,
hashCode=43487e, array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}},
value=CacheValue{data=ByteArray{size=9, array=[-84, -19, 0, 5, 116, 0,
2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false,
lifespanMillis=-1000, maxIdleTimeMillis=-1000}} [sender=eq-985]
>> ...
>>
>> 7. However, no replies yet from eq-985, so u get:
>> 2010-05-06 17:54:27,310 20282 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(HotRodServerWorker-2-1:) responses: [sender=eq-985, retval=null,
received=false, suspected=false]
>>
>> 2010-05-06 17:54:27,313 20285 TRACE
[org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:)
replication exception:
>> org.infinispan.util.concurrent.TimeoutException: Replication
timeout for eq-985
>>
>> Now, I don't understand the reason for creating a hash
10032=eq-985, 10033=eq-985. Shouldn't keeping 10032=eq-985 be enough?
Why add 10033=eq-985?
>>
>> Assuming there was a valid case for it, a naive approach would be
to discard a second node that points to the an address already in the
recipient list. So, 10032=eq-985 would be accepted for the list but
when encountering 10033=eq-985, this would be skipped.
>>
>> Finally, I thought waiting for rehashing to finish would solve the
issue but as u can see in 2., rehashing finished and the hash is still
in the same shape. Also, I've attached a log file.
>>
>> Cheers,
>> --
>> Galder ZamarreƱo
>> Sr. Software Engineer
>> Infinispan, JBoss Cache
>>
<bad2_jgroups-infinispan.log.zip>_______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev(a)lists.jboss.org
>>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Manik Surtani
> manik(a)jboss.org
> Lead, Infinispan
> Lead, JBoss Cache
>
http://www.infinispan.org
>
http://www.jbosscache.org
>
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev