<html><head><meta charset="utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>From <a href="https://jira.jboss.org/jira/browse/ISPN-428">https://jira.jboss.org/jira/browse/ISPN-428</a></div><div><br></div><div><span class="Apple-style-span" style="font-family: Arial, sans-serif; border-collapse: collapse; font-size: 12px; "><div id="comment-12529907-open"><div class="actionContainer" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; background-color: rgb(255, 255, 255); position: static; z-index: auto; "><div class="action-body" style="margin-top: 2px; margin-right: 2px; margin-bottom: 2px; margin-left: 2px; ">Problem: <br>1. A starts, then B starts; both see view {A,B}. DistributionManagerImpl.start is not called yet because no distributed cache has been started. <br>2. A dist cache is started on A. A's consistent hash now sees nodes {A,B} (as DistributionManagerImpl.start is called). <br>3. A dist cache is started on B. The JoinTask fetches A's DCH list of nodes, i.e. {A,B}. <br>4. B creates a hash function which contains {A,B} (as fetched from A) plus itself: {A,B,B}. <br>--- after this point the DCH in B is unreliable; anyway, here is how the timeout happens: <br><br>5. B.put(k,v). B acquires the lock on k, then B's DCH indicates that k should be placed on B (!!!). B tries a remote call on itself, but it will time out because the lock on k is already held by the user thread that is waiting. <br><br>In other words, the problem is caused by the fact that the joiner doesn't expect itself to be part of the remote cache's hash function, but it is. I think the hash function should check for that and drop duplicates. 
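The duplicate-dropping idea above can be sketched roughly as follows. This is a hypothetical, simplified model (the names DedupOwners and locate, and the plain TreeMap wheel, are illustrative only, not Infinispan's actual DefaultConsistentHash API); the positions mirror the hash wheel seen in the logs later in this thread, {109=eq-35426, 10032=eq-985, 10033=eq-985}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class DedupOwners {
    // Pick numOwners owner addresses for a key position by walking the hash
    // wheel clockwise, skipping positions whose address was already chosen.
    static List<String> locate(TreeMap<Integer, String> wheel, int keyPos, int numOwners) {
        List<String> owners = new ArrayList<>();
        // positions at or after the key's position
        SortedMap<Integer, String> tail = wheel.tailMap(keyPos);
        for (String addr : tail.values()) {
            if (owners.size() == numOwners) return owners;
            if (!owners.contains(addr)) owners.add(addr); // drop duplicate addresses
        }
        // wrap around the wheel
        for (String addr : wheel.values()) {
            if (owners.size() == numOwners) return owners;
            if (!owners.contains(addr)) owners.add(addr);
        }
        return owners;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> wheel = new TreeMap<>();
        wheel.put(109, "eq-35426");
        wheel.put(10032, "eq-985");
        wheel.put(10033, "eq-985");
        // A key hashing between 109 and 10032: without the contains() check the
        // next two positions are 10032 and 10033, i.e. [eq-985, eq-985] -- the
        // broken recipient list from the thread. With the check, the walk wraps
        // and picks the other node instead.
        System.out.println(locate(wheel, 5000, 2)); // [eq-985, eq-35426]
    }
}
```

With the duplicate check, a node that appears twice on the wheel (the {A,B,B} state) can never be chosen twice as an owner, so the self-addressed "remote" call that deadlocks on the user thread's lock never happens.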
<br><font class="Apple-style-span" face="Helvetica"><span class="Apple-style-span" style="border-collapse: separate; font-size: medium;"><font class="Apple-style-span" face="Arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;"><br></span></font></span></font></div></div></div></span></div><div>UT is <span class="Apple-style-span" style="border-collapse: collapse; font-family: Arial, sans-serif; font-size: 12px; ">ConcurrentStartWithReplTest</span></div><br><div><div>On 7 May 2010, at 16:16, Galder Zamarreno wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><br>----- "Mircea Markus" <<a href="mailto:mircea.markus@jboss.com">mircea.markus@jboss.com</a>> wrote:<br><br><blockquote type="cite">I've tried the same operation sequence on the caches but it works<br></blockquote><blockquote type="cite">without timeout. HR server also defines a cache for its own purposes;<br></blockquote><blockquote type="cite">I'll try to include that cache as well in the setup and check again.<br></blockquote><br>Do you have logs from the attempt you made to replicate the issue below with only caches and no HR servers? I'd like to see them to verify it.<br><br>The other cache you mention is a replicated cache, for topology info. I don't think it has any bearing here.<br><br><blockquote type="cite"><br></blockquote><blockquote type="cite">On 7 May 2010, at 14:20, Manik Surtani wrote:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">So TopologyChangeTest is a pretty complex test involving HotRod<br></blockquote></blockquote><blockquote type="cite">clients and servers, etc. 
Can this be reproduced in a simpler setting<br></blockquote><blockquote type="cite">- i.e., 2 p2p Infinispan instances, add a third, etc., without any<br></blockquote><blockquote type="cite">HotRod components?<br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">On 6 May 2010, at 17:51, <a href="mailto:galder@redhat.com">galder@redhat.com</a> wrote:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Hi all,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">As indicated on IRC, running<br></blockquote></blockquote></blockquote><blockquote type="cite">org.infinispan.client.hotrod.TopologyChangeTest.testTwoMembers() fails<br></blockquote><blockquote type="cite">randomly with replication timeout. It's very easy to replicate. When<br></blockquote><blockquote type="cite">it fails, this is what happens:<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">1. 
During rehashing, a new hash is installed:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:11,960 4932 TRACE<br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.distribution.DistributionManagerImpl]<br></blockquote><blockquote type="cite">(Rehasher-eq-985:) Installing new consistent hash<br></blockquote><blockquote type="cite">DefaultConsistentHash{addresses ={109=eq-35426, 10032=eq-985,<br></blockquote><blockquote type="cite">10033=eq-985}, hash space =10240}<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2. Rehash finishes and the previous hash is still installed:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:11,978 4950 INFO <br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.distribution.JoinTask] (Rehasher-eq-985:) eq-985<br></blockquote><blockquote type="cite">completed join in 30 milliseconds!<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">3. A put comes in to eq-985 who decides recipients are [eq-985,<br></blockquote></blockquote></blockquote><blockquote type="cite">eq-985]. 
Most likely, the hash fell somewhere between 109 and 10032<br></blockquote><blockquote type="cite">and since the number of owners is 2, it took the next 2:<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:12,307 5279 TRACE<br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:)<br></blockquote><blockquote type="cite">eq-985 broadcasting call<br></blockquote><blockquote type="cite">PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9,<br></blockquote><blockquote type="cite">hashCode=d28dfa, array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}},<br></blockquote><blockquote type="cite">value=CacheValue{data=ByteArray{size=9, array=[-84, -19, 0, 5, 116, 0,<br></blockquote><blockquote type="cite">2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false,<br></blockquote><blockquote type="cite">lifespanMillis=-1000, maxIdleTimeMillis=-1000} to recipient list<br></blockquote><blockquote type="cite">[eq-985, eq-985]<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Everything afterwards is a mess:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">4. JGroups removes the local address from the destination. 
The<br></blockquote></blockquote></blockquote><blockquote type="cite">reason Infinispan does not do it is that the number of recipients<br></blockquote><blockquote type="cite">is 2 and the number of members in the cluster is 2, so it thinks it's a<br></blockquote><blockquote type="cite">broadcast:<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:12,308 5280 TRACE<br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]<br></blockquote><blockquote type="cite">(HotRodServerWorker-2-1:) real_dests=[eq-985]<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">5. JGroups still sends it as a broadcast:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:12,308 5280 TRACE [org.jgroups.protocols.TCP]<br></blockquote></blockquote></blockquote><blockquote type="cite">(HotRodServerWorker-2-1:) sending msg to null, src=eq-985, headers are<br></blockquote><blockquote type="cite">RequestCorrelator: id=201, type=REQ, id=12, rsp_expected=true, NAKACK:<br></blockquote><blockquote type="cite">[MSG, seqno=5], TCP: [channel_name=Infinispan-Cluster]<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">6. 
Another node deals with this and replies:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:12,310 5282 TRACE<br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]<br></blockquote><blockquote type="cite">(OOB-1,Infinispan-Cluster,eq-35426:) Attempting to execute command:<br></blockquote><blockquote type="cite">SingleRpcCommand{cacheName='___defaultcache',<br></blockquote><blockquote type="cite">command=PutKeyValueCommand{key=CacheKey{data=ByteArray{size=9,<br></blockquote><blockquote type="cite">hashCode=43487e, array=[-84, -19, 0, 5, 116, 0, 2, 107, 48, ..]}},<br></blockquote><blockquote type="cite">value=CacheValue{data=ByteArray{size=9, array=[-84, -19, 0, 5, 116, 0,<br></blockquote><blockquote type="cite">2, 118, 48, ..]}, version=281483566645249}, putIfAbsent=false,<br></blockquote><blockquote type="cite">lifespanMillis=-1000, maxIdleTimeMillis=-1000}} [sender=eq-985]<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">...<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">7. 
However, no replies yet from eq-985, so you get:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:27,310 20282 TRACE<br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]<br></blockquote><blockquote type="cite">(HotRodServerWorker-2-1:) responses: [sender=eq-985, retval=null,<br></blockquote><blockquote type="cite">received=false, suspected=false]<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">2010-05-06 17:54:27,313 20285 TRACE<br></blockquote></blockquote></blockquote><blockquote type="cite">[org.infinispan.remoting.rpc.RpcManagerImpl] (HotRodServerWorker-2-1:)<br></blockquote><blockquote type="cite">replication exception: <br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">org.infinispan.util.concurrent.TimeoutException: Replication<br></blockquote></blockquote></blockquote><blockquote type="cite">timeout for eq-985<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Now, I don't understand the reason for creating a hash<br></blockquote></blockquote></blockquote><blockquote type="cite">10032=eq-985, 10033=eq-985. 
Shouldn't keeping 10032=eq-985 be enough?<br></blockquote><blockquote type="cite">Why add 10033=eq-985?<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Assuming there was a valid case for it, a naive approach would be<br></blockquote></blockquote></blockquote><blockquote type="cite">to discard a second node that points to an address already in the<br></blockquote><blockquote type="cite">recipient list. So, 10032=eq-985 would be accepted for the list but<br></blockquote><blockquote type="cite">when encountering 10033=eq-985, this would be skipped.<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Finally, I thought waiting for rehashing to finish would solve the<br></blockquote></blockquote></blockquote><blockquote type="cite">issue but as you can see in 2., rehashing finished and the hash is still<br></blockquote><blockquote type="cite">in the same shape. Also, I've attached a log file.<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Cheers,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">--<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Galder Zamarreño<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Sr. 
Software Engineer<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Infinispan, JBoss Cache<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><bad2_jgroups-infinispan.log.zip>_______________________________________________<br></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">infinispan-dev mailing list<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">--<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Manik Surtani<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="mailto:manik@jboss.org">manik@jboss.org</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Lead, Infinispan<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Lead, JBoss Cache<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="http://www.infinispan.org">http://www.infinispan.org</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="http://www.jbosscache.org">http://www.jbosscache.org</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><br></blockquote></blockquote><blockquote type="cite"><br></blockquote><br>_______________________________________________<br>infinispan-dev mailing list<br><a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br><a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a></div></blockquote></div><br></body></html>