[infinispan-dev] ISPN-425 - Issues with waiting for rehash to complete on startup

galder at redhat.com galder at redhat.com
Mon May 10 07:41:28 EDT 2010


Hi,

Re: https://jira.jboss.org/jira/browse/ISPN-425

We've been discussing solutions to the fundamental problem in this issue, which is that operations are allowed on the cache before rehashing has completed at startup. I've been experimenting with a solution based on waiting for rehashing to complete, but this is causing problems with the Hot Rod distribution tests. In Hot Rod, this is what happens:

1. Start Hot Rod server 1 which starts a replicated topology cache.
2. Start Hot Rod server 2 which starts a replicated topology cache.
3. Send a request for a distributed cache called 'hotRodDistSync' in Hot Rod server 2.
4. As a result of this request, the 'hotRodDistSync' cache should be started, but startup does not succeed. It stays in a rehash join loop, logging:

4595  INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,Infinispan-Cluster,eq-52045:) Cache named hotRodDistSync does not exist on this cache manager!
4595  TRACE [org.infinispan.marshall.VersionAwareMarshaller] (OOB-2,Infinispan-Cluster,eq-52045:) Wrote version 410
4596  TRACE [org.infinispan.marshall.VersionAwareMarshaller] (OOB-2,Infinispan-Cluster,eq-64501:) Read version 410
4596  TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (Rehasher-eq-64501:) responses: [sender=eq-52045, retval=null, received=true, suspected=false]

4597  DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-eq-64501:) Retrieved old consistent hash address list null
4597  TRACE [org.infinispan.distribution.JoinTask] (Rehasher-eq-64501:) Sleeping for 1.54 seconds

The problem here is that Hot Rod server 1 has not yet started the 'hotRodDistSync' cache, since no requests were sent to it. This is different from the cache not allowing invocations yet because it is in the middle of startup. So I wondered whether InboundInvocationHandlerImpl.handle() could return a custom response rather than null, and whether JoinTask could handle it so that, if all the responses received say that the cache does not exist, the rehash is considered complete and the join process finishes.

The reason I suggest returning a custom response is that JOIN_REQ returning null can also mean that the coordinator is in the middle of another join (DMI.requestPermissionToJoin). These two situations are not the same, which is why I suggest treating them differently.
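
To make the distinction concrete, here is a minimal, self-contained sketch of the decision the joiner would make. It is not Infinispan code: the class JoinResponseSketch, the Reply and Decision enums, and the decide() method are all hypothetical names invented for illustration. JOIN_IN_PROGRESS stands in for what a null reply means today when the coordinator is busy with another join, and the handling of mixed replies is an assumption on my part.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; none of these types exist in Infinispan.
public class JoinResponseSketch {

    /** Hypothetical kinds of reply a joiner could receive to a join request. */
    public enum Reply {
        CACHE_NOT_FOUND,   // proposed custom response: the cache is not started on that node
        JOIN_IN_PROGRESS,  // models today's null: the coordinator is handling another join
        ADDRESS_LIST       // a normal reply carrying the old consistent hash address list
    }

    /** Hypothetical outcomes for the joiner. */
    public enum Decision { REHASH_COMPLETE, RETRY_LATER, PROCEED_WITH_REHASH }

    /** Decide what the joiner should do, given the replies from the other nodes. */
    public static Decision decide(List<Reply> replies) {
        if (replies.isEmpty()) {
            // Assumption: with no replies at all, keep retrying as before.
            return Decision.RETRY_LATER;
        }
        boolean allMissing = true;
        boolean anyBusy = false;
        for (Reply r : replies) {
            if (r != Reply.CACHE_NOT_FOUND) {
                allMissing = false;
            }
            if (r == Reply.JOIN_IN_PROGRESS) {
                anyBusy = true;
            }
        }
        if (allMissing) {
            // Nobody else runs this cache, so there is no state to rehash.
            return Decision.REHASH_COMPLETE;
        }
        if (anyBusy) {
            // Another join is in progress: sleep and ask again.
            return Decision.RETRY_LATER;
        }
        return Decision.PROCEED_WITH_REHASH;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(Reply.CACHE_NOT_FOUND)));
        System.out.println(decide(Arrays.asList(Reply.JOIN_IN_PROGRESS)));
        System.out.println(decide(Collections.<Reply>emptyList()));
    }
}
```

In the Hot Rod scenario above, server 2 would receive only CACHE_NOT_FOUND from server 1 and could finish the join immediately instead of sleeping and retrying.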

Cheers,
--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache


