[infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2

Wed Feb 1 08:39:22 EST 2012

On Wed, Feb 1, 2012 at 1:46 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
> On 1 February 2012 11:23, Dan Berindei <dan.berindei at gmail.com> wrote:
>> Bela, you're right, this is essentially what we talked about in Lisbon:
>> https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign
>>
>> For joins I actually started working on a policy of coalescing joins
>> that happen one after the other in a short time interval. The current
>> implementation is very primitive, as I shifted focus to stability, but
>> it does coalesce joins 1 second after another join started (or while
>> that join is still running).
>>
>> I don't quite agree with Sanne's assessment that it's fine for
>> getCache() to block for 5 minutes until the administrator allows the
>> new node to join. We should modify startCaches() instead to signal to
>> the coordinator that we are ready to receive data for one or all of
>> the defined caches, and wait with a customizable time limit until the
>> caches have properly joined the cluster.
>>
>> The getCache() timeout should not be increased at all. Instead I would
>> propose that getCache() returns a functional cache immediately, even
>> if the cache didn't receive any data, and it works solely as an L1
>> cache until the administrator allows it to join. I'd even make it
>> possible to designate a cache as an L1-only cache, so it's never an
>> owner for any key.
>
> I agree that would be very nice, but makes it much more complex to
> implement in 5.2 as well: functional L1 means that the other nodes
> must accept this node as part of the grid, including for L1
> invalidation purposes.

I don't think L1 would be a problem: the L1 code doesn't assume that
the requestor is in the CH. That would basically be the only
difference between a "normal" node and an "L1-only" node.

However, I'm sure there will be problems in other areas, so I wouldn't
push this for 5.2.

> So my proposal on blocking until ready is to make a first step, and I
> think it would still be very useful for people wanting to boot some
> ~100 nodes. Blocking the application is not a big deal, as you're
> delaying boot of an application which was likely not even powered on
> before.
> When adding several new nodes, I just want them to "add all at once",
> so preventing intermediate rehashing: until all have joined you should
> block rehash - that's a manual (or more likely automated externally)
> step and will not be engaged for long, nor would it replace normal
> behaviour when disabled.
>
> Actually even more useful would be to start a node with such a
> no-rehash option enabled.
>

The rehash or no-rehash decision has to be made on the coordinator,
since the coordinator is the one that triggers the rehash.
So any flag that allows or disallows rehashing has to be propagated to
the coordinator, and maybe even to the entire cluster in case the
coordinator goes down.
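
To make the coordinator-side switch a bit more concrete, here is a
minimal sketch of how I picture it. Everything in it is hypothetical
(the RehashGate class and its method names are made up for this mail
and are not existing Infinispan code), and it ignores leaves and
coordinator failover completely: joiners only accumulate in a pending
set while the flag is off, and a single view installation picks them
all up once the flag is turned back on.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical coordinator-side gate; an illustration, not an Infinispan class. */
final class RehashGate {
    private final Set<String> installedView = new LinkedHashSet<>();
    private final Set<String> pendingJoiners = new LinkedHashSet<>();
    private boolean rehashEnabled = true; // feature off by default: joins rehash as usual
    private int rehashCount = 0;

    /** Called on the coordinator when a node asks to join the cache. */
    synchronized void join(String node) {
        if (installedView.contains(node)) {
            return;
        }
        pendingJoiners.add(node);
        if (rehashEnabled) {
            installPending();
        }
    }

    /** The switch an administrator would flip, e.g. through JMX. */
    synchronized void setRehashEnabled(boolean enabled) {
        rehashEnabled = enabled;
        if (enabled) {
            installPending(); // all queued joiners go into one view, one rehash
        }
    }

    private void installPending() {
        if (pendingJoiners.isEmpty()) {
            return;
        }
        installedView.addAll(pendingJoiners);
        pendingJoiners.clear();
        rehashCount++;
    }

    synchronized List<String> installedView() {
        return new ArrayList<>(installedView);
    }

    synchronized int rehashCount() {
        return rehashCount;
    }
}
```

With the flag off, calling join("B") and join("C") leaves the installed
view untouched, and setRehashEnabled(true) then installs both in one
step. The flag itself would still have to be replicated to the other
nodes so that a new coordinator can take over with the same setting.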

>>
>>
>> For leaves, the main problem is that every node has to compute the
>> same primary owner for a key, at all times. So we need a 2PC cache
>> view installation immediately after any leave to ensure that every
>> node determines the primary owner in the same way - we can't coalesce
>> or postpone leaves.
>>
>> For 5.2 I will try to decouple the cache view installation from the
>> state transfer, so in theory we will be able to coalesce/postpone the
>> state transfer for leaves as well
>> (https://issues.jboss.org/browse/ISPN-1827). I kind of need it for
>> non-blocking state transfer, because with the current implementation a
>> leave forces us to cancel any state transfer in progress and restart
>> with the updated cache view - a state transfer rollback will be very
>> expensive with NBST.
>>
>>
>> Erik does raise a valid point - with TACH, if we bring up a node with
>> a different siteId, then it will be an owner for all the keys in the
>> cache. That node probably isn't provisioned to hold all the keys, so
>> it would very likely run out of memory or evict much of the data. I
>> guess that makes it a 5.2 issue?
>
> That's exactly my reasoning as well. There are situations in which you
> want to add several nodes "once", which is very different than "in
> rapid sequence" as that would storm the network with data shuffling in
> all directions.
>

I don't think the network traffic would be that bad. If the first
rehash takes a long time, then all the other joiners will queue up and
join at the same time. So we're looking at just one extra rehash; the
problem is with that first rehash.
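
As a rough illustration of the queueing I mean, here is a toy model.
It is only a sketch under my own assumptions: JoinQueueModel and its
methods are hypothetical names, it is single-threaded, and it is not
how the real state transfer code is structured. It only counts how
many rehashes get started when several nodes join while the first
rehash is still running.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical model of joiners queueing behind a running rehash; not Infinispan code. */
final class JoinQueueModel {
    private final List<String> members = new ArrayList<>();
    private final Queue<String> waiting = new ArrayDeque<>();
    private boolean rehashRunning = false;
    private int rehashesStarted = 0;

    /** A node asks to join; it waits if a rehash is already in progress. */
    void join(String node) {
        if (rehashRunning) {
            waiting.add(node);
            return;
        }
        members.add(node);
        startRehash();
    }

    /** The running rehash completed; admit every queued joiner in one go. */
    void rehashFinished() {
        rehashRunning = false;
        if (waiting.isEmpty()) {
            return;
        }
        members.addAll(waiting);
        waiting.clear();
        startRehash();
    }

    private void startRehash() {
        rehashRunning = true;
        rehashesStarted++;
    }

    int rehashesStarted() {
        return rehashesStarted;
    }

    int memberCount() {
        return members.size();
    }

    boolean isRehashRunning() {
        return rehashRunning;
    }
}
```

In this model, five nodes joining back to back cost two rehashes: one
for the first joiner and one for the four that queued up behind it.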

>>
>> Shutting down a site should be possible even with what we have now -
>> just insert a DISCARD protocol in the JGroups stack of all the nodes
>> that are shutting down, and when FD finally times out on the nodes in
>> the surviving datacenter they won't have any state transfer to do
>> (although it may cause a few failed state transfer attempts). We could
>> make it simpler though.
>
> Yes, use a pair of scissors ;-)
>
> --Sanne
>
>>
>>
>> Cheers
>> Dan
>>
>>
>> On Tue, Jan 31, 2012 at 6:21 PM, Erik Salter <an1310 at hotmail.com> wrote:
>>> ...such as bringing up a backup data center.
>>>
>>> -----Original Message-----
>>> From: infinispan-dev-bounces at lists.jboss.org
>>> [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Bela Ban
>>> Sent: Tuesday, January 31, 2012 11:18 AM
>>> To: infinispan-dev at lists.jboss.org
>>> Subject: Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2
>>>
>>> I cannot volunteer either, but I find it important to be done in 5.2 !
>>>
>>> Unless rehashing works flawlessly with a large number of nodes joining
>>> at the same time, I think manual rehashing is crucial...
>>>
>>>
>>>
>>> On 1/31/12 5:13 PM, Sanne Grinovero wrote:
>>>> On 31 January 2012 16:06, Bela Ban<bban at redhat.com>  wrote:
>>>>> This is essentially what I suggested at the Lisbon meeting, right ?
>>>>
>>>> Yes!
>>>>
>>>>> I think Dan had a design wiki on this somewhere...
>>>>
>>>> Just raising it here as it was moved to 6.0, while I think it deserves
>>>> a dedicated thread to better think about it. If it's not hard, I think
>>>> it should be done sooner.
>>>> But while I started the thread to wake up the brilliant minds, I can't
>>>> volunteer for this to make it happen.
>>>>
>>>> Sanne
>>>>
>>>>>
>>>>>
>>>>> On 1/31/12 4:53 PM, Sanne Grinovero wrote:
>>>>>> I think this is an important feature to have soon;
>>>>>>
>>>>>> My understanding of it:
>>>>>>
>>>>>> We default with the feature off, and newly discovered nodes are
>>>>>> added/removed as usual. With a JMX-operable switch, one can disable
>>>>>> this:
>>>>>>
>>>>>> If a remote node is joining the JGroups view, but rehash is off: it
>>>>>> will be added to a to-be-installed view, but this won't be installed
>>>>>> until rehash is enabled again. This gives time to add more changes
>>>>>> before starting the rehash, and would help a lot to start larger
>>>>>> clusters.
>>>>>>
>>>>>> If the [self] node is booting and joining a cluster with manual rehash
>>>>>> off, the start process and any getCache() invocation should block and
>>>>>> wait for it to be enabled. This would need of course to override the
>>>>>> usually low timeouts.
>>>>>>
>>>>>> When a node is suspected it's a bit of a different story as we need to
>>>>>> make sure no data is lost. The principle is the same, but maybe we
>>>>>> should have two flags: one which is a "soft request" to avoid rehashes
>>>>>> of less than N members (and refuse N>=numOwners ?), one which is just
>>>>>> disable it and don't care: data might be in a cachestore, data might
>>>>>> not be important. Which reminds me, we should consider as well a JMX
>>>>>> command to flush the container to the CacheLoader.
>>>>>>
>>>>>> --Sanne
>>>>>> _______________________________________________
>>>>>> infinispan-dev mailing list
>>>>>> infinispan-dev at lists.jboss.org
>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>
>>>>> --
>>>>> Bela Ban
>>>>> Lead JGroups (http://www.jgroups.org)
>>>>> JBoss / Red Hat