[infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2

Dan Berindei dan.berindei at gmail.com
Mon Feb 6 01:52:50 EST 2012


On Mon, Feb 6, 2012 at 3:30 AM, Manik Surtani <manik at jboss.org> wrote:
>
> On 5 Feb 2012, at 16:24, Dan Berindei wrote:
>
>> On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani <manik at jboss.org> wrote:
>>>
>>> On 1 Feb 2012, at 12:23, Dan Berindei wrote:
>>>
>>>> Bela, you're right, this is essentially what we talked about in Lisbon:
>>>> https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign
>>>>
>>>> For joins I actually started working on a policy of coalescing joins
>>>> that happen one after the other in a short time interval. The current
>>>> implementation is very primitive, as I shifted focus to stability, but
>>>> it does coalesce joins that start within 1 second of another join (or
>>>> while that join is still running).
>>>>
>>>> I don't quite agree with Sanne's assessment that it's fine for
>>>> getCache() to block for 5 minutes until the administrator allows the
>>>> new node to join. We should modify startCaches() instead to signal to
>>>> the coordinator that we are ready to receive data for one or all of
>>>> the defined caches, and wait with a customizable time limit until the
>>>> caches have properly joined the cluster.
>>>>
>>>> The getCache() timeout should not be increased at all. Instead I would
>>>> propose that getCache() returns a functional cache immediately, even
>>>> if the cache didn't receive any data, and it works solely as an L1
>>>> cache until the administrator allows it to join. I'd even make it
>>>> possible to designate a cache as an L1-only cache, so it's never an
>>>> owner for any key.
>>>
>>> I presume this would be encoded in the Address?  That would make sense for a node permanently designated as an L1 node.  But then how would this work for a node temporarily acting as L1 only, until it has been allowed to join?  Change the Address instance on the fly?  A delegating Address?  :/
>>>
>>
>> Nope, not in the Address. Since we now have virtual cache views, every
>> node has to explicitly request to join each cache. It would be quite
>> easy to add an "L1-only" flag to the join request, and nodes with that
>> flag would never be included in the proper cache view. They would
>> however get a copy of the cache view on join and on every cache view
>> installation so they could update their CH and send requests to the
>> proper owners.
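
To make the L1-only flag a bit more concrete: the join request could
carry something like the sketch below. The class and field names are
made up for illustration, this is not the join command we have today.

   import java.io.Serializable;
   import org.infinispan.remoting.transport.Address;

   // Illustrative only - just the shape of the idea, not the real join
   // command.
   public class CacheJoinRequest implements Serializable {
      private final Address joiner;    // node asking to join the cache view
      private final String cacheName;  // which virtual cache view to join
      private final boolean l1Only;    // if true, never becomes an owner

      public CacheJoinRequest(Address joiner, String cacheName, boolean l1Only) {
         this.joiner = joiner;
         this.cacheName = cacheName;
         this.l1Only = l1Only;
      }

      public Address getJoiner()   { return joiner; }
      public String getCacheName() { return cacheName; }
      public boolean isL1Only()    { return l1Only; }
   }
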
>>
>> Nodes acting temporarily as L1-only would send a normal join request,
>> but they would also receive a copy of the cache view and getCache()
>> would return immediately instead of waiting for the node to receive
>> state. When the coordinator finally installs a cache view that
>> includes them, they will perform the initial state transfer as they do
>> now.
>
> Makes sense.
>
>>
>>
>>>> For leaves, the main problem is that every node has to compute the
>>>> same primary owner for a key, at all times. So we need a 2PC cache
>>>> view installation immediately after any leave to ensure that every
>>>> node determines the primary owner in the same way - we can't coalesce
>>>> or postpone leaves.
>>>
>>> Yes, manual rehashing would probably just be for joins.  Controlled shutdown in itself is manual, and crashes, well, need to be dealt with immediately IMO.
>>>
>>
>> We could extend the policy for crashes as well, by adding a
>> minNumOwners setting and only triggering an automatic rehash when
>> there is a segment on the hash wheel with <= minNumOwners owners.
>>
>> We would have a properly installed CH that guarantees each key had
>> numOwners owners at some point in the past, and a filter on top of
>> it that removes any leavers from the result of DM.locate(),
>> DM.getPrimaryLocation() etc.
>>
>> It would probably undo our recent optimizations around locate and
>> getPrimaryLocation, so it would slow down the normal case (without
>> any leavers) in order to make the exceptional case (organized
>> shutdown of a part of the cluster) faster. The question is how big
>> the cluster has to get before the exceptional case becomes common
>> enough that it's worth optimizing for…
>
> Well, we could just leave the crashed node in the owners list.  When any caller attempts to do something with this list, i.e., perform an RPC, don't we check again at that time and remove leavers from an RPC recipient list?
>

We do check against the JGroups cluster membership, but we only remove
leavers from the recipient list if the response mode is
SYNCHRONOUS_IGNORE_LEAVERS. If the response mode is SYNCHRONOUS, we
fail the RPC with a SuspectException. If the response mode is
ASYNCHRONOUS, we don't care about the response anyway, so we don't do
anything.
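
In pseudo-Java the leaver handling looks roughly like this (a
simplification of the behaviour described above, not the actual
transport code):

   import java.util.ArrayList;
   import java.util.List;

   import org.infinispan.remoting.rpc.ResponseMode;
   import org.infinispan.remoting.transport.Address;
   import org.infinispan.remoting.transport.jgroups.SuspectException;

   // Rough simplification of how leavers affect an RPC's recipient list.
   final class LeaverHandlingSketch {
      static List<Address> checkRecipients(List<Address> targets,
            List<Address> jgroupsView, ResponseMode mode) {
         List<Address> recipients = new ArrayList<Address>(targets);
         switch (mode) {
            case SYNCHRONOUS_IGNORE_LEAVERS:
               // silently drop nodes that already left the JGroups view
               recipients.retainAll(jgroupsView);
               break;
            case SYNCHRONOUS:
               // any missing member fails the whole RPC
               for (Address target : recipients) {
                  if (!jgroupsView.contains(target))
                     throw new SuspectException(target + " has left");
               }
               break;
            default:
               // asynchronous modes never wait for responses, so
               // leavers are simply ignored
               break;
         }
         return recipients;
      }
   }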

In SYNCHRONOUS mode, we then catch the SuspectExceptions in
StateTransferLockInterceptor and retry the command after waiting for
state transfer to end (even if there is no state transfer in
progress, we know there will be one because of the SuspectException).
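
The retry is basically a loop of this shape (a hand-wavy sketch;
waitForStateTransferToEnd() is a stand-in for whatever the state
transfer lock exposes, and the real interceptor handles locks and
timeouts on top of it):

   import org.infinispan.remoting.transport.jgroups.SuspectException;

   // Hand-wavy sketch of the retry-on-suspect behaviour described above.
   // RemoteCall and waitForStateTransferToEnd() are stand-ins; the real
   // logic lives in StateTransferLockInterceptor and does a lot more.
   final class SuspectRetrySketch {
      interface RemoteCall { Object invoke() throws Exception; }

      Object invokeRetryingOnSuspect(RemoteCall call) throws Exception {
         while (true) {
            try {
               return call.invoke();
            } catch (SuspectException suspected) {
               // A suspected owner means a cache view change is coming
               // (or already in progress), so wait for the rehash to end
               // and the new owners to be installed, then retry.
               waitForStateTransferToEnd();
            }
         }
      }

      private void waitForStateTransferToEnd() {
         // stand-in: delegate to the state transfer lock
      }
   }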

This ensures that we always execute the command on the current primary
owner - if we did not fail PREPAREs when the primary owner is down, we
could have multiple PREPAREs claiming to have locked the same key at
the same time.

Anyway, we can't rely on the JGroups view membership; we need to check
against the cache view membership. When a DefaultCacheManager stops,
it will first stop each cache and only then stop the global components
like JGroupsTransport. So there is a time interval where the cache
will not respond to RPCs but will still be part of the JGroups view,
even if we don't consider the possibility of users stopping a single
cache manually.
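
So any "is this node still a valid target" check should intersect with
the cache view members, something like this (illustrative code, not
something we have today):

   import java.util.ArrayList;
   import java.util.List;

   import org.infinispan.remoting.transport.Address;

   // Illustrative only: intersect the RPC targets with the *cache* view
   // members, because a node can still be in the JGroups view after its
   // caches have already been stopped.
   final class CacheViewRecipientFilter {
      static List<Address> validTargets(List<Address> targets,
            List<Address> cacheViewMembers) {
         List<Address> result = new ArrayList<Address>(targets);
         result.retainAll(cacheViewMembers);
         return result;
      }
   }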

Cheers
Dan


