Yes, ISPN-1394 has a broader scope but the proposed solution for ISPN-3140 solves quite a lot of ISPN-1394 and it's not complex. We might not even need ISPN-1394 soon unless somebody really wants to control data ownership down to segment granularity. If we only want to batch joins/leaves and manually kick out nodes with or without loosing their data then this proposal should be enough. This solution should not prevent implementation of ISPN-1394 in future and will not need to be removed/undone.

Here are the details:

1. Add a JMX writable attribute (or operation?) to ClusterTopologyManager (name it suppressRehashing?) that is false by default but should also be configurable via API or xml. While this attribute is true the ClusterTopologyManager queues all join/leave/exclude(see below) requests and does not execute them on the spot as it would normally happen. The value of this attribute is ignored on all nodes but the coordinator. When it is set back to false all queued operations (except the ones that cancel eachother out) are executed. The setter should be synchronous so when setting is back to false it does not return until the queue is empty and all rehashing was processed.

2. We add a JMX operation excludeNodes(list of addresses) to ClusterTopologyManager. Calling this method on any node but the coordinator is no-op. This operation removes the node from the topology (almost as if it left) and forces a rebalance. The node is still present in the current CH but not in the pending CH. It's basically disowned by all its data which is now being transferred to other (not excluded) nodes. At the end of the rebalance the node is removed from topology for good and can be shut down without loosing data. Note that if suppressRehashing==false operation excludeNodes(..) just queues them for later removal. We can batch multiple such exclusions and then re-activate the rehashing.

The parts that need to be implemented are written in italic above. Everything else is already there.

excludeNodes is a way of achieving a soft shutdown and should be used only if we care about preserving data int the extreme case where the nodes are the last/single owners. We can just kill the node directly if we do not care about its data.

suppressRehashing is a way of achieving some kind of batching of topology changes. This should speed up state transfer a lot because it avoids a lot of pointless reshuffling of data segments when we have many successive joiners/leavers.

So what happens if the current coordinator dies for whatever reason? The new one will take control and will not have knowledge of the existing rehash queue or the previous status of suppressRehashing attribute so it will just get the current cache membership status from all members of current view and proceed with the rehashing as usual. If the user does not want this he can set a default value of true for suppressRehashing. The admin has to interact now via JMX with the new coordinator. But that's not as bad as the alternative where all the nodes are involved in this jmx scheme :) I think having only the coordinator involved in this is a plus.

Manik, how does this fit for the full and partial shutdown?

Cheers
Adi

On 05/31/2013 04:20 PM, Manik Surtani wrote: