[infinispan-dev] design for cluster events (wiki page)

Bela Ban bban at redhat.com
Fri Nov 1 03:25:44 EDT 2013



On 10/31/13 11:20 PM, Sanne Grinovero wrote:
> On 31 October 2013 20:07, Mircea Markus <mmarkus at redhat.com> wrote:
>>
>> On Oct 31, 2013, at 3:45 PM, Dennis Reed <dereed at redhat.com> wrote:
>>
>>> On 10/31/2013 02:18 AM, Bela Ban wrote:
>>>>
>>>>> Also if we did have read only, what criteria would cause those nodes
>>>>> to be writeable again?
>>>> Once you become the primary partition, e.g. when a view is received
>>>> with view.size() >= N, where N is a predefined threshold. The exact
>>>> criterion can be different, as long as it is deterministic.
>>>>
>>>>> There is no guarantee when the other nodes
>>>>> will ever come back up or if there will ever be additional ones anytime soon.
>>>> If a system picks the Primary Partition approach, then it can become
>>>> completely inaccessible (read-only). In this case, I envisage that a
>>>> sysadmin will be notified, who can then start additional nodes for the
>>>> system to acquire primary partition and become accessible again.
>>>
>>> There should be a way to manually modify the primary partition status.
>>> So if the admin knows the nodes will never return, they can manually
>>> enable the partition.
>>
>> The status will be exposed through JMX at any point, regardless of whether there's a split brain going on or not.
>>
>>>
>>> Also, the PartitionContext should know whether the nodes left normally
>>> or not.
>>> If you have 5 nodes in a cluster, and you shut down 3 of them, you'll
>>> want the remaining two to remain available.
>>> But if there was a network partition, you wouldn't.  So it needs to know
>>> the difference.
>>
>> very good point again.
>> Thank you Dennis!
>
> Let's clarify. If 3 nodes out of 5 are killed without a
> reconfiguration, you do NOT want the remaining two to remain available
> unless explicitly told to do so by an admin. It is not possible to
> automatically make a distinction between 3 nodes being shut down vs. 3
> crashed nodes.


We could determine that a node left *gracefully* by sending an RPC
before leaving. In all other cases, we don't know whether a node got
partitioned away or whether it crashed.

For the graceful-leave case, we could allow the cluster to drop below
the read-only threshold and still remain available. This would increase
overall availability a bit.
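
To make this concrete, here is a rough sketch (not from the design wiki)
of how a node could combine the view.size() >= N check quoted above with
such a graceful-leave announcement. The PartitionMonitor class, the
GRACEFUL_LEAVE message and the majority-based choice of N are
illustrative assumptions only, not an agreed API:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class PartitionMonitor extends ReceiverAdapter {

    private static final String LEAVE_ANNOUNCEMENT = "GRACEFUL_LEAVE";

    private final int threshold; // N: minimum view size for the primary partition
    private final Set<Address> gracefulLeavers =
            Collections.newSetFromMap(new ConcurrentHashMap<Address, Boolean>());

    private volatile View lastView;
    private volatile boolean readOnly;

    public PartitionMonitor(int expectedClusterSize) {
        // simple majority here; any deterministic rule would do
        this.threshold = expectedClusterSize / 2 + 1;
    }

    public boolean isReadOnly() {
        return readOnly;
    }

    @Override
    public void receive(Message msg) {
        // A node that shuts down cleanly announces itself before closing its channel
        if (LEAVE_ANNOUNCEMENT.equals(msg.getObject()))
            gracefulLeavers.add(msg.getSrc());
    }

    @Override
    public void viewAccepted(View view) {
        List<Address> departed = new ArrayList<Address>();
        if (lastView != null) {
            departed.addAll(lastView.getMembers());
            departed.removeAll(view.getMembers());
        }
        lastView = view;

        // Anyone present again is no longer considered a graceful leaver
        gracefulLeavers.removeAll(view.getMembers());

        if (view.size() >= threshold) {
            readOnly = false; // primary partition (re)gained
        } else if (!departed.isEmpty() && gracefulLeavers.containsAll(departed)) {
            // All missing members said goodbye: allow shrinking below the
            // threshold without dropping to read-only
        } else {
            readOnly = true; // suspected crash or network partition
        }
    }

    /** To be called by a node just before it shuts down cleanly. */
    public static void announceLeave(JChannel channel) throws Exception {
        channel.send(new Message(null, LEAVE_ANNOUNCEMENT));
    }
}

A node shutting down cleanly would call PartitionMonitor.announceLeave(channel)
right before channel.close(); the others only flip to read-only when a member
disappears without that announcement.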


> In our face-to-face meeting we did point out that an admin needs hooks
> to be able to:
>   - specify how many nodes are expected in the full system (and adapt
> dynamically)
>   - trigger a "clean shutdown" of a node via an admin command (which was
> also discussed as a strong requirement in the scope of CacheStores, so
> I'm assuming the operation is defined already)
>
> The design Wiki has captured the API we discussed around the
> PartitionHandlingStrategy but is missing the details about these
> operations, which should probably be added to the PartitionContext as
> well.
>
> Also, in the scope of CacheStore consistency, we had discussed the need
> to store the set of nodes expected to be in the View: for example, when
> the grid is started and nodes are still finding each other, the Cache
> shall not be considered started until all required nodes have joined.
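
To anchor that discussion, here is a strawman of what those hooks could
look like on the PartitionContext. All method names below are invented
for illustration; they are not what the wiki currently defines:

import java.util.List;

import org.infinispan.remoting.transport.Address;

public interface PartitionContext {

   /** Members currently seen by this node. */
   List<Address> getCurrentMembers();

   /** Members that announced a clean shutdown before leaving. */
   List<Address> getCleanlyDepartedMembers();

   /** Number of nodes the admin expects in the full grid; adjustable at runtime. */
   int getExpectedClusterSize();

   void setExpectedClusterSize(int expectedSize);

   /** True once all expected members have joined, e.g. to gate cache startup. */
   boolean allExpectedMembersJoined();

   /** Admin override (e.g. via JMX): force this partition to become available. */
   void forceAvailable();

   /** Put the local partition into read-only mode until the primary partition is regained. */
   void enterDegradedMode();
}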


-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

