----- Original Message -----
From: "Dan Berindei" <dan.berindei(a)gmail.com>
>> numOwners==2 is and will very likely remain the most common case,
>> particularly for small clusters.
>>
>> But if we have two sites, it makes sense to configure 2 owners per
>> site. If only one node goes down, the surviving owner will supply
>> state to the new owner. If both nodes go down, the new owners will
>> fetch the data from the other site. So while 2 nodes going down will
>> be quite costly, it should be infrequent enough that it's worth
>> optimizing for the more frequent "1 node goes down and then comes
>> back up" case.
>>
> Agreed; this mixed batching (leaves with joins) makes sense for
> non-site clusters as well.
>
Yup, even with numOwners == 2 you could say that 2 nodes dying in 1
minute is highly improbable. So you can delay rehashes triggered by
leaves by 1 minute just in case the node comes back up.
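
To make the idea concrete, here is a minimal sketch of that kind of grace
period. It is not Infinispan code: the DelayedRehashScheduler class, its
method names, and the 60 second default are all hypothetical, and the caller
passes in the clock value so the logic stays deterministic. It also says
nothing about whether the returning node still holds usable data.

```python
class DelayedRehashScheduler:
    """Hypothetical helper: postpones leave-triggered rehashes by a grace period."""

    def __init__(self, grace_seconds=60.0):
        # Assumed default of one minute, taken from the discussion above.
        self.grace_seconds = grace_seconds
        self._pending = {}  # node name -> time at which its rehash becomes due

    def on_leave(self, node, now):
        # Do not rehash right away; only record when the rehash would be due.
        self._pending[node] = now + self.grace_seconds

    def on_join(self, node):
        # The node came back: cancel its pending rehash, if there was one.
        # Returns True when a pending rehash was cancelled.
        return self._pending.pop(node, None) is not None

    def due(self, now):
        # Return (and forget) the nodes whose grace period has expired.
        expired = sorted(n for n, deadline in self._pending.items()
                         if deadline <= now)
        for n in expired:
            del self._pending[n]
        return expired


scheduler = DelayedRehashScheduler(grace_seconds=60.0)
scheduler.on_leave("A", now=0.0)
scheduler.on_join("A")                   # A returns within the grace period
rehash_now = scheduler.due(now=120.0)    # nothing left to rehash for A
```

A periodic task would call due() and start a rehash only for the nodes it
returns, so a quick restart costs no rehash at all.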
>> > For total shutdown, I guess we can use other means than rehash,
>> > e.g. a specific command that would disable it and start flushing
>> > to the store.
>> >
>>
>> I think just stopping the cache is enough to get it to flush data to
>> the store with passivation enabled.
> ATM, wouldn't the shutdown of a cluster of servers trigger a rehash
> storm?
>
Right. I was commenting only on the cache store part, which should be
completely orthogonal to graceful shutdown.
I think we need a mechanism to better handle partial cluster shutdown,
and that mechanism can be used for full shutdown as well.
+1. Shutting down a cluster can be done much more efficiently than stopping
individual nodes in sequence.