If we only want to deal with full cluster shutdown, then
I think stopping all application requests, calling
Cache.clear() on one node, and then shutting down all the
nodes should be simpler. On start, assuming no cache store,
the caches will start empty, so starting all the nodes at
once and only allowing application requests when they've all
joined should also work without any extra steps.
If we only want to stop a part of the cluster,
suppressing rebalancing would be better, because we wouldn't
lose all the data. But we'd still lose the keys whose owners
are all among the nodes we want to stop. I've discussed this
with Adrian, and we think if we want to stop a part of the
cluster without losing data we need a JMX operation on the
coordinator that will "atomically" remove a set of nodes
from the CH. After the operation completes, the user will
know it's safe to stop those nodes without losing data.
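To make the "keys whose owners are all among the nodes we want to stop" case concrete, here is a minimal Java sketch. It is only a toy model: the ownership table, the node names and the lostKeys helper are made up for illustration and are not Infinispan's consistent hash implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: not Infinispan's consistent hash implementation.
final class LostKeysSketch {

    // Returns the keys whose owners are all contained in nodesToStop,
    // i.e. the keys that would have no surviving copy if those nodes
    // were stopped while rebalancing is suppressed.
    static List<String> lostKeys(Map<String, List<String>> ownersByKey,
                                 Set<String> nodesToStop) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : ownersByKey.entrySet()) {
            if (nodesToStop.containsAll(e.getValue())) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Hypothetical ownership table with numOwners = 2 on nodes A..D.
        Map<String, List<String>> owners = new LinkedHashMap<>();
        owners.put("k1", List.of("A", "B"));
        owners.put("k2", List.of("B", "C"));
        owners.put("k3", List.of("C", "D"));
        owners.put("k4", List.of("D", "A"));

        // Stopping C and D: only k3 has both of its owners in the stop set.
        System.out.println(lostKeys(owners, Set.of("C", "D"))); // prints [k3]
    }
}
```

An operation that removes the nodes from the CH first would move those copies to the remaining nodes before anything is stopped, so the same check would come back empty.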
When it comes to starting a part of the cluster, a "pause
rebalancing" option would probably be better - but again, on
the coordinator, not on each joining node. And clearly, if
numOwners or more nodes leave while rebalancing is
suspended, data can be lost.
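A rough Java sketch of what a coordinator-side "pause rebalancing" switch could look like. Everything here is hypothetical (the class, its method names and the rebalance counter); it only illustrates the idea that joins are recorded while the switch is off and a single rebalance happens once it is turned back on.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a coordinator-side switch; all names are hypothetical
// and this is not Infinispan's actual implementation.
final class CoordinatorSketch {
    private final List<String> members = new ArrayList<>();   // nodes that have joined
    private final List<String> chMembers = new ArrayList<>(); // nodes currently in the CH
    private boolean rebalancingEnabled = true;
    private int rebalanceCount = 0;

    // Turning rebalancing back on catches up with any pending joiners.
    void setRebalancingEnabled(boolean enabled) {
        rebalancingEnabled = enabled;
        if (enabled) {
            rebalanceIfNeeded();
        }
    }

    // A join is always recorded, but only triggers a rebalance when enabled.
    void join(String node) {
        members.add(node);
        if (rebalancingEnabled) {
            rebalanceIfNeeded();
        }
    }

    private void rebalanceIfNeeded() {
        if (!chMembers.equals(members)) {
            chMembers.clear();
            chMembers.addAll(members);
            rebalanceCount++;
        }
    }

    int rebalanceCount() { return rebalanceCount; }

    List<String> chMembers() { return List.copyOf(chMembers); }

    public static void main(String[] args) {
        CoordinatorSketch c = new CoordinatorSketch();
        c.join("A");                      // first node, CH = [A]
        c.setRebalancingEnabled(false);   // pause before starting a batch of nodes
        c.join("B");
        c.join("C");
        c.join("D");
        System.out.println(c.rebalanceCount()); // prints 1 (only the initial join)
        c.setRebalancingEnabled(true);    // resume: one rebalance covers B, C and D
        System.out.println(c.rebalanceCount()); // prints 2
        System.out.println(c.chMembers());      // prints [A, B, C, D]
    }
}
```

Keeping the flag on the coordinator means the joining nodes need no special configuration, and the three joins cost one rebalance instead of three.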