[
https://issues.jboss.org/browse/ISPN-3140?page=com.atlassian.jira.plugin....
]
Adrian Nistor edited comment on ISPN-3140 at 6/7/13 10:51 AM:
--------------------------------------------------------------
Pasted from Adrian's message ([http://markmail.org/message/ns7aojy7v7su2t7p]):
{quote}
1. Add a JMX writable attribute (or operation?) to ClusterTopologyManager (name it
suppressRehashing?) that is false by default but should also be configurable via API or
XML. While this attribute is true the ClusterTopologyManager queues all
join/leave/exclude (see below) requests and does not execute them on the spot as it
normally would. [...] When it is set back to false all queued operations (except the ones
that cancel each other out) are executed. The setter should be synchronous, so when setting
it back to false it does not return until the queue is empty and all rehashing has been
processed.
2. We add a JMX operation excludeNodes(list of addresses) to ClusterTopologyManager. [...]
This operation removes the node from the topology (almost as if it left) and forces a
rebalance. The node is still present in the current CH but not in the pending CH. It's
basically disowned by all its data which is now being transferred to other (not excluded)
nodes. At the end of the rebalance the node is removed from topology for good and can be
shut down without losing data. Note that if suppressRehashing==true, the
excludeNodes(..) operation just queues them for later removal. We can batch multiple such
exclusions and then re-activate the rehashing.
The parts that need to be implemented are written in italic above. Everything else is
already there.
excludeNodes is a way of achieving a soft shutdown and should be used only if we care
about preserving data in the extreme case where the nodes are the last/single owners. We
can just kill the node directly if we do not care about its data.
suppressRehashing is a way of achieving some kind of batching of topology changes. This
should speed up state transfer a lot because it avoids a lot of pointless reshuffling of
data segments when we have many successive joiners/leavers.
So what happens if the current coordinator dies for whatever reason? The new one will take
control and will not have knowledge of the existing rehash queue or the previous status of
the suppressRehashing attribute, so it will just get the current cache membership status
from all members of the current view and proceed with the rehashing as usual. If the user
does not want this, he can set a default value of true for suppressRehashing. The admin now
has to interact via JMX with the new coordinator. But that's not as bad as the
alternative where all the nodes are involved in this JMX scheme :) I think having only the
coordinator involved in this is a plus.
{quote}
We're actually going to implement only point 1 now, and point 2 will be a separate
issue (or perhaps part of ISPN-1394).
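The queueing behaviour described in point 1 could look roughly like the sketch below. It is not Infinispan code: the class RehashQueueSketch, the Kind enum and the applied() log are hypothetical names used only for illustration, and the cancel-out rule (opposite requests for the same node cancel) is an assumption, since the quoted message does not say which requests cancel each other.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a suppressRehashing queue; not ClusterTopologyManager itself. */
public class RehashQueueSketch {
    public enum Kind { JOIN, LEAVE }

    // false by default, as proposed in point 1.
    private boolean suppressRehashing = false;
    // At most one pending request per node, kept in arrival order.
    private final Map<String, Kind> pending = new LinkedHashMap<>();
    // Stand-in for the real rebalance: records what was executed.
    private final List<String> applied = new ArrayList<>();

    /** Executes the request at once, or queues it while suppression is on. */
    public synchronized void request(String node, Kind kind) {
        if (!suppressRehashing) {
            apply(node, kind);
            return;
        }
        Kind queued = pending.get(node);
        if (queued != null && queued != kind) {
            // Assumption: opposite requests for the same node cancel out.
            pending.remove(node);
        } else {
            pending.put(node, kind);
        }
    }

    /** Synchronous setter: when set back to false it returns only after the queue is drained. */
    public synchronized void setSuppressRehashing(boolean suppress) {
        suppressRehashing = suppress;
        if (!suppress) {
            pending.forEach(this::apply);
            pending.clear();
        }
    }

    private void apply(String node, Kind kind) {
        applied.add(kind + ":" + node);
    }

    /** Returns a copy of the executed changes, in execution order. */
    public synchronized List<String> applied() {
        return new ArrayList<>(applied);
    }
}
```

With suppression on, a JOIN for node A followed by a LEAVE for node A leaves nothing queued for A, so only the surviving requests run once the flag is set back to false.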
JMX operation to suppress state transfer
----------------------------------------
Key: ISPN-3140
URL: https://issues.jboss.org/browse/ISPN-3140
Project: Infinispan
Issue Type: Feature Request
Components: Distributed Cache, State transfer
Affects Versions: 5.2.6.Final
Reporter: Manik Surtani
Assignee: Mircea Markus
Fix For: 5.3.0.Final
This feature request is to expose a JMX operation on each node, to suppress state
transfer for a period of time. This flag would be {{false}} by default.
The use case of this flag would be to ease bringing down (and up) a cluster for
maintenance work. A typical workflow would be:
1) Shut down application requests to the data grid
2) Suppress state transfer on all nodes via JMX
3) Bring down all nodes
4) Perform maintenance work
5) Bring up nodes, one at a time. As each node comes up, disable state transfer for the
node via JMX.
6) Once all nodes are up, enable state transfer for each node again via JMX
7) Allow application requests to reach the grid again.
The purpose of this is to allow smooth and fast shutdown and startup, and to remove the
risk of OOM errors (when bringing a grid down).
This is a small but useful subset of full manual state transfer as defined in ISPN-1394.
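As a rough illustration of what a writable JMX flag of this kind involves, the sketch below registers a boolean attribute and toggles it through the standard javax.management API. The MBean interface, the attribute name SuppressStateTransfer and the ObjectName are all hypothetical; the issue does not define the names Infinispan would actually expose.

```java
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class SuppressFlagJmxSketch {
    /** Hypothetical management interface; not an Infinispan MBean. */
    public interface StateTransferControlMBean {
        boolean isSuppressStateTransfer();
        void setSuppressStateTransfer(boolean suppress);
    }

    /** Holds the flag; false by default, as the feature request asks. */
    public static class StateTransferControl implements StateTransferControlMBean {
        private volatile boolean suppress = false;

        @Override
        public boolean isSuppressStateTransfer() {
            return suppress;
        }

        @Override
        public void setSuppressStateTransfer(boolean suppress) {
            this.suppress = suppress;
        }
    }

    /** Registers the flag under a made-up ObjectName and returns that name. */
    public static ObjectName register(MBeanServer server) throws Exception {
        ObjectName name = new ObjectName("example.sketch:type=StateTransferControl");
        server.registerMBean(
                new StandardMBean(new StateTransferControl(), StateTransferControlMBean.class),
                name);
        return name;
    }

    /** Writes the attribute via JMX and reads it back to confirm the new value. */
    public static boolean setSuppressed(MBeanServer server, ObjectName name, boolean value)
            throws Exception {
        server.setAttribute(name, new Attribute("SuppressStateTransfer", value));
        return (Boolean) server.getAttribute(name, "SuppressStateTransfer");
    }
}
```

For steps 2, 5 and 6 of the workflow, an admin tool would presumably make the same setAttribute call against each node through a remote MBeanServerConnection rather than a local MBeanServer.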