[mod_cluster-dev] Session temporarily unavailable

Wed Jul 22 11:27:38 EDT 2009

On Wed, 2009-07-22 at 17:14 +0200, Bela Ban wrote:
> Paul Ferraro wrote:
> >>> When I have 2 nodes (node1, node2), and a session which is created on
> >>> node1 (and replicated to node2), the following can happen:
> >>>
> >>> * I shut down node1 (CTRL-C)
> >>> * node1 terminates cleanly
> >>> * I access the webapp, but get a "Session temporarily unavailable"
> >>> failure message (I guess a 500)
> >>> * When I check mod-cluster-manager, information about node1 is still
> >>> there !
> >>> * About 5 seconds later, mod-cluster-manager doesn't show node1 anymore
> >>> * When I now access the webapp, I get the proper failed over session
> >>> on node2 and all my data is still there
> >>>
> >>>
> >>> So my question is why doesn't node2 immediately tell httpd/mod-cluster
> >>> that node1 is gone ? It seems that httpd *itself* only learns about this
> >>> when it pings the socket to node1...
> > Bela Ban wrote:
> >
> > It should be. Do you see the appropriate STOP_APP, REMOVE_APP messages
> > in your httpd access log?
> 
> I do, right when I kill the node. I think the delay I've been seeing 
> could have been caused by having both ModClusterListener *and* the 
> mod-cluster integration bean in server.xml...
> 
> I don't see this issue anymore (I've also upgraded to mod-cluster 1.0.1 
> in the meantime)... I'll keep trying to reproduce this.
> 
> >>> I recall Paul once telling me we hadn't implemented that functionality,
> >>> but I guess by now this is surely implemented ?
> >
> > I never said that, but I think I know of the conversation to which you 
> > refer...
> > If you are running mod_cluster via HAModClusterService and node1 crashed
> > (instead of a clean shut down), node2 will not send any MCMP messages to
> > httpd on behalf of node1 when it receives a view change.
> 
> Ah, yes, I remember now ! I just killed -9 node1 and mod-cluster-manager 
> didn't remove node1 from its worker list.
> 
> Why don't we do this ? We cannot rely on a node being shut down gracefully !

Because currently a given node cannot distinguish between a crashed
member and a network partition.

> This could be very simple logic, executed by the singleton:
> 
> On becoming singleton:
> - Get the current view and remove all nodes from httpd which are not in 
> the current view
> 
> On view change:
> - Same as above, with the view shipped as argument to the view change
> 
> We have to think whether this would make sense for node additions, but 
> then again every new node registers itself, so maybe this is not necessary.

In the HASingleton case, the master node performs the registration on
behalf of the new node, but this is triggered via RPC from the new node,
not by a view change.
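
For illustration only, the "remove everything not in the current view"
step proposed above amounts to a set difference. The sketch below is not
mod_cluster code: ViewReconciler and staleNodes are hypothetical names,
node identities are reduced to plain strings, and the actual MCMP removal
call is left out. It also does nothing about the problem mentioned
earlier: a node missing from the view may have crashed or may merely be
on the other side of a network partition.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of mod_cluster.
final class ViewReconciler {

    // Returns the nodes that httpd still lists but that are absent
    // from the current cluster view (registered minus view).
    static Set<String> staleNodes(Set<String> registeredWithHttpd,
                                  List<String> currentView) {
        Set<String> stale = new LinkedHashSet<>(registeredWithHttpd);
        stale.removeAll(currentView);
        return stale;
    }

    public static void main(String[] args) {
        // Assumed state: httpd knows node1 and node2, the view only has node2.
        Set<String> registered = new LinkedHashSet<>(List.of("node1", "node2"));
        List<String> view = List.of("node2");

        // The singleton would run this on election and on every view change.
        for (String node : staleNodes(registered, view)) {
            System.out.println("would remove from httpd: " + node);
        }
    }
}
```

The same method would serve both triggers: on becoming singleton it is
called with the current view, and on a view change with the view passed
to the callback.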

> WDYT ?

See:
https://jira.jboss.org/jira/browse/MODCLUSTER-66