[mod_cluster-dev] Handling crashed/hung AS nodes

Thu Mar 26 18:26:39 EDT 2009

Paul Ferraro wrote:
> Currently, the HAModClusterService (where httpd communication is
> coordinated by an HA singleton) does not react to crashed/hung members.
> Specifically, when the HA singleton gets a callback that the group
> membership changes, it does not send any REMOVE-APP messages to httpd on
> behalf of the member that just left.  Currently, httpd detects the
> failure (via a disconnected socket) on its own and sets its internal
> state accordingly, e.g. a STATUS message will return NOTOK.
> 
> The non-handling of dropped members is actually a good thing in the
> event of a network partition, where communication between nodes is lost,
> but communication between httpd and the nodes is unaffected.  If we were
> handling dropped members, we would have to handle the ugly scenario
> described here:
> https://jira.jboss.org/jira/browse/MODCLUSTER-66
> 
> Jean-Frederic: a few questions...
> 1. Is it up to the AS to drive the recovery of a NOTOK node when it
> becomes functional again?

Yes.

>  In the case of a crashed member, fresh
> CONFIG/ENABLE-APP messages will be sent upon node restart.  In the case
> of a re-merged network partition, no additional messages are sent.  Is
> the subsequent STATUS message (with a non-zero lbfactor) enough to
> trigger the recovery of this node?

Yes.
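
To make the recovery path concrete, here is a minimal Java sketch of the
proxy-side bookkeeping as described in this thread. It is a simplified
model, not httpd or mod_cluster code: the class ProxyNodeTable and its
method names are hypothetical, "non-zero lbfactor" is read as "positive",
and the handling of an unknown node is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the per-node state the proxy keeps.
class ProxyNodeTable {
    enum State { OK, NOTOK }

    private final Map<String, State> states = new HashMap<>();

    // A CONFIG message registers the node (e.g. after a restart).
    void onConfig(String jvmRoute) {
        states.put(jvmRoute, State.OK);
    }

    // A disconnected socket marks a known node as NOTOK.
    void onDisconnect(String jvmRoute) {
        if (states.containsKey(jvmRoute)) {
            states.put(jvmRoute, State.NOTOK);
        }
    }

    // A STATUS message with a positive lbfactor recovers the node;
    // any other value leaves the recorded state unchanged.
    // Assumption: an unknown node is reported as NOTOK.
    State onStatus(String jvmRoute, int lbfactor) {
        if (!states.containsKey(jvmRoute)) {
            return State.NOTOK;
        }
        if (lbfactor > 0) {
            states.put(jvmRoute, State.OK);
        }
        return states.get(jvmRoute);
    }
}
```

In this model a re-merged partition needs no fresh CONFIG/ENABLE-APP: the
next STATUS from the node is what flips it back to OK.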

> 2. Can httpd detect hung nodes?  A hung node will not affect the
> connected state of the AJP/HTTP/S connector - it could only detect this
> by sending data to the connector and timing out on the response.

The hung node will be detected and marked as broken, but the
corresponding request(s) may be delayed or lost due to time-out.
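
The difference between a crashed and a hung node can be sketched in Java
as follows. This is only an illustration of timeout-based detection, not
how httpd implements it: HungNodeProbe, Result and the ping callable are
hypothetical names, and the ping stands in for whatever request is sent
to the connector.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical probe: classifies a node by how its ping behaves.
class HungNodeProbe {
    enum Result { OK, HUNG, DOWN }

    static Result probe(Callable<Boolean> ping, long timeoutMillis) {
        // Daemon thread so a ping that never returns cannot block JVM exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Boolean> reply = executor.submit(ping);
        try {
            // A reply within the timeout: the node is alive if it says so.
            Boolean ok = reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(ok) ? Result.OK : Result.DOWN;
        } catch (TimeoutException e) {
            // Connection is up but nothing comes back: treat as hung.
            return Result.HUNG;
        } catch (ExecutionException e) {
            // The ping itself failed (e.g. connection refused): node is down.
            return Result.DOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Result.DOWN;
        } finally {
            reply.cancel(true);
            executor.shutdownNow();
        }
    }
}
```

The caller pays the full timeout before a hung node is recognised, which
is why the request that discovers the hang may be delayed or lost.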

> 
> And some questions for open discussion:
> What does HAModClusterService really buy us over the normal
> ModClusterService?  Do the benefits outweigh the complexity?
>  * Maintains a uniform view of proxy status across each AS node
>  * Can detect and send STOP-APP/REMOVE-APP messages on behalf of
> hung/crashed nodes (if httpd cannot already do this) (not yet
> implemented)
>    + Requires special handling of network partitions
>  * Potentially improve scalability by minimizing network traffic for
> very large clusters.
>    e.g. non-masters ping httpd less often
>  * Anything else?

Well, I prefer the Java code, rather than httpd, deciding whether a node
is broken. I really want to keep the complexity in httpd to a minimum,
and the talks I have had so far at ApacheCon seem to show that this is
probably the best way to go.
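
As a rough illustration of where a Java-side decision could start, the
sketch below computes which members left between two group views. It is
not HAModClusterService code: ViewChange and departed are hypothetical
names, and members are represented as plain strings for brevity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for a membership-change callback.
class ViewChange {
    // Returns the members present in oldView but missing from newView,
    // in their original order.
    // Note: this alone cannot tell a crashed node from one that is on
    // the other side of a network partition, so it is only a candidate
    // list, not a verdict that a node is broken.
    static Set<String> departed(List<String> oldView, List<String> newView) {
        Set<String> gone = new LinkedHashSet<>(oldView);
        gone.removeAll(newView);
        return gone;
    }
}
```

Whether to send REMOVE-APP for each departed member is the part that
needs the partition handling discussed in MODCLUSTER-66.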

Cheers

Jean-Frederic

> 
> Paul
> 