[mod_cluster-dev] Handling crashed/hung AS nodes

Fri Mar 27 17:14:52 EDT 2009

Paul Ferraro wrote:
> On Fri, 2009-03-27 at 09:15 -0500, Brian Stansberry wrote:
>> jean-frederic clere wrote:
>>> Paul Ferraro wrote:
>>>> Currently, the HAModClusterService (where httpd communication is
>>>> coordinated by an HA singleton) does not react to crashed/hung members.
>>>> Specifically, when the HA singleton gets a callback that the group
>>>> membership changes, it does not send any REMOVE-APP messages to httpd on
>>>> behalf of the member that just left.  Currently, httpd will detect the
>>>> failure (via a disconnected socket) on its own and set its internal
>>>> state accordingly, e.g. a STATUS message will return NOTOK.
>>>>
>>>> The non-handling of dropped members is actually a good thing in the
>>>> event of a network partition, where communication between nodes is lost,
>>>> but communication between httpd and the nodes is unaffected.  If we were
>>>> handling dropped members, we would have to handle the ugly scenario
>>>> described here:
>>>> https://jira.jboss.org/jira/browse/MODCLUSTER-66
>>>>
>>>> Jean-Frederic: a few questions...
>>>> 1. Is it up to the AS to drive the recovery of a NOTOK node when it
>>>> becomes functional again?
>>> Yes.
>>>
>>>>  In the case of a crashed member, fresh
>>>> CONFIG/ENABLE-APP messages will be sent upon node restart.  In the case
>>>> of a re-merged network partition, no additional messages are sent.  Is
>>>> the subsequent STATUS message (with a non-zero lbfactor) enough to
>>>> trigger the recovery of this node?
>>> Yes.
> 
> Good to know.
>  
>>>> 2. Can httpd detect hung nodes?  A hung node will not affect the
>>>> connected state of the AJP/HTTP/S connector - it could only detect this
>>>> by sending data to the connector and timing out on the response.
>>> The hung node will be detected and marked as broken but the 
>>> corresponding request(s) may be delayed or lost due to time-out.
>>>
>> How long does this take, say in a typical case where the hung node was 
>> up and running with a pool of AJP connections open? Is it the 10 secs, 
>> the default value of the "ping" property listed at 
>> https://www.jboss.org/mod_cluster/java/properties.html#proxy ?
> 
> I think he's talking about "nodeTimeout".

Yes.

> 
>> Also, if a request is being handled by a hung node and the 
>> HAModClusterService tells httpd to stop that node, the request will 
>> fail, yes? It shouldn't just fail over, as it may have already caused 
>> the transfer of my $1,000,000 to my secret account at UBS. Failing over 
>> would cause transfer of a second $1,000,000 and sadly I don't have that 
>> much.
> 
> Not unlike those damn double-clickers...

maxAttempts allows you to control the number of retry (maxAttemps = 0 
means no retry.).
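
As a rough illustration of what that setting controls (a sketch only, not the
actual mod_cluster code): the function below assumes that a value of N allows
the first try plus at most N retries on other nodes. The names
send_with_retries, NodeTimeout, hung and healthy are hypothetical and exist
only for this example.

```python
class NodeTimeout(Exception):
    """Hypothetical error: the node did not answer within nodeTimeout."""


def hung(request):
    # Stands in for a hung node: the request always times out.
    raise NodeTimeout(request)


def healthy(request):
    # Stands in for a working node.
    return "ok"


def send_with_retries(request, nodes, max_attempts):
    """Send to the first node, then retry on others at most max_attempts times.

    nodes is a list of (name, handler) pairs. Returns (response, tried),
    where response is None if every permitted attempt timed out.
    Assumption: max_attempts counts retries only, so 0 means one single try.
    """
    tried = []
    for name, handler in nodes[: max_attempts + 1]:
        tried.append(name)
        try:
            return handler(request), tried
        except NodeTimeout:
            continue  # fall through to the next node, if one is permitted
    return None, tried


nodes = [("node1", hung), ("node2", healthy)]
# With 0, the request fails on node1 and is never replayed elsewhere.
no_retry = send_with_retries("transfer", nodes, 0)
# With 1, the same request is replayed once on node2.
one_retry = send_with_retries("transfer", nodes, 1)
```

With 0, the request that timed out on the hung node is not sent again, which
is what you want for something like the money transfer above.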

Cheers

Jean-Frederic