Re: [mod_cluster-dev] Handling crashed/hung AS nodes

Friday, 27 March 2009

Paul Ferraro wrote:
...
 On Fri, 2009-03-27 at 09:15 -0500, Brian Stansberry wrote:
> jean-frederic clere wrote:
>> Paul Ferraro wrote:
>>> Currently, the HAModClusterService (where httpd communication is
>>> coordinated by an HA singleton) does not react to crashed/hung members.
>>> Specifically, when the HA singleton gets a callback that the group
>>> membership changes, it does not send any REMOVE-APP messages to httpd on
>>> behalf of the member that just left.  Currently, httpd will detect the
>>> failure (via a disconnected socket) on its own and set its internal
>>> state accordingly, e.g. a STATUS message will return NOTOK.
>>>
>>> The non-handling of dropped members is actually a good thing in the
>>> event of a network partition, where communication between nodes is lost,
>>> but communication between httpd and the nodes is unaffected.  If we were
>>> handling dropped members, we would have to handle the ugly scenario
>>> described here:
>>> https://jira.jboss.org/jira/browse/MODCLUSTER-66
>>>
>>> Jean-Frederic: a few questions...
>>> 1. Is it up to the AS to drive the recovery of a NOTOK node when it
>>> becomes functional again?
>> Yes.
>>
>>>  In the case of a crashed member, fresh
>>> CONFIG/ENABLE-APP messages will be sent upon node restart.  In the case
>>> of a re-merged network partition, no additional messages are sent.  Is
>>> the subsequent STATUS message (with a non-zero lbfactor) enough to
>>> trigger the recovery of this node?
>> Yes.

 Good to know.
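
For illustration only, the recovery behaviour confirmed above could be modelled roughly as below. This is a sketch under assumptions, not httpd or mod_cluster code: the class NodeStateSketch and its method names are hypothetical, and the "non-zero lbfactor" test simply follows the wording used in this thread.

```java
// Hypothetical model of how httpd tracks a node's state, as described in this thread.
public class NodeStateSketch {
    public enum State { OK, NOTOK }

    private State state = State.OK;

    // httpd notices the disconnected socket on its own and marks the node as failed.
    public void onSocketDisconnected() {
        state = State.NOTOK;
    }

    // A later STATUS message with a non-zero lbfactor is enough to recover the node.
    // A zero lbfactor leaves the current state untouched in this sketch.
    public void onStatus(int lbfactor) {
        if (lbfactor != 0) {
            state = State.OK;
        }
    }

    public State state() {
        return state;
    }
}
```

In this model a re-merged partition needs no fresh CONFIG/ENABLE-APP: the next STATUS carrying a non-zero lbfactor flips the node back to OK.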

>>> 2. Can httpd detect hung nodes?  A hung node will not affect the
>>> connected state of the AJP/HTTP/S connector - it could only detect this
>>> by sending data to the connector and timing out on the response.
>> The hung node will be detected and marked as broken but the 
>> corresponding request(s) may be delayed or lost due to time-out.
>>
> How long does this take, say in a typical case where the hung node was 
> up and running with a pool of AJP connections open? Is it the 10 secs, 
> the default value of the "ping" property listed at 
> https://www.jboss.org/mod_cluster/java/properties.html#proxy ?

 I think he's talking about "nodeTimeout". 
Yes.

...

> Also, if a request is being handled by a hung node and the 
> HAModClusterService tells httpd to stop that node, the request will 
> fail, yes? It shouldn't just fail over, as it may have already caused 
> the transfer of my $1,000,000 to my secret account at UBS. Failing over 
> would cause transfer of a second $1,000,000 and sadly I don't have that 
> much.

 Not unlike those damn double-clickers... 
maxAttempts lets you control the number of retries (maxAttempts = 0 
means no retry).
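
As a rough sketch of that semantic (not the actual proxy code), a retry count of 0 means the request is tried exactly once and any failure goes straight back to the caller. The RetrySketch class and its send method are hypothetical names, and treating maxAttempts as "number of retries after the first try" is an assumption based on the sentence above.

```java
import java.util.function.Supplier;

// Hypothetical illustration of a retry limit where 0 means "never retry".
public class RetrySketch {

    // Runs the request once, then retries at most maxAttempts more times on failure.
    public static <T> T send(int maxAttempts, Supplier<T> request) {
        if (maxAttempts < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure; retry only if attempts remain
            }
        }
        throw last; // every permitted attempt failed
    }
}
```

With maxAttempts = 0 a non-idempotent request such as the money transfer is never replayed on another node; the failure is reported instead.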

Cheers

Jean-Frederic
