Paul Ferraro wrote:
On Fri, 2009-03-27 at 09:15 -0500, Brian Stansberry wrote:
> jean-frederic clere wrote:
>> Paul Ferraro wrote:
>>> Currently, the HAModClusterService (where httpd communication is
>>> coordinated by an HA singleton) does not react to crashed/hung members.
>>> Specifically, when the HA singleton gets a callback that the group
>>> membership changes, it does not send any REMOVE-APP messages to httpd on
>>> behalf of the member that just left. Currently, httpd will detect the
>>> failure (via a disconnected socket) on its own and set its internal
>>> state accordingly, e.g. a STATUS message will return NOTOK.
>>>
>>> The non-handling of dropped members is actually a good thing in the
>>> event of a network partition, where communication between nodes is lost,
>>> but communication between httpd and the nodes is unaffected. If we were
>>> handling dropped members, we would have to handle the ugly scenario
>>> described here:
>>>
>>> https://jira.jboss.org/jira/browse/MODCLUSTER-66
>>>
>>> Jean-Frederic: a few questions...
>>> 1. Is it up to the AS to drive the recovery of a NOTOK node when it
>>> becomes functional again?
>> Yes.
>>
>>> In the case of a crashed member, fresh
>>> CONFIG/ENABLE-APP messages will be sent upon node restart. In the case
>>> of a re-merged network partition, no additional messages are sent. Is
>>> the subsequent STATUS message (with a non-zero lbfactor) enough to
>>> trigger the recovery of this node?
>> Yes.
Good to know.
>>> 2. Can httpd detect hung nodes? A hung node will not affect the
>>> connected state of the AJP/HTTP/S connector - it could only detect this
>>> by sending data to the connector and timing out on the response.
>> The hung node will be detected and marked as broken but the
>> corresponding request(s) may be delayed or lost due to time-out.
>>
> How long does this take, say in a typical case where the hung node was
> up and running with a pool of AJP connections open? Is it the 10 secs,
> the default value of the "ping" property listed at
>
> https://www.jboss.org/mod_cluster/java/properties.html#proxy ?
I think he's talking about "nodeTimeout".
Yes.
> Also, if a request is being handled by a hung node and the
> HAModClusterService tells httpd to stop that node, the request will
> fail, yes? It shouldn't just fail over, as it may have already caused
> the transfer of my $1,000,000 to my secret account at UBS. Failing over
> would cause transfer of a second $1,000,000 and sadly I don't have that
> much.
Not unlike those damn double-clickers...
maxAttempts lets you control the number of retries (maxAttempts = 0
means no retry).
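To make the retry semantics concrete, here is a minimal sketch in Java. It is not mod_cluster code: the class FailoverSketch, the method send and the predicate are hypothetical names, and the only assumption carried over from above is that a request is tried once and then retried at most maxAttempts times.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration of the retry semantics described above;
// this is NOT how mod_cluster is implemented.
public class FailoverSketch {

    /**
     * Sends a request, retrying on another node after each failure.
     * Assumes one initial try plus at most maxAttempts retries.
     *
     * @param maxAttempts  number of retries allowed (0 = no retry)
     * @param nodeSucceeds tells whether the try with the given index succeeds
     * @return how many nodes actually received the request
     */
    static int send(int maxAttempts, IntPredicate nodeSucceeds) {
        int sent = 0;
        for (int attempt = 0; attempt <= maxAttempts; attempt++) {
            sent++; // the request reaches a node, which may act on it
            if (nodeSucceeds.test(attempt)) {
                break; // got a response, stop here
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        // The first node hangs, any later node answers.
        IntPredicate firstNodeHangs = attempt -> attempt > 0;

        System.out.println(send(0, firstNodeHangs)); // 1: request fails, no failover
        System.out.println(send(1, firstNodeHangs)); // 2: request is replayed on a second node
    }
}
```

With maxAttempts = 0 the request reaches at most one node, so a non-idempotent operation such as the transfer cannot be executed twice; the price is that the client sees the failure.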
Cheers
Jean-Frederic