[mod_cluster-dev] Handling crashed/hung AS nodes

Fri Mar 27 17:31:36 EDT 2009

>>> 2. Can httpd detect hung nodes?  A hung node will not affect the
>>> connected state of the AJP/HTTP/S connector - it could only detect this
>>> by sending data to the connector and timing out on the response.
>>
>> The hung node will be detected and marked as broken but the 
>> corresponding request(s) may be delayed or lost due to time-out.
>>
> 
> How long does this take, say in a typical case where the hung node was 
> up and running with a pool of AJP connections open? Is it the 10 secs, 
> the default value of the "ping" property listed at 
> https://www.jboss.org/mod_cluster/java/properties.html#proxy ?

cping/cpong is done in the Connector; I was thinking of nodeTimeout.

> 
> Also, if a request is being handled by a hung node and the 
> HAModClusterService tells httpd to stop that node, the request will 
> fail, yes? It shouldn't just fail over, as it may have already caused 
> the transfer of my $1,000,000 to my secret account at UBS. Failing over 
> would cause transfer of a second $1,000,000 and sadly I don't have that 
> much.

maxAttempts = 0 controls that.
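
To make the failover point concrete, here is a minimal sketch of the idea, not httpd's actual code. It assumes maxAttempts counts additional failover attempts after the first try; `dispatch`, `NodeTimeout` and the node callables are hypothetical names used only for illustration.

```python
class NodeTimeout(Exception):
    """Hypothetical error: the node did not answer within the timeout."""


def dispatch(request, nodes, max_attempts):
    """Send request to the first node, failing over to at most
    max_attempts further nodes when a node times out.

    Assumption: max_attempts = 0 means "never fail over", so a
    request that may already have had side effects is not replayed.
    """
    if not nodes:
        raise ValueError("no nodes available")
    last_error = None
    # First node plus at most max_attempts failover candidates.
    for node in nodes[: 1 + max_attempts]:
        try:
            return node(request)
        except NodeTimeout as err:
            last_error = err
    # Every permitted attempt timed out: report the failure to the client.
    raise last_error
```

With max_attempts = 0 a timeout on the hung node surfaces as an error, and the money transfer is never sent to a second node.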

> 
>>>
>>> And some questions for open discussion:
>>> What does HAModClusterService really buy us over the normal
>>> ModClusterService?  Do the benefits outweigh the complexity?
>>>  * Maintains a uniform view of proxy status across each AS node
>>>  * Can detect and send STOP-APP/REMOVE-APP messages on behalf of
>>> hung/crashed nodes (if httpd cannot already do this) (not yet
>>> implemented)
>>>    + Requires special handling of network partitions
>>>  * Potentially improve scalability by minimizing network traffic for
>>> very large clusters.
> 
> Assume a near-term goal is to run a 150 node cluster with say 10 httpd 
> servers. Assume the background thread runs every 10 seconds. That comes 
> to 150 connections per second across the cluster being opened/closed to 
> handle STATUS. Each httpd server handles 15 connections per second.

That is very little :-)

> 
> With HAModClusterService the way it is now, you get the same, because 
> besides STATUS each node also checks its ability to communicate w/ each 
> httpd in order to validate its ability to become master. But let's 
> assume we add some complexity to allow that health check to become much 
> more infrequent. So ignore those ping checks. So, w/ HAModClusterService 
> you get 1 connection/sec being opened/closed across the cluster for 
> status, 0.1 connection/sec per httpd.  But the STATUS request sent 
> across each connection has a much bigger payload.
> 
> How significant is the cost of opening/closing all those connections?

They are keep-alive HTTP connections, so the cost is only a send/receive 
on the socket.
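
For reference, the numbers quoted above can be reproduced with a few lines. This is only a sketch of the arithmetic, assuming every sender contacts every httpd once per background-thread interval; the function name is made up for illustration.

```python
def status_connection_rates(senders, httpd_servers, interval_secs):
    """Return (cluster-wide, per-httpd) STATUS connections per second.

    Assumption: each sender contacts every httpd server once per interval.
    """
    total = senders * httpd_servers / interval_secs
    per_httpd = total / httpd_servers
    return total, per_httpd


# Plain ModClusterService: all 150 nodes send STATUS to all 10 httpd servers.
plain = status_connection_rates(150, 10, 10)  # (150.0, 15.0)

# HAModClusterService: only the master sends STATUS (ping checks ignored).
ha = status_connection_rates(1, 10, 10)  # (1.0, 0.1)
```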

Cheers

Jean-Frederic