Handling crashed/hung AS nodes
by Paul Ferraro
Currently, the HAModClusterService (where httpd communication is
coordinated by an HA singleton) does not react to crashed/hung members.
Specifically, when the HA singleton gets a callback that the group
membership has changed, it does not send any REMOVE-APP messages to httpd
on behalf of the member that just left. Currently, httpd detects the
failure (via a disconnected socket) on its own and sets its internal
state accordingly, e.g. a STATUS message will return NOTOK.
The non-handling of dropped members is actually a good thing in the
event of a network partition, where communication between nodes is lost,
but communication between httpd and the nodes is unaffected. If we were
handling dropped members, we would have to handle the ugly scenario
described here:
https://jira.jboss.org/jira/browse/MODCLUSTER-66
Jean-Frederic: a few questions...
1. Is it up to the AS to drive the recovery of a NOTOK node when it
becomes functional again? In the case of a crashed member, fresh
CONFIG/ENABLE-APP messages will be sent upon node restart. In the case
of a re-merged network partition, no additional messages are sent. Is
the subsequent STATUS message (with a non-zero lbfactor) enough to
trigger the recovery of this node?
2. Can httpd detect hung nodes? A hung node will not affect the
connected state of the AJP/HTTP/S connector, so httpd could only detect
this by sending data to the connector and timing out on the response.
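
To make the distinction in question 2 concrete, here is a minimal sketch of a probe that separates a crashed node (connection refused or reset) from a hung one (connection accepted, but no response before the timeout). The function name, the timeout value, and the use of a plain HTTP OPTIONS request are assumptions for illustration only; this is not how httpd or mod_cluster actually probes a connector, and an AJP connector would need an AJP-level request instead.

```python
import socket


def probe_connector(host, port, timeout=2.0):
    """Classify a connector as 'ok', 'hung', or 'down' (hypothetical helper)."""
    try:
        # A crashed node typically refuses the connection outright.
        # A connect timeout is also reported as 'down' in this sketch.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"
    try:
        sock.settimeout(timeout)
        # Assumption: an HTTP connector; any cheap request would do here.
        sock.sendall(b"OPTIONS * HTTP/1.0\r\n\r\n")
        data = sock.recv(1)
        # An empty read means the peer closed without answering.
        return "ok" if data else "down"
    except socket.timeout:
        # The socket is connected, but nothing answered in time.
        return "hung"
    except OSError:
        return "down"
    finally:
        sock.close()
```

The point of the sketch is that the socket's connected state alone cannot reveal a hang; only a request with a read timeout can.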
And some questions for open discussion:
What does HAModClusterService really buy us over the normal
ModClusterService? Do the benefits outweigh the complexity?
* Maintains a uniform view of proxy status across each AS node
* Can detect and send STOP-APP/REMOVE-APP messages on behalf of
  hung/crashed nodes (if httpd cannot already do this) (not yet
  implemented)
  + Requires special handling of network partitions
* Potentially improves scalability by minimizing network traffic for
  very large clusters,
  e.g. non-masters ping httpd less often
* Anything else?
Paul
Problems with Beta4
by Brian Stansberry
Following is a list of issues I saw when playing with Beta4 on Windows.
Apologies if some of these are known issues / already fixed. I'll scan
JIRA now and open issues for any I don't see.
1) When an app is undeployed or the server is shut down, clients with an
existing session do not fail over. The following access_log excerpt shows
the issue. The last 404 occurs a couple of seconds after the REMOVE-APP,
so it doesn't seem to be a race.
> 192.168.2.3 - - [16/Mar/2009:16:07:48 +0100] "STOP-APP / HTTP/1.0" 200 -
> 127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 503 1086
> 127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 503 1086
> 127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 503 1086
> 127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 503 1086
> 127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 503 1086
> 192.168.2.3 - - [16/Mar/2009:16:07:48 +0100] "REMOVE-APP / HTTP/1.0" 200 -
> 192.168.2.3 - - [16/Mar/2009:16:07:48 +0100] "STATUS / HTTP/1.0" 200 59
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record?destroy=true HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 192.168.2.3 - - [16/Mar/2009:16:07:49 +0100] "STATUS / HTTP/1.0" 200 59
> 192.168.2.3 - - [16/Mar/2009:16:07:50 +0100] "STATUS / HTTP/1.0" 200 59
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record?destroy=true HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 200 21
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
> 127.0.0.1 - - [16/Mar/2009:16:07:51 +0100] "GET /load-demo/record HTTP/1.1" 404 999
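
For anyone wanting to check a longer access_log for the same symptom, the following is a rough sketch that tallies the status codes of client GET requests seen after the first REMOVE-APP. The regular expression and the function name are illustrative assumptions based only on the log format shown above; the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it.
LINE = re.compile(r'"(?P<method>[A-Z-]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def statuses_after_remove(lines):
    """Count GET status codes logged after the first REMOVE-APP."""
    seen_remove = False
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue
        if match["method"] == "REMOVE-APP":
            seen_remove = True
        elif seen_remove and match["method"] == "GET":
            counts[match["status"]] += 1
    return counts


sample = [
    '192.168.2.3 - - [16/Mar/2009:16:07:48 +0100] "STOP-APP / HTTP/1.0" 200 -',
    '127.0.0.1 - - [16/Mar/2009:16:07:48 +0100] "GET /load-demo/record HTTP/1.1" 503 1086',
    '192.168.2.3 - - [16/Mar/2009:16:07:48 +0100] "REMOVE-APP / HTTP/1.0" 200 -',
    '192.168.2.3 - - [16/Mar/2009:16:07:48 +0100] "STATUS / HTTP/1.0" 200 59',
    '127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 404 999',
    '127.0.0.1 - - [16/Mar/2009:16:07:49 +0100] "GET /load-demo/record HTTP/1.1" 200 21',
]
print(statuses_after_remove(sample))
```

Any 404 in the result after the REMOVE-APP means a request was still routed to a node that no longer serves the app, which is the failover problem described in item 1.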
2) When you run with HAModClusterService, every 10 seconds there is
logging about a DRM replicantsChanged event and a new HASingletonMaster
election. (The election just picks the existing master.) That means the
DRM is being updated even when nothing has changed, which shouldn't happen.
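
As an illustration of the behavior expected in item 2, here is a small sketch of a registry that notifies listeners only when the set of replicants for a key actually changes. The class and method names are hypothetical and are not the DistributedReplicantManager API; the sketch only shows the "compare before notifying" idea.

```python
class ReplicantRegistry:
    """Toy stand-in for a replicant registry (hypothetical, not the DRM)."""

    def __init__(self):
        self._replicants = {}
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def update(self, key, members):
        """Store the member set; notify listeners only on a real change."""
        new_members = frozenset(members)
        if self._replicants.get(key) == new_members:
            return False  # nothing changed, so no event and no re-election
        self._replicants[key] = new_members
        for callback in self._listeners:
            callback(key, new_members)
        return True
```

With a guard like this, a periodic update that carries the same membership would not produce a replicantsChanged-style event or trigger a new master election.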
3) To get advertise to work, I had to add an AdvertiseGroup
224.0.1.105:23364 directive to httpd.conf. The docs on jboss.org imply
that shouldn't be necessary since the value is just the default.
4) The mod_cluster-manager status page always reports Transfered: 0,
Connected: 0, Load: 0, Num sessions: 0 for all nodes; it never reports
actual data. Also, "Transfered" should be "Transferred".
5) The mod_cluster-manager status page "SessionIDs" section lists
session ids, which is a security violation. Jean-Frederic, you mentioned
you wanted to remove this. In case you haven't, I tried to disable it by
setting Maxsessionid 0 in httpd.conf, but that had no effect.
6) Playing with the demo's "Server Load Control" tab I tried to use the
"Heap Memory Use" control. I couldn't get this to have any effect on
load balancing.
a) The servlet isn't multiplying the duration value by 1000 to convert
seconds to ms. I'll fix this in just a sec after I send this.
b) But even after adjusting for this, I couldn't get any load balancing
effect by using "Heap Memory Use". Looking at the process in Task
Manager, it seemed the servlet was increasing heap usage. So I'm
concerned there is an issue with the load metric.
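
The unit bug in 6a can be illustrated with a tiny sketch. The helper name is hypothetical and the actual demo servlet is Java; the only point is that a duration entered in seconds must be scaled by 1000 before being handed to code that expects milliseconds.

```python
def hold_duration_ms(duration_seconds):
    """Convert a duration entered in seconds to milliseconds."""
    return int(duration_seconds) * 1000


# Without the conversion, a requested 10-second hold would be treated as
# 10 ms, which is likely too short to register in any load metric.
requested_seconds = 10
print(hold_duration_ms(requested_seconds))
```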
7) Go into jmx-console, jboss.web:service=ModClusterService, invoke the
"disable" operation. Node logs this in server.log:
2009-03-16 17:15:58,765 ERROR [org.jboss.modcluster.mcmp.impl.DefaultMCMPHandler] (http-192.168.2.2-8080-2) Error [null: null: {4}] sending command DUMP to proxy 192.168.2.3:6666, configuration will be reset
2009-03-16 17:16:55,250 ERROR [org.jboss.modcluster.mcmp.impl.DefaultMCMPHandler] (http-192.168.2.2-8080-2) Error [null: null: {4}] sending command DISABLE-APP to proxy 192.168.2.3:6666, configuration will be reset
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com