[
https://issues.jboss.org/browse/MODCLUSTER-407?page=com.atlassian.jira.pl...
]
Aaron Ogburn commented on MODCLUSTER-407:
-----------------------------------------
Fortunately, there are only two places in mod_proxy_cluster that call apr_sleep:
proxy_cluster_watchdog_func and find_best_worker.
proxy_cluster_watchdog_func isn't relevant, since it is a periodic background function and
not request related. find_best_worker, on the other hand, is indeed called by pre_request.
It is also recursive, so it appears to be the only function that meets all three criteria
for a culprit: it is recursive, it is called by pre_request, and it calls apr_sleep.
In addition, the problem spot is reached only when nodes are in an error state (which is
the case when killing nodes), so find_best_worker certainly looks like the culprit.
For further clarification, I added debug messages to find_best_worker that log when it
starts and ends, and also when it starts and ends a recursive loop. After the request load
is stopped, the debug log shows threads continuing to start new recursive loops long after
incoming requests have stopped. Relying on balancer->timeout alone to decide whether to
recurse looks like bad logic, since multiple threads can get through that check at once.
When one thread finishes a recursive loop, it sets balancer->timeout back, which can cause
other threads to make another recursive loop. Once a few threads are in recursive loops
like that, they can keep each other stuck by continually resetting balancer->timeout and
tripping one another back into the recursive loop.
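A minimal sketch of one way to avoid that kind of mutual re-triggering, assuming the wait
can be bounded by a per-call deadline kept on the stack instead of by clearing and
restoring a shared field. This is not the actual mod_proxy_cluster code or its fix: the
sketch_* types, scan_once, find_worker_bounded and the fake_* clock helpers are all
hypothetical names, and the clock and sleep are injected so the example does not depend on
APR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real mod_proxy / mod_proxy_cluster types. */
typedef long long usec_t;

typedef struct {
    int usable;                 /* nonzero when the worker can take a request */
} sketch_worker;

typedef struct {
    sketch_worker *workers;
    size_t         nworkers;
    usec_t         timeout;     /* shared; only read here, never reset */
} sketch_balancer;

/* Injected clock and sleep, so the sketch has no APR dependency. */
typedef usec_t (*now_fn)(void *ctx);
typedef void   (*sleep_fn)(void *ctx, usec_t usec);

/* One pass over the workers: first usable one, or NULL. */
static sketch_worker *scan_once(const sketch_balancer *b)
{
    for (size_t i = 0; i < b->nworkers; i++) {
        if (b->workers[i].usable) {
            return &b->workers[i];
        }
    }
    return NULL;
}

/*
 * Iterative "scan, sleep, scan again" bounded by a deadline that is local
 * to this call. No thread writes to b->timeout, so one thread finishing
 * its wait cannot send another thread back into a new wait.
 */
static sketch_worker *find_worker_bounded(const sketch_balancer *b, usec_t step,
                                          now_fn now, sleep_fn nap, void *ctx,
                                          int *attempts)
{
    assert(b != NULL && step > 0);

    usec_t deadline = now(ctx) + b->timeout;   /* per call, on the stack */
    int n = 0;

    for (;;) {
        sketch_worker *w = scan_once(b);
        n++;
        if (w != NULL || now(ctx) >= deadline) {
            if (attempts != NULL) {
                *attempts = n;
            }
            return w;                          /* NULL means: gave up */
        }
        nap(ctx, step);
    }
}

/* Fake clock for demonstration: ctx points at a usec_t counter. */
static usec_t fake_now(void *ctx)
{
    return *(usec_t *)ctx;
}

static void fake_sleep(void *ctx, usec_t usec)
{
    *(usec_t *)ctx += usec;
}
```

With two workers both down, a timeout of 1000 and a step of 250, the call gives up once the
fake clock reaches the deadline, and the stack depth stays constant because nothing
recurses. Whether the same approach fits the real find_best_worker depends on details of
that function that are not shown in this report.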
worker-timeout can cause httpd thread stalls
--------------------------------------------
Key: MODCLUSTER-407
URL: https://issues.jboss.org/browse/MODCLUSTER-407
Project: mod_cluster
Issue Type: Bug
Affects Versions: 1.2.8.Final
Reporter: Aaron Ogburn
Assignee: Jean-Frederic Clere
Setting a mod_cluster worker-timeout can stall requests and threads on the httpd side when
requests are received while workers are in a down state. A stack trace of a problem thread
looks like the following (recursive loops through mod_proxy_cluster from #160 to #2):
#0 0x00007ff8eb547533 in select () from /lib64/libc.so.6
#1 0x00007ff8eba39185 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x00007ff8e84be0d1 in ?? () from /etc/httpd/modules/mod_proxy_cluster.so
...
#160 0x00007ff8e84beb9f in ?? () from /etc/httpd/modules/mod_proxy_cluster.so
#161 0x00007ff8e88d2116 in proxy_run_pre_request () from /etc/httpd/modules/mod_proxy.so
#162 0x00007ff8e88d9186 in ap_proxy_pre_request () from /etc/httpd/modules/mod_proxy.so
#163 0x00007ff8e88d63c2 in ?? () from /etc/httpd/modules/mod_proxy.so
--
This message was sent by Atlassian JIRA
(v6.2.3#6260)