[mod_cluster-issues] [JBoss JIRA] (MODCLUSTER-407) worker-timeout can cause httpd thread stalls

Wed May 14 16:51:56 EDT 2014

    [ https://issues.jboss.org/browse/MODCLUSTER-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967979#comment-12967979 ] 

Aaron Ogburn commented on MODCLUSTER-407:
-----------------------------------------

Fortunately, there are only two places in mod_proxy_cluster that call apr_sleep: proxy_cluster_watchdog_func and find_best_worker.

proxy_cluster_watchdog_func isn't relevant since it is a background/periodic function and not request related.  find_best_worker, on the other hand, is indeed called by pre_request.  It is also recursive, so it seems to be the only function that meets all the criteria for a culprit: it is recursive, it is called by pre_request, and it calls apr_sleep.  In addition, the problem spot is only reached when nodes are in an error state (which we have when killing nodes), so find_best_worker certainly looks like the culprit.

To confirm this, I added debug messages to find_best_worker indicating when it starts and ends, and also when it starts and ends a recursive loop.  After stopping the request load, debug logging shows that threads keep starting new recursive loops long after incoming requests have stopped.  It looks like relying on balancer->timeout alone to decide whether or not to recurse is bad logic, since multiple threads can get through that check at once.  Once one thread finishes a recursive loop, it sets balancer->timeout back, which can cause other threads to make another recursive loop.  Once a few threads get into recursive loops like that, they can keep each other stuck by continually resetting balancer->timeout and tripping one another back into the recursive loop.
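
As a rough illustration of that interaction, here is a minimal single-threaded sketch.  It is not the mod_proxy_cluster source: balancer_t, STEPS, other_thread_restores, and the run_* helpers are hypothetical names made up for this example, and the "other thread" is simulated by a function that restores the shared timeout a fixed number of times.  The first function uses the shared timeout as its recursion guard, as described above; the second passes the guard as an argument instead, which is just one possible alternative and not necessarily the fix that will be chosen.

```c
#define STEPS 3                      /* retries per wait loop; illustrative value */

/* Hypothetical stand-in for the shared balancer, not the real mod_proxy struct. */
typedef struct { long timeout; } balancer_t;

static int deepest;                  /* deepest nesting seen in the current run */
static int restores_left;            /* how often the "other thread" interferes */

/* Models another thread leaving its wait loop and restoring the shared timeout. */
static void other_thread_restores(balancer_t *b, long value)
{
    if (restores_left > 0) {
        restores_left--;
        b->timeout = value;
    }
}

/* Pattern described above: the shared timeout doubles as the recursion guard. */
static void find_shared_guard(balancer_t *b, int depth)
{
    if (depth > deepest) deepest = depth;
    if (b->timeout) {                /* no worker found and waiting looks allowed */
        long saved = b->timeout;
        b->timeout = 0;              /* intended to stop nested calls from waiting */
        for (int i = 0; i < STEPS; i++) {
            /* the real code would sleep here (apr_sleep) before retrying */
            other_thread_restores(b, saved);   /* guard is undone behind our back */
            find_shared_guard(b, depth + 1);
        }
        b->timeout = saved;          /* this restore can re-arm other callers too */
    }
}

/* Alternative sketch: the guard is an argument, so nobody else can reset it. */
static void find_local_guard(balancer_t *b, int may_wait, int depth)
{
    if (depth > deepest) deepest = depth;
    if (may_wait && b->timeout) {
        for (int i = 0; i < STEPS; i++) {
            other_thread_restores(b, b->timeout);  /* has no effect on the guard */
            find_local_guard(b, 0, depth + 1);
        }
    }
}

/* Returns the deepest nesting reached with the shared guard. */
int run_shared_guard(int restores)
{
    balancer_t b = { 1000 };
    deepest = 0;
    restores_left = restores;
    find_shared_guard(&b, 1);
    return deepest;
}

/* Returns the deepest nesting reached with the per-call guard. */
int run_local_guard(int restores)
{
    balancer_t b = { 1000 };
    deepest = 0;
    restores_left = restores;
    find_local_guard(&b, 1, 1);
    return deepest;
}
```

In this model, run_shared_guard(0) reaches a depth of 2 (one wait loop plus one nested call that returns immediately), while run_shared_guard(5) reaches a depth of 7: every simulated restore adds another nested wait loop.  run_local_guard(5) stays at a depth of 2 no matter how often the shared value is restored.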

> worker-timeout can cause httpd thread stalls
> --------------------------------------------
>
>                 Key: MODCLUSTER-407
>                 URL: https://issues.jboss.org/browse/MODCLUSTER-407
>             Project: mod_cluster
>          Issue Type: Bug
>    Affects Versions: 1.2.8.Final
>            Reporter: Aaron Ogburn
>            Assignee: Jean-Frederic Clere
>
> Setting a modcluster worker-timeout can stall requests and threads on the httpd side when requests are received while workers are in a down state.  A stack trace of the problem thread looks like the following (recursive calls through mod_proxy_cluster from #160 to #2):
> #0  0x00007ff8eb547533 in select () from /lib64/libc.so.6
> #1  0x00007ff8eba39185 in apr_sleep () from /usr/lib64/libapr-1.so.0
> #2  0x00007ff8e84be0d1 in ?? () from /etc/httpd/modules/mod_proxy_cluster.so
> ...
> #160 0x00007ff8e84beb9f in ?? () from /etc/httpd/modules/mod_proxy_cluster.so
> #161 0x00007ff8e88d2116 in proxy_run_pre_request () from /etc/httpd/modules/mod_proxy.so
> #162 0x00007ff8e88d9186 in ap_proxy_pre_request () from /etc/httpd/modules/mod_proxy.so
> #163 0x00007ff8e88d63c2 in ?? () from /etc/httpd/modules/mod_proxy.so

--
This message was sent by Atlassian JIRA
(v6.2.3#6260)