RH Bugzilla Integration updated MODCLUSTER-398:
-----------------------------------------------
Bugzilla References:
mod_cluster deadlock in a jboss/windows environment
---------------------------------------------------
Key: MODCLUSTER-398
URL: https://issues.jboss.org/browse/MODCLUSTER-398
Project: mod_cluster
Issue Type: Bug
Security Level: Public (Everyone can see)
Affects Versions: 1.2.6.Final
Environment: Windows 2008, EAP6 and EWS2.0.1
Reporter: Marc Maurer
Assignee: Jean-Frederic Clere
Fix For: 1.3.1.Final, 1.2.9.Final
Under load, Apache stops serving pages, with all threads stuck in the "W : Sending
reply" state. Using Windows Process Explorer, we then got a stack trace from a
hanging thread. We don't have debug symbols, but it's easy enough to see
what's happening:
ntoskrnl.exe!KeWaitForMultipleObjects+0xc0a
ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x732
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!NtDeleteFile+0x3c4
ntoskrnl.exe!PsDereferenceKernelStack+0x35358
ntoskrnl.exe!KeSynchronizeExecution+0x3a23
ntdll.dll!ZwLockFile+0xa
KERNELBASE.dll!LockFileEx+0xb2
kernel32.dll!LockFileEx+0x1b
libapr-1.dll!apr_file_lock+0x69 <-- here
mod_slotmem.so+0x1318 <-- here
mod_manager.so+0x2a11 <-- here
mod_proxy_cluster.so+0x679e
mod_proxy.so!proxy_run_post_request+0x4e
mod_proxy.so!proxy_run_request_status+0x924
libhttpd.dll!ap_run_handler+0x35
libhttpd.dll!ap_invoke_handler+0x114
libhttpd.dll!ap_die+0x2ea
libhttpd.dll!ap_psignature+0x1ae8
libhttpd.dll!ap_run_process_connection+0x35
libhttpd.dll!ap_process_connection+0x3b
libhttpd.dll!ap_regkey_value_remove+0x136e
msvcrt.dll!srand+0x93
msvcrt.dll!ftime64_s+0x1dd
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21
So mod_manager is requesting a file lock on one of the lock files in the MemManagerFile
path. In this case it was the "manager.sessionid.sessionid.lock" file. Removing
the lock file fixed the problem.
After bisecting the mod_cluster code, I think commit
"74eeb9c026380deb8d833be53b09b3d808e02d10 - Lock in insert-update" in version
1.2.2 is the culprit. This would also explain why mod_cluster 1.2.1 is the last known
working version.
What we don't know is which process already holds the lock when all the Apache
threads start blocking on it. We are trying to figure that out. There are no obviously
mismatched slotmem lock/unlock call pairs in the mod_manager module, and as far as we can
see no lock is requested while another lock is held. Therefore our best guess is a
deadlock involving a thread that already holds the globalmutex_lock in combination with
the slotmem file locks, but that's just a guess without debugging it.
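
For illustration only, here is a minimal Python sketch of the kind of lock-order
inversion guessed at above. It is not mod_cluster code: the two threading.Lock objects
are hypothetical stand-ins for the global mutex and a slotmem file lock, and the
function and worker names are made up for the example. Each worker takes the two locks
in the opposite order, and a timeout is used so that the demo reports the stall instead
of hanging the way the Apache threads do.

```python
import threading


def run_inversion_demo(timeout=0.2):
    """Run two workers that take the same two locks in opposite order.

    Returns {worker_name: True/False}, where the value says whether the
    worker managed to acquire its second lock before the timeout.
    """
    # Hypothetical stand-ins, not the real mod_cluster primitives.
    global_mutex = threading.Lock()    # plays the role of the global mutex
    slot_file_lock = threading.Lock()  # plays the role of a slotmem file lock
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until both workers hold their first lock.
            barrier.wait()
            # With a blocking acquire this is where both would hang forever.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep the first lock held until both attempts have finished,
            # so the outcome does not depend on thread timing.
            barrier.wait()
            if got:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("A", global_mutex, slot_file_lock)),
        threading.Thread(target=worker, args=("B", slot_file_lock, global_mutex)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_inversion_demo())
```

In this sketch neither worker gets its second lock, because each is waiting for the lock
the other one holds. Whether this is what actually happens between the globalmutex_lock
and the slotmem file locks is exactly the part we have not been able to confirm.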
More context can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1080047