]
Osamu Nagano commented on JGRP-2086:
------------------------------------
This test scenario is to kill a set of servers, not a crash of a host. So FD_HOST is not
triggered. I can't reproduce it too but the user is still reporting it several times.
What kind of info should be collected to investigate this issue? Maybe we can ask them for
it.
FD_SOCK is keep trying to create a new socket to the killed server
------------------------------------------------------------------
Key: JGRP-2086
URL:
https://issues.jboss.org/browse/JGRP-2086
Project: JGroups
Issue Type: Bug
Affects Versions: 3.6.3
Environment: JDG 6.6.0 (jgroups-3.6.3.Final-redhat-4.jar)
Reporter: Osamu Nagano
Assignee: Bela Ban
Fix For: 3.6.11, 4.0
In most cases FD_SOCK can detect a killed server immediately. But for unknown reason,
FD_SOCK is keep trying to create a new socket to the killed server. As a consequence,
installing a new cluster view is delayed until FD_ALL is triggered.
m04_n007_server.log is showing the behaviour. There is 28 nodes (4 machines (m03, ...,
m06) and 7 nodes (n001, ..., n007) on each) and all nodes on m03 are killed at the same
time on 15:07:34,543. FD_SOCK is keep trying to connect to a killed node saying
"socket address for m03_n001/clustered could not be fetched, retrying".
{noformat}
[n007] 15:07:39,543 TRACE [org.jgroups.protocols.FD_SOCK] (Timer-8,shared=udp)
m04_n007/clustered: broadcasting SUSPECT message (suspected_mbrs=[m03_n005/clustered,
m03_n007/clustered])
[n007] 15:07:39,544 TRACE [org.jgroups.protocols.FD_SOCK] (INT-20,shared=udp)
m04_n007/clustered: received SUSPECT message from m04_n007/clustered:
suspects=[m03_n005/clustered, m03_n007/clustered]
[n007] 15:07:39,546 TRACE [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: socket address for m03_n001/clustered could
not be fetched, retrying
[n007] 15:07:40,546 DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: ping_dest is m03_n001/clustered,
pingable_mbrs=[m03_n001/clustered, m03_n002/clustered, m03_n003/clustered,
m03_n004/clustered, m03_n006/clustered, m06_n001/clustered, m06_n002/clustered,
m06_n003/clustered, m06_n004/clustered, m06_n005/clustered, m06_n006/clustered,
m06_n007/clustered, m05_n001/clustered, m05_n002/clustered, m05_n003/clustered,
m05_n004/clustered, m05_n005/clustered, m05_n006/clustered, m05_n007/clustered,
m04_n001/clustered, m04_n002/clustered, m04_n003/clustered, m04_n004/clustered,
m04_n005/clustered, m04_n006/clustered, m04_n007/clustered]
[n007] 15:07:41,546 TRACE [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: socket address for m03_n001/clustered could
not be fetched, retrying
[n007] 15:07:42,546 DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: ping_dest is m03_n001/clustered,
pingable_mbrs=[m03_n001/clustered, m03_n002/clustered, m03_n003/clustered,
m03_n004/clustered, m03_n006/clustered, m06_n001/clustered, m06_n002/clustered,
m06_n003/clustered, m06_n004/clustered, m06_n005/clustered, m06_n006/clustered,
m06_n007/clustered, m05_n001/clustered, m05_n002/clustered, m05_n003/clustered,
m05_n004/clustered, m05_n005/clustered, m05_n006/clustered, m05_n007/clustered,
m04_n001/clustered, m04_n002/clustered, m04_n003/clustered, m04_n004/clustered,
m04_n005/clustered, m04_n006/clustered, m04_n007/clustered]
[n007] 15:07:43,547 TRACE [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: socket address for m03_n001/clustered could
not be fetched, retrying
...
[n007] 15:10:53,700 DEBUG [org.jgroups.protocols.FD_ALL] (Timer-26,shared=udp)
haven't received a heartbeat from m03_n005/clustered for 200059 ms, adding it to
suspect list
{noformat}
From the TRACE log, you can find an address cache of FD_SOCK has only 23 members.
{noformat}
[n007] 14:40:50,471 TRACE [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: got cache from m03_n005/clustered: cache is
{
m04_n006/clustered=172.20.66.34:9945,
m05_n005/clustered=172.20.66.35:9938,
m06_n004/clustered=172.20.66.36:9931,
m03_n007/clustered=172.20.66.33:9952,
m05_n001/clustered=172.20.66.35:9910,
m06_n005/clustered=172.20.66.36:9938,
m05_n006/clustered=172.20.66.35:9945,
m03_n005/clustered=172.20.66.33:9938,
m05_n004/clustered=172.20.66.35:9931,
m04_n003/clustered=172.20.66.34:9924,
m04_n007/clustered=172.20.66.34:9952,
m05_n002/clustered=172.20.66.35:9917,
m05_n003/clustered=172.20.66.35:9924,
m04_n004/clustered=172.20.66.34:9931,
m06_n001/clustered=172.20.66.36:9910,
m06_n007/clustered=172.20.66.36:9952,
m04_n005/clustered=172.20.66.34:9938,
m04_n001/clustered=172.20.66.34:9910,
m05_n007/clustered=172.20.66.35:9952,
m06_n002/clustered=172.20.66.36:9917,
m06_n006/clustered=172.20.66.36:9945,
m04_n002/clustered=172.20.66.34:9917,
m06_n003/clustered=172.20.66.36:9924}
{noformat}
While pingable_mbrs has all 28 members which is from the current available cluster view.
{noformat}
[n007] 14:40:50,472 DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK
pinger,m04_n007/clustered) m04_n007/clustered: ping_dest is m03_n005/clustered,
pingable_mbrs=[
m03_n005/clustered,
m03_n007/clustered,
m03_n001/clustered,
m03_n002/clustered,
m03_n003/clustered,
m03_n004/clustered,
m03_n006/clustered,
m06_n001/clustered,
m06_n002/clustered,
m06_n003/clustered,
m06_n004/clustered,
m06_n005/clustered,
m06_n006/clustered,
m06_n007/clustered,
m05_n001/clustered,
m05_n002/clustered,
m05_n003/clustered,
m05_n004/clustered,
m05_n005/clustered,
m05_n006/clustered,
m05_n007/clustered,
m04_n001/clustered,
m04_n002/clustered,
m04_n003/clustered,
m04_n004/clustered,
m04_n005/clustered,
m04_n006/clustered,
m04_n007/clustered]
{noformat}