]
Bela Ban commented on JGRP-1532:
--------------------------------
Hard to say what's going on; I'm not familiar with NIC teaming.
I've used IP bonding (on Linux) without any issues before, in both load-balancing and
failover mode.
A few questions about your NIC teaming setup:
- Does it provide a single virtual address (10.120.180.64)?
- Is 10.120.180.64 the sender of all packets?
- Are multicasts to 228.8.8.8 load balanced across the 2 NICs?
- Why am I seeing traffic from 10.120.180.64, 10.120.120.64 and 10.120.220.64? Are these
the virtual addresses of the 3 nodes in the cluster?
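One way to help answer the address questions is to look at what the JVM itself reports for the teamed host. The following is only a minimal sketch using the standard java.net API; the class name ListInterfaces is hypothetical and not part of JGroups. It prints each interface with its state, multicast support and addresses, so the output can be compared against the bind_addr and bind_interface values used in the configuration.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of JGroups: shows what the JVM sees for each interface.
class ListInterfaces {

    // One line per interface: name, display name, state, multicast support, addresses.
    static String describe(NetworkInterface nic) {
        StringBuilder sb = new StringBuilder(nic.getName());
        sb.append(" (").append(nic.getDisplayName()).append(")");
        try {
            sb.append(" up=").append(nic.isUp());
            sb.append(" multicast=").append(nic.supportsMulticast());
        } catch (SocketException e) {
            sb.append(" state=unknown");
        }
        sb.append(" addrs=");
        for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
            sb.append(addr.getHostAddress()).append(' ');
        }
        return sb.toString().trim();
    }

    // Returns an empty list if the interfaces cannot be enumerated.
    static List<String> listAll() {
        List<String> lines = new ArrayList<String>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics != null) {
                for (NetworkInterface nic : Collections.list(nics)) {
                    lines.add(describe(nic));
                }
            }
        } catch (SocketException e) {
            // Enumeration failed: return whatever was collected so far.
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : listAll()) {
            System.out.println(line);
        }
    }
}
```

If the virtual address 10.120.180.64 shows up on more than one interface, that would be worth including in the report.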
Don't receive heartbeat in Nic Teaming configuration after NIC
switch
---------------------------------------------------------------------
Key: JGRP-1532
URL:
https://issues.jboss.org/browse/JGRP-1532
Project: JGroups
Issue Type: Bug
Affects Versions: 2.12.2
Environment: Windows Server Standard 2008 SP2.
Two Broadcom BCM5709S NetXtreme II (dual-port) network cards with NIC teaming software
(BACS3 version 12.2.9.0, Broadcom Advanced Control Suite 3)
Reporter: PASCAL BROUWET
Assignee: Bela Ban
Fix For: 3.3
We have no problems in a single-card configuration without NIC teaming.
But on all machines with dual cards and NIC teaming activated, we get
"haven't received a heartbeat" failures.
With the Wireshark analyzer, we observed that as long as the heartbeat multicast packets stay
on the same card, there is no problem, but if the heartbeat multicast packets switch to the
second card, failure detections appear in the log files.
For example, the heartbeat failures in the logs start at 03:41:25 and last until 05:03:20:
2012-10-23 03:41:25.234 [FINE] - FD_ALL: haven't received a heartbeat from
ctc809091084-27510(5ae571864ef0) for 11061 ms, adding it to suspect list
2012-10-23 03:41:25.234 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0),
ctc804291084-11401(de9a6a421087)]
2012-10-23 03:41:28.245 [FINE] - FD_ALL: haven't received a heartbeat from
ctc809091084-27510(5ae571864ef0) for 14072 ms, adding it to suspect list
2012-10-23 03:41:28.245 [FINE] - FD_ALL: haven't received a heartbeat from
ctc804291084-11401(de9a6a421087) for 12044 ms, adding it to suspect list
2012-10-23 03:41:28.245 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0),
ctc804291084-11401(de9a6a421087)]
2012-10-23 03:41:31.255 [FINE] - FD_ALL: haven't received a heartbeat from
ctc809091084-27510(5ae571864ef0) for 17082 ms, adding it to suspect list
2012-10-23 03:41:31.255 [FINE] - FD_ALL: haven't received a heartbeat from
ctc804291084-11401(de9a6a421087) for 15054 ms, adding it to suspect list
2012-10-23 03:41:31.255 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0),
ctc804291084-11401(de9a6a421087)]
2012-10-23 03:41:34.266 [FINE] - FD_ALL: haven't received a heartbeat from
ctc809091084-27510(5ae571864ef0) for 20093 ms, adding it to suspect list
2012-10-23 03:41:34.266 [FINE] - FD_ALL: haven't received a heartbeat from
ctc804291084-11401(de9a6a421087) for 18065 ms, adding it to suspect list
2012-10-23 03:41:34.266 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0),
ctc804291084-11401(de9a6a421087)]
2012-10-23 03:41:37.277 [FINE] - FD_ALL: haven't received a heartbeat from
ctc809091084-27510(5ae571864ef0) for 23104 ms, adding it to suspect list
2012-10-23 03:41:37.277 [FINE] - FD_ALL: haven't received a heartbeat from
ctc804291084-11401(de9a6a421087) for 21076 ms, adding it to suspect list
2012-10-23 03:41:37.277 [FINE] - FD_ALL: suspecting [ctc809091084-27510(5ae571864ef0),
ctc804291084-11401(de9a6a421087)]
2012-10-23 03:41:40.288 [FINE] - FD_ALL: haven't received a heartbeat from
ctc809091084-27510(5ae571864ef0) for 26115 ms, adding it to suspect list
2012-10-23 03:41:40.288 [FINE] - FD_ALL: haven't received a heartbeat from
ctc804291084-11401(de9a6a421087) for 24087 ms, adding it to suspect list
...
The logs of Card 1 during the period:
----------------------------------------------------
2012-10-23 03:41:15.563 MULTICAST id=321 src=/10.120.180.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 03:41:15.996 MULTICAST id=7481 src=/10.120.120.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=17da3e81-158b-4440-50c7-412aebce41e2(de9a6a421087) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 04:25:49.221 MULTICAST id=2868 src=/10.120.180.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
This card was in standby between 03:41:15 and 04:25:49.
The logs of Card 0 during the period:
----------------------------------------------------
2012-10-23 03:41:25.029 MULTICAST id=74b1 src=/10.120.120.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=17da3e81-158b-4440-50c7-412aebce41e2(de9a6a421087) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 03:41:25.961 MULTICAST id=5adb src=/10.120.220.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 03:41:26.874 MULTICAST id=5ae0 src=/10.120.220.64:45588 dest=/228.8.8.8:45588
(91 bytes)
Msg1 src=f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0) dest=ALL
flags=[OOB]
headers=[
PingHeader:[PING: type=GET_MBRS_REQ, cluster=REPL,
view_id=[f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0)|2]]
]
----------------------------------------------------
2012-10-23 03:41:27.607 MULTICAST id=362 src=/10.120.180.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 03:41:28.040 MULTICAST id=74bf src=/10.120.120.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=17da3e81-158b-4440-50c7-412aebce41e2(de9a6a421087) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 03:41:28.962 MULTICAST id=5ae8 src=/10.120.220.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=f1e9fdac-6d36-d321-6f9d-ec0cbf771608(5ae571864ef0) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
----------------------------------------------------
2012-10-23 03:41:30.617 MULTICAST id=36f src=/10.120.180.64:45588 dest=/228.8.8.8:45588
(47 bytes)
Msg1 src=cc74a22f-6e18-1b7a-5521-3abebdd47ab6(3ba17876e725) dest=ALL
flags=[OOB]
headers=[
HeartbeatHeader:heartbeat
]
Etc.: heartbeats were received every 3 seconds until 06:00.
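The standby window on Card 1 can be read directly off the capture timestamps. As a rough illustration, the sketch below computes the gaps between consecutive heartbeat timestamps from one sender and reports those above a threshold. The class name HeartbeatGaps and the 10-second threshold are assumptions chosen for the example; the threshold is not the configured FD_ALL timeout.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reading the capture: finds long gaps between heartbeats.
class HeartbeatGaps {
    // Timestamp format used in the capture logs above.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Returns the gaps (in ms) between consecutive timestamps that exceed thresholdMs.
    static List<Long> gapsOver(List<String> timestamps, long thresholdMs) {
        List<Long> gaps = new ArrayList<Long>();
        for (int i = 1; i < timestamps.size(); i++) {
            LocalDateTime prev = LocalDateTime.parse(timestamps.get(i - 1), FMT);
            LocalDateTime cur = LocalDateTime.parse(timestamps.get(i), FMT);
            long ms = Duration.between(prev, cur).toMillis();
            if (ms > thresholdMs) {
                gaps.add(ms);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Heartbeats from 10.120.180.64 as seen on Card 1 in the capture above.
        List<String> card1 = Arrays.asList(
                "2012-10-23 03:41:15.563",
                "2012-10-23 04:25:49.221");
        // Heartbeats from the same sender as seen on Card 0.
        List<String> card0 = Arrays.asList(
                "2012-10-23 03:41:27.607",
                "2012-10-23 03:41:30.617");
        System.out.println("card 1 gaps: " + gapsOver(card1, 10000L));
        System.out.println("card 0 gaps: " + gapsOver(card0, 10000L));
    }
}
```

For the Card 1 timestamps this reports a single gap of 2673658 ms (about 44.5 minutes), while the two Card 0 heartbeats are 3010 ms apart and produce no gap.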
The two cards have been configured with the same IP address (10.120.180.64), which is also
the address of the virtual NIC.
We tested this configuration with Mcast.exe without problems.
Everything behaves as if JGroups (or Java) were "plugged" into card 1 only.
JGroups was configured with the following parameters:
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="urn:org:jgroups">
<UDP bind_addr="10.120.180.64" bind_interface="eth10"
bind_port="7800" diagnostics_addr="224.0.75.75"
discard_incompatible_packets="true" enable_bundling="true"
enable_diagnostics="true" ip_ttl="10" loopback="true"
max_bundle_size="64K" max_bundle_timeout="30"
mcast_group_addr="228.8.8.8" mcast_port="45588"
mcast_recv_buf_size="25M" mcast_send_buf_size="640K"
oob_thread_pool.enabled="true" oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.max_threads="8" oob_thread_pool.min_threads="1"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="Run" singleton_name="UDP"
thread_naming_pattern="pl" thread_pool.enabled="true"
thread_pool.keep_alive_time="5000" thread_pool.max_threads="8"
thread_pool.min_threads="2" thread_pool.queue_enabled="false"
thread_pool.queue_max_size="100" thread_pool.rejection_policy="Run"
tos="8" ucast_recv_buf_size="20M"
ucast_send_buf_size="640K"/>
<PING num_initial_members="3" timeout="2000"/>
<MERGE2 max_interval="30000" min_interval="10000"/>
<FD_SOCK bind_addr="10.120.180.64" bind_interface="eth10"/>
<FD_ALL/>
<VERIFY_SUSPECT bind_addr="10.120.180.64" bind_interface="eth10"
timeout="1500"/>
<pbcast.NAKACK discard_delivered_msgs="false"
exponential_backoff="150" gc_lag="0"
retransmit_timeout="300,600,1200" use_mcast_xmit="true"
use_stats_for_retransmission="false"/>
<UNICAST timeout="300,600,1200"/>
<pbcast.STABLE desired_avg_gossip="50000" max_bytes="4M"
stability_delay="1000"/>
<pbcast.GMS join_timeout="5000" print_local_addr="true"
view_bundling="true"/>
<UFC max_credits="2M" min_threshold="0.4"/>
<MFC max_credits="2M" min_threshold="0.4"/>
<FRAG2 frag_size="60K"/>
<pbcast.STREAMING_STATE_TRANSFER bind_addr="10.120.180.64"
bind_interface="eth10" bind_port="7810"
socket_buffer_size="16384" use_default_transport="false"/>
</config>
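To separate a JGroups problem from a plain Java one, it may help to run a bare multicast receiver on the same group and port, bound to the teamed address, and see whether packets keep arriving after the NIC switch. The following is a minimal sketch under that assumption, using only java.net; the class name McastProbe and the helper groupAddress are hypothetical. The group, port and bind address are taken from the configuration above. Since the port would be shared with a running JGroups instance, the probe is best run with JGroups stopped.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.net.UnknownHostException;
import java.util.Date;

// Hypothetical probe, not part of JGroups: joins the group on one interface and logs arrivals.
class McastProbe {

    // Builds the group socket address and rejects anything that is not a multicast address.
    static InetSocketAddress groupAddress(String addr, int port) {
        try {
            InetAddress group = InetAddress.getByName(addr);
            if (!group.isMulticastAddress()) {
                throw new IllegalArgumentException(addr + " is not a multicast address");
            }
            return new InetSocketAddress(group, port);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot resolve " + addr, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Values from the configuration above; the bind address can be overridden by args[0].
        String bind = args.length > 0 ? args[0] : "10.120.180.64";
        InetSocketAddress group = groupAddress("228.8.8.8", 45588);

        NetworkInterface nic = NetworkInterface.getByInetAddress(InetAddress.getByName(bind));
        if (nic == null) {
            throw new IllegalStateException("no interface has address " + bind);
        }

        try (MulticastSocket sock = new MulticastSocket(group.getPort())) {
            sock.joinGroup(group, nic); // join on the chosen interface only
            byte[] buf = new byte[65535];
            while (true) {
                DatagramPacket p = new DatagramPacket(buf, buf.length);
                sock.receive(p);
                System.out.printf("%tT.%<tL %d bytes from %s%n",
                        new Date(), p.getLength(), p.getSocketAddress());
            }
        }
    }
}
```

If this receiver also stops logging packets while the heartbeats are visible on the other card in Wireshark, the problem is more likely below JGroups.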
Have you ever heard of problems with NIC teaming?
Thanks.
Pascal BROUWET