[jboss-jira] [JBoss JIRA] (JGRP-1467) synchronism between FD and UDP protocols

Wed May 30 06:10:18 EDT 2012

     [ https://issues.jboss.org/browse/JGRP-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban resolved JGRP-1467.
----------------------------

    Resolution: Out of Date

I don't support 2.4.x. If you see the same problem in 3.x, then re-open this issue.

> synchronism between FD and UDP protocols
> ----------------------------------------
>
>                 Key: JGRP-1467
>                 URL: https://issues.jboss.org/browse/JGRP-1467
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>         Environment: RHEL 4.8 32/64 bits
>            Reporter: Pablo Estebanez
>            Assignee: Bela Ban
>             Fix For: 3.1
>
>
> We've been suffering from problems with our JGroups cluster. We have 3 nodes, A (172.20.177.13:36441), B (172.20.177.14:55150) and C (172.20.177.15:47943), with A as coordinator. B begins to suspect C because C's reply to the are-you-alive message is apparently not received (although C has sent it).
> These are B's traces:
> 2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 172.20.177.15:47943 (own address=172.20.177.14:55150)
> 2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.15:47943 (src=172.20.177.14:55150), headers are {FD=[FD: heartbeat], UDP=[channel_name=AxisPartition]}
> 2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] [172.20.177.14:55150]: received no heartbeat ack from 172.20.177.15:47943 for 17 times (340000 milliseconds), suspecting it
> 2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[172.20.177.15:47943]] to group
> 2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] sending msg to null (src=172.20.177.14:55150), headers are {FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)], UDP=[channel_name=AxisPartition]}
> 2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] task done
> 2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] received (mcast) 137 bytes from 172.20.177.14:38864
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (ucast) 105 bytes from 172.20.177.15:47943
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (ucast) 105 bytes from 172.20.177.15:47943
> You can see UDP messages (105 bytes) arriving from node C one millisecond after B sent its are-you-alive, yet the FD protocol reports that no heartbeat ack was received.
> And these are C's traces:
> 2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.FD] received are-you-alive from 172.20.177.14:55150, sending response
> 2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.14:55150 (src=172.20.177.15:47943), headers are {FD=[FD: heartbeat ack], UDP=[channel_name=AxisPartition]}
> 2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] message is [dst: 224.1.2.3:45566, src: 172.20.177.14:55150 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=AxisPartition], FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]}
> 2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]
> 2012-05-17 13:56:59,248 WARN  [org.jgroups.protocols.FD] I was suspected by 172.20.177.14:55150; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
> 2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.14:55150 (src=172.20.177.15:47943), headers are {FD=[FD: heartbeat ack], UDP=[channel_name=AxisPartition]}
> 2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] received (ucast) 116 bytes from 172.20.177.13:36441
> 2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] message is [dst: 172.20.177.15:47943, src: 172.20.177.13:36441 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=AxisPartition], VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD]}
> 2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.FD] received msg from 172.20.177.13:36441 (counts as ack)
> 2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.13:36441 (src=172.20.177.15:47943), headers are {VERIFY_SUSPECT=[VERIFY_SUSPECT: I_AM_NOT_DEAD], UDP=[channel_name=AxisPartition]}
> You can see that C ignores the SUSPECT message and sends back a heartbeat ack, but that ack is not processed by B.
> Last, these are A's traces:
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (mcast) 137 bytes from 172.20.177.14:38864
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] message is [dst: 224.1.2.3:45566, src: 172.20.177.14:55150 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=AxisPartition], FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]}
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] verifying that 172.20.177.15:47943 is dead
> 2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.15:47943 (src=172.20.177.13:36441), headers are {VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD], UDP=[channel_name=AxisPartition]}
> 2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.UDP] received (ucast) 116 bytes from 172.20.177.15:47943
> 2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.UDP] message is [dst: 172.20.177.13:36441, src: 172.20.177.15:47943 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=AxisPartition], VERIFY_SUSPECT=[VERIFY_SUSPECT: I_AM_NOT_DEAD]}
> 2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] member 172.20.177.15:47943 is not dead !
> 2012-05-17 13:56:59,245 DEBUG [org.jgroups.protocols.FD] member is 172.20.177.15:47943
> 2012-05-17 13:56:59,245 DEBUG [org.jgroups.protocols.FD_SOCK] member is 172.20.177.15:47943
> A detected that the suspicion was wrong and, consequently, the cluster keeps all three members. But B is not working properly: the messages that the other nodes send to it are not received. As a result, the cluster is losing messages and users are being affected.
> Furthermore, because the cluster looks OK to the coordinator, there is no way to know that B is not working. I have reviewed every MBean related to the cluster, and all of them report the cluster as OK, with the three members.
> Is this a known issue?
> Is there any way to detect that B is not working at all?
> Thanks in advance.
> Best Regards,
> Pablo.
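
The manual comparison made in the description above (FD on B reports no heartbeat ack from C, while UDP on B logs unicast packets from C a millisecond later) can be automated offline against a trace file. The following is only a minimal sketch under stated assumptions: the class and method names are hypothetical, it relies solely on the log line layout shown in the traces above, and it does not use any JGroups API. It flags the symptom; it does not diagnose the cause.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JGroups: scans trace lines in the layout
// shown in the issue description.
class SuspectTraceCheck {
    // Timestamp layout used in the traces, e.g. "2012-05-17 13:56:59,243".
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // FD line announcing that a member is being suspected.
    private static final Pattern NO_ACK = Pattern.compile(
            "^(\\S+ \\S+) .*\\[org\\.jgroups\\.protocols\\.FD\\] "
            + ".*received no heartbeat ack from (\\S+) for");

    // UDP line logging a unicast packet received from a member.
    private static final Pattern UCAST = Pattern.compile(
            "^(\\S+ \\S+) .*\\[org\\.jgroups\\.protocols\\.UDP\\] "
            + "received \\(ucast\\) \\d+ bytes from (\\S+)");

    // Returns the members that FD suspected although a unicast packet from the
    // same address was logged at, or up to `window` after, the suspicion.
    static List<String> suspectedButStillSending(List<String> lines, Duration window) {
        List<String> flagged = new ArrayList<>();
        for (String line : lines) {
            Matcher noAck = NO_ACK.matcher(line);
            if (!noAck.find()) continue;
            LocalDateTime suspectedAt = LocalDateTime.parse(noAck.group(1), TS);
            String member = noAck.group(2);
            for (String other : lines) {
                Matcher ucast = UCAST.matcher(other);
                if (!ucast.find() || !ucast.group(2).equals(member)) continue;
                LocalDateTime receivedAt = LocalDateTime.parse(ucast.group(1), TS);
                Duration gap = Duration.between(suspectedAt, receivedAt);
                if (!gap.isNegative() && gap.compareTo(window) <= 0) {
                    if (!flagged.contains(member)) flagged.add(member);
                    break; // one matching packet is enough for this suspicion
                }
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Two lines copied from B's traces in the description.
        List<String> lines = Arrays.asList(
            "2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] [172.20.177.14:55150]: "
            + "received no heartbeat ack from 172.20.177.15:47943 for 17 times "
            + "(340000 milliseconds), suspecting it",
            "2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] "
            + "received (ucast) 105 bytes from 172.20.177.15:47943");
        System.out.println(suspectedButStillSending(lines, Duration.ofMillis(5)));
    }
}
```

The 5 ms window in main is an arbitrary choice for illustration. A unicast packet from the suspected address is not necessarily a heartbeat ack, so a flagged member is a hint to look closer, not proof that FD dropped an ack.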
