[jboss-jira] [JBoss JIRA] (JGRP-1863) Excessive dropped messages due to missing physical address

Thu Jul 24 13:21:30 EDT 2014

     [ https://issues.jboss.org/browse/JGRP-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Ferraro updated JGRP-1863:
-------------------------------

    Description: 
When running the x-site replication tests from the clustering testsuite in WildFly, I encounter failures due to:
{noformat}
12:15:48,537 WARN  [org.infinispan.xsite.BackupSenderImpl] (default task-1) ISPN000202: Problems backing up data for cache dist to site SFO: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from SFO (sync, timeout=10000)
{noformat}
The logs preceding this indicate the cause of the timeout:
{noformat}
12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
12:15:39,506 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
12:15:39,507 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
{noformat}
These messages repeat about 100 or so times over a period of 10 seconds.

A little investigation reveals that the process for fetching physical addresses for a given logical destination address has changed.  In 3.4, a given call to sendToSingleMember(...) would attempt to lookup the physical address by sending a Event.GET_PHYSICAL_ADDRESS up the stack and wait a predetermined period for a response.  Any concurrent calls to sendToSingleMember(...) would also wait, but only one thread in a given time period would ever send the Event.GET_PHYSICAL_ADDRESS event up the stack.

In 3.5 the process is different.  In org.jgroups.protocols.TP, the FIND_MBRS event is used to lookup the phsyical addresses, instead of directly sending up a GET_PHYSICAL_ADDRESS event.  However, looking at the implementation of the FIND_MBRS event handling within org.jgroups.protocols.Discovery, I see that this triggers a asynchronous GET_MBRS_REQ message.  Since this message is sent asynchronously, this means that the response from the original FIND_MBRS event will most certainly be empty.   Thus the thread that initiated the FIND_MBRS will most certainly log the PhysicalAddrMissing warning, as will any concurrent/subsequent calls to sendToSingleMember(...) for the same destination until that asynchronous processing completes.  This is a departure from the logic in 3.4, where the thread initiating the physical address lookup would wait for some time for the address cache to be updated.  I should think that the PhysicalAddrMissing warnings should stop once the original GET_MBRS_REQ message is handled, but that doesn't seem to be happening (hence the 100 or so sequential warning messages over a period of 10 seconds preceding the timeout log message from infinispan).

Curiously, I see a org.jgroups.protocols.TP.setPingData(...) method, which seems to be responsible for populating the physical address cache from the FIND_MBRS event results from org.jgroups.protocols.Discovery - however, this method doesn't seem to be referenced anywhere.  Might that be the source of the problem?

  was:
When running the x-site replication tests from the clustering testsuite in WildFly, I encounter failures due to:
{noformat}
12:15:48,537 WARN  [org.infinispan.xsite.BackupSenderImpl] (default task-1) ISPN000202: Problems backing up data for cache dist to site SFO: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from SFO (sync, timeout=10000)
{noformat}
The logs preceding this indicate the cause of the timeout:
{noformat}
12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
12:15:39,506 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
12:15:39,507 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
{noformat}
These messages repeat about 100 times.

A little investigation reveals that the process for fetching physical addresses for a given logical destination address has changed.  In 3.4, a given call to sendToSingleMember(...) would attempt to lookup the physical address by sending a Event.GET_PHYSICAL_ADDRESS up the stack and wait a predetermined period for a response.  Any concurrent calls to sendToSingleMember(...) would also wait, but only one thread in a given time period would ever send the Event.GET_PHYSICAL_ADDRESS event up the stack.

In 3.5 the process is different.  In org.jgroups.protocols.TP, the FIND_MBRS event is used to lookup the phsyical addresses, instead of directly sending up a GET_PHYSICAL_ADDRESS event.  However, looking at the implementation of the FIND_MBRS event handling within org.jgroups.protocols.Discovery, I see that this triggers a asynchronous GET_MBRS_REQ message.  Since this message is sent asynchronously, this means that the response from the original FIND_MBRS event will most certainly be empty.   Thus the thread that initiated the FIND_MBRS will most certainly log the PhysicalAddrMissing warning, as will any concurrent/subsequent calls to sendToSingleMember(...) for the same destination until that asynchronous processing completes.  This is a departure from the logic in 3.4, where the thread initiating the physical address lookup would wait for some time for the address cache to be updated.  I should think that the PhysicalAddrMissing warnings should stop once the original GET_MBRS_REQ message is handled, but that doesn't seem to be happening (hence the 100 or so sequential warning messages over a period of 10 seconds preceding the timeout log message from infinispan).

Curiously, I see a org.jgroups.protocols.TP.setPingData(...) method, which seems to be responsible for populating the physical address cache from the FIND_MBRS event results from org.jgroups.protocols.Discovery - however, this method doesn't seem to be referenced anywhere.  Might that be the source of the problem?

> Excessive dropped messages due to missing physical address
> ----------------------------------------------------------
>
>                 Key: JGRP-1863
>                 URL: https://issues.jboss.org/browse/JGRP-1863
>             Project: JGroups
>          Issue Type: Bug
>      Security Level: Public(Everyone can see) 
>    Affects Versions: 3.5
>            Reporter: Paul Ferraro
>            Assignee: Bela Ban
>            Priority: Blocker
>             Fix For: 3.5
>
>
> When running the x-site replication tests from the clustering testsuite in WildFly, I encounter failures due to:
> {noformat}
> 12:15:48,537 WARN  [org.infinispan.xsite.BackupSenderImpl] (default task-1) ISPN000202: Problems backing up data for cache dist to site SFO: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from SFO (sync, timeout=10000)
> {noformat}
> The logs preceding this indicate the cause of the timeout:
> {noformat}
> 12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
> 12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
> 12:15:39,506 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
> 12:15:39,507 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
> {noformat}
> These messages repeat about 100 or so times over a period of 10 seconds.
> A little investigation reveals that the process for fetching physical addresses for a given logical destination address has changed.  In 3.4, a given call to sendToSingleMember(...) would attempt to lookup the physical address by sending a Event.GET_PHYSICAL_ADDRESS up the stack and wait a predetermined period for a response.  Any concurrent calls to sendToSingleMember(...) would also wait, but only one thread in a given time period would ever send the Event.GET_PHYSICAL_ADDRESS event up the stack.
> In 3.5 the process is different.  In org.jgroups.protocols.TP, the FIND_MBRS event is used to lookup the phsyical addresses, instead of directly sending up a GET_PHYSICAL_ADDRESS event.  However, looking at the implementation of the FIND_MBRS event handling within org.jgroups.protocols.Discovery, I see that this triggers a asynchronous GET_MBRS_REQ message.  Since this message is sent asynchronously, this means that the response from the original FIND_MBRS event will most certainly be empty.   Thus the thread that initiated the FIND_MBRS will most certainly log the PhysicalAddrMissing warning, as will any concurrent/subsequent calls to sendToSingleMember(...) for the same destination until that asynchronous processing completes.  This is a departure from the logic in 3.4, where the thread initiating the physical address lookup would wait for some time for the address cache to be updated.  I should think that the PhysicalAddrMissing warnings should stop once the original GET_MBRS_REQ message is handled, but that doesn't seem to be happening (hence the 100 or so sequential warning messages over a period of 10 seconds preceding the timeout log message from infinispan).
> Curiously, I see a org.jgroups.protocols.TP.setPingData(...) method, which seems to be responsible for populating the physical address cache from the FIND_MBRS event results from org.jgroups.protocols.Discovery - however, this method doesn't seem to be referenced anywhere.  Might that be the source of the problem?

--
This message was sent by Atlassian JIRA
(v6.2.6#6264)