[jboss-jira] [JBoss JIRA] (JGRP-1863) Excessive dropped messages due to missing physical address

Fri Jul 25 10:05:30 EDT 2014

     [ https://issues.jboss.org/browse/JGRP-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Ferraro closed JGRP-1863.
------------------------------

    Resolution: Cannot Reproduce Bug

I'm going to close this for the time being.  There seems to be an issue on my end with RELAY2 not getting added to the stack.  Did something change in 3.5 relating to programmatic configuration?  In WildFly, the bulk of the protocol stack is created via a Configurator, but RELAY2 is added dynamically to the existing stack via ProtocolStack.addProtocol(...).

> Excessive dropped messages due to missing physical address
> ----------------------------------------------------------
>
>                 Key: JGRP-1863
>                 URL: https://issues.jboss.org/browse/JGRP-1863
>             Project: JGroups
>          Issue Type: Bug
>      Security Level: Public(Everyone can see) 
>    Affects Versions: 3.5
>            Reporter: Paul Ferraro
>            Assignee: Bela Ban
>            Priority: Blocker
>             Fix For: 3.5
>
>
> When running the x-site replication tests (and only those tests - the others run fine) from the clustering testsuite in WildFly against JGroups 3.5, I encounter failures due to:
> {noformat}
> 12:15:48,537 WARN  [org.infinispan.xsite.BackupSenderImpl] (default task-1) ISPN000202: Problems backing up data for cache dist to site SFO: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from SFO (sync, timeout=10000)
> {noformat}
> The logs preceding this indicate the cause of the timeout:
> {noformat}
> 12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
> 12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
> 12:15:39,506 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
> 12:15:39,507 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp) JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
> {noformat}
> These messages repeat roughly 100 times over a period of 10 seconds.
> A little investigation reveals that the process for fetching the physical address of a given logical destination address has changed.  In 3.4, a call to sendToSingleMember(...) would attempt to look up the physical address by sending an Event.GET_PHYSICAL_ADDRESS up the stack and waiting a predetermined period for a response.  Any concurrent calls to sendToSingleMember(...) would also wait, but only one thread in a given time period would ever send the Event.GET_PHYSICAL_ADDRESS event up the stack.
> In 3.5 the process is different.  In org.jgroups.protocols.TP, the FIND_MBRS event is used to look up the physical addresses, instead of directly sending up a GET_PHYSICAL_ADDRESS event.  However, looking at the implementation of the FIND_MBRS event handling within org.jgroups.protocols.Discovery, I see that this triggers an asynchronous GET_MBRS_REQ message.  Since this message is sent asynchronously, the response from the original FIND_MBRS event will most certainly be empty.  Thus the thread that initiated the FIND_MBRS will most certainly log the PhysicalAddrMissing warning, as will any concurrent or subsequent calls to sendToSingleMember(...) for the same destination until that asynchronous processing completes.  This is a departure from the logic in 3.4, where the thread initiating the physical address lookup would wait for some time for the address cache to be updated.  I would expect the PhysicalAddrMissing warnings to stop once the original GET_MBRS_REQ message is handled, but that doesn't seem to be happening (hence the roughly 100 sequential warning messages over a period of 10 seconds preceding the timeout log message from Infinispan).
> Curiously, I see an org.jgroups.protocols.TP.setPingData(...) method, which seems to be responsible for populating the physical address cache from the FIND_MBRS event results from org.jgroups.protocols.Discovery - however, this method doesn't seem to be referenced anywhere.  Might that be the source of the problem?
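
The difference between the two lookup strategies described in the report can be illustrated with a small toy model. This is only a sketch in plain Java, not JGroups code: the class AddressLookupSketch, its fields, the fake "addr-of-..." addresses and the simulated discovery delay are all hypothetical, and the only names taken from the report are the SiteMaster(SFO) destination and the general idea of a physical address cache. It models just the initial drops that a non-waiting sender would be expected to log, not the persistent warnings reported above.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the two lookup strategies; all names are hypothetical. */
class AddressLookupSketch {
    // Stand-in for a logical -> physical address cache.
    final Map<String, String> cache = new ConcurrentHashMap<>();
    // At most one simulated discovery round per destination.
    final Map<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();
    final AtomicInteger lookupsStarted = new AtomicInteger();
    final AtomicInteger dropped = new AtomicInteger();
    final long discoveryDelayMs;

    AddressLookupSketch(long discoveryDelayMs) {
        this.discoveryDelayMs = discoveryDelayMs;
    }

    /** Starts a simulated discovery round unless one already exists for dest. */
    CompletableFuture<String> lookup(String dest) {
        return inFlight.computeIfAbsent(dest, d -> {
            lookupsStarted.incrementAndGet();
            // The "response" arrives after discoveryDelayMs and fills the cache.
            return CompletableFuture.supplyAsync(() -> {
                String addr = "addr-of-" + d; // fake physical address
                cache.put(d, addr);
                return addr;
            }, CompletableFuture.delayedExecutor(discoveryDelayMs, TimeUnit.MILLISECONDS));
        });
    }

    /** Waiting style: the sender blocks until the lookup completes or times out. */
    boolean sendBlocking(String dest, long timeoutMs) {
        if (cache.containsKey(dest)) return true;
        try {
            lookup(dest).get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException | TimeoutException e) {
            // fall through: no address could be obtained in time
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Non-waiting style: trigger the lookup, then drop if the cache is still empty. */
    boolean sendNonBlocking(String dest) {
        if (cache.containsKey(dest)) return true;
        lookup(dest);              // fire and forget
        dropped.incrementAndGet(); // no address yet, so this message is dropped
        return false;
    }

    public static void main(String[] args) {
        AddressLookupSketch waiting = new AddressLookupSketch(200);
        for (int i = 0; i < 5; i++) waiting.sendBlocking("SiteMaster(SFO)", 2000);
        System.out.println("waiting:     dropped=" + waiting.dropped
                + " lookups=" + waiting.lookupsStarted);

        AddressLookupSketch nonWaiting = new AddressLookupSketch(200);
        for (int i = 0; i < 5; i++) nonWaiting.sendNonBlocking("SiteMaster(SFO)");
        System.out.println("non-waiting: dropped=" + nonWaiting.dropped
                + " lookups=" + nonWaiting.lookupsStarted);
    }
}
```

With the delays chosen here, the waiting variant should deliver all five messages after a single lookup, while the non-waiting variant should drop every message sent before the simulated response arrives. In this model the drops stop as soon as the cache is populated, which is the behaviour the report expected but did not observe.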
