[jboss-jira] [JBoss JIRA] (JGRP-1863) Excessive dropped messages due to missing physical address

Monday, 28 July 2014

    [ https://issues.jboss.org/browse/JGRP-1863?page=com.atlassian.jira.plugin.... ]

Paul Ferraro commented on JGRP-1863:
------------------------------------

I do think we will, since I plan to add FORK (whether implicitly or explicitly, I
haven't decided) to our default stacks.  RELAY2 is not in the default stacks, but our
relay tests (which add it to the stack) should exercise the compatibility of the two when
the time comes.

...
 Excessive dropped messages due to missing physical address
 ----------------------------------------------------------

                 Key: JGRP-1863
                 URL: https://issues.jboss.org/browse/JGRP-1863
             Project: JGroups
          Issue Type: Bug
      Security Level: Public(Everyone can see) 
    Affects Versions: 3.5
            Reporter: Paul Ferraro
            Assignee: Bela Ban
            Priority: Blocker
             Fix For: 3.5

 When running the x-site replication tests (and only those tests - the others run fine)
from the clustering testsuite in WildFly against JGroups 3.5, I encounter failures due
to:
 {noformat}
 12:15:48,537 WARN  [org.infinispan.xsite.BackupSenderImpl] (default task-1) ISPN000202:
Problems backing up data for cache dist to site SFO:
org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a
response from SFO (sync, timeout=10000)
 {noformat}
 The logs preceding this indicate the cause of the timeout:
 {noformat}
 12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp)
JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
 12:15:38,536 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp)
JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
 12:15:39,506 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp)
JGRP000032: null: no physical address for SiteMaster(SFO), dropping message
 12:15:39,507 WARN  [org.jgroups.protocols.UDP] (TransferQueueBundler,shared=udp)
JGRP000032: null: no physical address for SiteMaster(NYC), dropping message
 {noformat}
 These messages repeat roughly 100 times over a period of 10 seconds.
 A little investigation reveals that the process for fetching the physical address for a
given logical destination address has changed.  In 3.4, a given call to
sendToSingleMember(...) would attempt to look up the physical address by sending an
Event.GET_PHYSICAL_ADDRESS up the stack and waiting a predetermined period for a response.
Any concurrent calls to sendToSingleMember(...) would also wait, but only one thread in a
given time period would ever send the Event.GET_PHYSICAL_ADDRESS event up the stack.
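
 To make the 3.4 behaviour described above concrete, here is a minimal Java sketch of
the pattern: callers block on a cache miss, and only one request is issued per time
window.  This is not JGroups code.  The class BlockingLookupSketch, its methods, the
timings and the address "10.0.0.1:7600" are all assumptions made up for illustration,
and the throttle is global rather than per destination to keep it short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not JGroups code: all names and timings are assumptions.
class BlockingLookupSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition updated = lock.newCondition();
    private final long minRequestIntervalNanos;
    private long lastRequestNanos;
    private boolean requestedBefore;
    final AtomicInteger requestsSent = new AtomicInteger();

    BlockingLookupSketch(long minRequestIntervalMs) {
        this.minRequestIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minRequestIntervalMs);
    }

    // Returns the physical address, or null if none arrived within timeoutMs.
    String lookup(String logical, long timeoutMs, Runnable sendRequest) throws InterruptedException {
        String addr = cache.get(logical);
        if (addr != null) return addr;
        lock.lock();
        try {
            long now = System.nanoTime();
            // Only one caller per interval issues the request; the rest just wait.
            if (!requestedBefore || now - lastRequestNanos >= minRequestIntervalNanos) {
                requestedBefore = true;
                lastRequestNanos = now;
                requestsSent.incrementAndGet();
                sendRequest.run();
            }
            long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while ((addr = cache.get(logical)) == null && remaining > 0) {
                remaining = updated.awaitNanos(remaining);
            }
            return addr;
        } finally {
            lock.unlock();
        }
    }

    // Called when a response arrives: fill the cache and wake all waiters.
    void addressResolved(String logical, String physical) {
        lock.lock();
        try {
            cache.put(logical, physical);
            updated.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Four concurrent senders, one request; returns the number of requests sent.
    static int demo() {
        BlockingLookupSketch s = new BlockingLookupSketch(5_000);
        // The "request" answers itself after 50 ms with a placeholder address.
        Runnable request = () -> new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            s.addressResolved("SiteMaster(SFO)", "10.0.0.1:7600");
        }).start();
        AtomicInteger resolved = new AtomicInteger();
        Thread[] senders = new Thread[4];
        for (int i = 0; i < senders.length; i++) {
            senders[i] = new Thread(() -> {
                try {
                    if (s.lookup("SiteMaster(SFO)", 2_000, request) != null) resolved.incrementAndGet();
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            senders[i].start();
        }
        for (Thread t : senders) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        if (resolved.get() != senders.length) throw new IllegalStateException("a sender timed out");
        return s.requestsSent.get();
    }
}
```

 In this sketch, BlockingLookupSketch.demo() has all four senders obtain the address
while only a single request is issued, and no sender drops its message in the meantime.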
 In 3.5 the process is different.  In org.jgroups.protocols.TP, the FIND_MBRS event is
used to look up the physical addresses, instead of directly sending up a
GET_PHYSICAL_ADDRESS event.  However, looking at the implementation of the FIND_MBRS event
handling within org.jgroups.protocols.Discovery, I see that this triggers an asynchronous
GET_MBRS_REQ message.  Since this message is sent asynchronously, the
response from the original FIND_MBRS event will most certainly be empty.  Thus the thread
that initiated the FIND_MBRS will most certainly log the PhysicalAddrMissing warning, as
will any concurrent/subsequent calls to sendToSingleMember(...) for the same destination
until that asynchronous processing completes.  This is a departure from the logic in 3.4,
where the thread initiating the physical address lookup would wait for some time for the
address cache to be updated.  I would expect the PhysicalAddrMissing warnings to
stop once the original GET_MBRS_REQ message is handled, but that doesn't seem to be
happening (hence the 100 or so sequential warning messages over a period of 10 seconds
preceding the timeout log message from Infinispan).
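
 The non-blocking pattern described above can be sketched in a few lines of Java.
This is an illustration only, not the JGroups implementation.  AsyncLookupSketch and
its methods are hypothetical names, the address is a placeholder, and the asynchronous
discovery response is modelled as a queue that is drained later, so that the example is
deterministic.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch, not JGroups code: all names are assumptions.
class AsyncLookupSketch {
    private final Map<String, String> cache = new HashMap<>();
    // Discovery responses that have been requested but not yet processed.
    private final Queue<String[]> pendingResponses = new ArrayDeque<>();
    int dropped;
    int delivered;

    // Non-blocking send: on a cache miss, request discovery and drop the message.
    boolean send(String logical) {
        String physical = cache.get(logical);
        if (physical == null) {
            requestDiscoveryAsync(logical);
            dropped++; // the point at which a "dropping message" warning would be logged
            return false;
        }
        delivered++;
        return true;
    }

    // Stands in for the asynchronous discovery request: the answer is only queued.
    private void requestDiscoveryAsync(String logical) {
        pendingResponses.add(new String[] {logical, "10.0.0.1:7600"}); // placeholder address
    }

    // Stands in for the response being handled later, e.g. on another thread.
    void processResponses() {
        String[] r;
        while ((r = pendingResponses.poll()) != null) {
            cache.put(r[0], r[1]);
        }
    }

    public static void main(String[] args) {
        AsyncLookupSketch tp = new AsyncLookupSketch();
        tp.send("SiteMaster(SFO)"); // dropped: response not processed yet
        tp.send("SiteMaster(SFO)"); // dropped again, for the same reason
        tp.processResponses();      // the cache is populated only now
        tp.send("SiteMaster(SFO)"); // delivered
        System.out.println("dropped=" + tp.dropped + " delivered=" + tp.delivered);
    }
}
```

 Running main prints "dropped=2 delivered=1": every send issued before the response is
processed is dropped, and sends succeed afterwards.  The sketch assumes the response
does eventually populate the cache, which is exactly the step that does not appear to
happen in the failing tests.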
 Curiously, I see an org.jgroups.protocols.TP.setPingData(...) method, which seems to be
responsible for populating the physical address cache from the FIND_MBRS event results
from org.jgroups.protocols.Discovery - however, this method doesn't seem to be
referenced anywhere.  Might that be the source of the problem? 

--
This message was sent by Atlassian JIRA
(v6.2.6#6264)
