[infinispan-issues] [JBoss JIRA] (ISPN-11407) XSite backup commands should be sent from a blocking thread

Pedro Ruivo (Jira) issues at jboss.org
Mon Mar 2 12:34:22 EST 2020


Pedro Ruivo created ISPN-11407:
----------------------------------

             Summary: XSite backup commands should be sent from a blocking thread
                 Key: ISPN-11407
                 URL: https://issues.redhat.com/browse/ISPN-11407
             Project: Infinispan
          Issue Type: Enhancement
          Components: Core
    Affects Versions: 9.4.18.Final, 10.1.2.Final, 11.0.0.Alpha1
            Reporter: Dan Berindei
            Assignee: Dan Berindei
             Fix For: 9.4.19.Final


XSite backup commands usually need more processing on the receiving site than local cluster commands do on the receiving node, so {{channel.send(message)}} is much more likely to block.

{{UFC}}, {{UFC_NB}}, {{MFC}} and {{MFC_NB}} all block when there are not enough credits.
The _NB variants have an additional queue as a safety net, but that only delays the blocking: it's the same as increasing {{max_credits}} by {{max_queue_size}}, except with less work for {{UNICAST3}}/{{NAKACK2}}.
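
To illustrate the equivalence above, here is a deliberately simplified model of sender-side credits. It is not the JGroups implementation: the class and method names are hypothetical, all messages are assumed to have the same size, and no credits are replenished by the receiver. It only counts how many bytes a sender can hand off before the next send would block.

```java
public class CreditModel {
    /**
     * Bytes accepted before the next send would block, assuming the receiver
     * never replenishes credits. A maxQueueSize of 0 models the blocking
     * variants (UFC/MFC); a positive value models the _NB safety-net queue.
     */
    public static long bytesBeforeBlocking(long maxCredits, long maxQueueSize, long messageSize) {
        if (messageSize <= 0) {
            throw new IllegalArgumentException("messageSize must be positive");
        }
        long credits = maxCredits;
        long queued = 0;
        long accepted = 0;
        while (true) {
            if (credits >= messageSize) {
                credits -= messageSize;      // enough credits: sent right away
            } else if (queued + messageSize <= maxQueueSize) {
                queued += messageSize;       // no credits left: parked in the extra queue
            } else {
                return accepted;             // queue is full too: the caller would block here
            }
            accepted += messageSize;
        }
    }
}
```

In this model, {{bytesBeforeBlocking(2000, 1000, 100)}} and {{bytesBeforeBlocking(3000, 0, 100)}} return the same value, which is the sense in which the queue only postpones the blocking.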

{{TCP}} and {{UDP}} also block if their send buffer is full. Using a bundler like {{transfer-queue}} instead of the default {{no-bundler}} will only delay the blocking until the bundler's queue is also full.

The biggest problem arises when XSite backup commands are sent from a JGroups thread and {{channel.send(message)}} blocks that thread. If the JGroups thread pool fills up, it cannot process any more messages, not even responses from the remote site.
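
A minimal sketch of the idea in the summary, assuming a dedicated executor for blocking work: the potentially blocking send is handed off so that the calling thread (for example a JGroups thread) returns immediately. The {{BlockingSend}} interface and the {{BlockingSendOffload}} class are hypothetical stand-ins for illustration only; they are not Infinispan or JGroups APIs, and the actual fix may look quite different.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;

public class BlockingSendOffload {
    /** Hypothetical stand-in for a call such as channel.send(message) that may block. */
    public interface BlockingSend {
        void send(String message) throws Exception;
    }

    private final ExecutorService blockingExecutor;
    private final BlockingSend sender;

    public BlockingSendOffload(ExecutorService blockingExecutor, BlockingSend sender) {
        this.blockingExecutor = blockingExecutor;
        this.sender = sender;
    }

    /**
     * Runs the send on the blocking executor. The caller gets a future and
     * is never parked waiting for credits or for space in a send queue.
     */
    public CompletableFuture<Void> sendAsync(String message) {
        return CompletableFuture.runAsync(() -> {
            try {
                sender.send(message);
            } catch (Exception e) {
                // Surface the failure through the returned future
                throw new CompletionException(e);
            }
        }, blockingExecutor);
    }
}
```

With this shape, a thread that receives a command and needs to back it up to the remote site would call {{sendAsync(...)}} and go back to processing incoming messages. The executor still needs its own limits, otherwise the blocking is simply moved into an unbounded task queue.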

JGroups creates temporary threads to process internal messages when its thread pool is full, but not even that can help when the other nodes' thread pools are also full:

{noformat}
"jgroups-temp-thread-5728,_ma267mlvjdg015:dal_mcom_perf" #11443 prio=5 os_prio=0 tid=0x000000000906f800 nid=0x26cb waiting on condition [0x00007fb0b7b0a000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000005f3bce048> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
    at org.jgroups.protocols.TransferQueueBundler.send(TransferQueueBundler.java:97)
    at org.jgroups.protocols.TP.send(TP.java:1441)
    at org.jgroups.protocols.TP._send(TP.java:1195)
    at org.jgroups.protocols.TP.down(TP.java:1111)
    ...
    at org.jgroups.protocols.FlowControl.sendCredit(FlowControl.java:480)
    at org.jgroups.protocols.FlowControl.handleCreditRequest(FlowControl.java:469)
    at org.jgroups.protocols.FlowControl.handleUpEvent(FlowControl.java:379)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:350)
{noformat}




--
This message was sent by Atlassian Jira
(v7.13.8#713008)

