[infinispan-issues] [JBoss JIRA] (ISPN-11373) XSite backup commands should be sent from a blocking thread

Mon Mar 2 12:34:23 EST 2020

     [ https://issues.redhat.com/browse/ISPN-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pedro Ruivo updated ISPN-11373:
-------------------------------
        Status: Resolved  (was: Pull Request Sent)
    Resolution: Done


> XSite backup commands should be sent from a blocking thread
> -----------------------------------------------------------
>
>                 Key: ISPN-11373
>                 URL: https://issues.redhat.com/browse/ISPN-11373
>             Project: Infinispan
>          Issue Type: Enhancement
>          Components: Core
>    Affects Versions: 9.4.18.Final, 10.1.2.Final, 11.0.0.Alpha1
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>            Priority: Major
>             Fix For: 9.4.19.Final
>
>
> XSite backup commands usually need more processing on the receiving site than local cluster commands do on the receiving node, so there is a much higher chance that {{channel.send(message)}} blocks.
> {{UFC}}, {{UFC_NB}}, {{MFC}} and {{MFC_NB}} all block when there are not enough credits.
> The _NB variants have an additional queue as a safety net, but that only delays the blocking: it's the same as increasing {{max_credits}} by {{max_queue_size}}, except with less work for {{UNICAST3}}/{{NAKACK2}}.
> {{TCP}} and {{UDP}} also block if their send buffer is full. Using a bundler like {{transfer-queue}} instead of the default {{no-bundler}} will only delay the blocking until the bundler's queue is also full.
> The biggest problem is when XSite backup commands are sent from a JGroups thread and {{channel.send(message)}} blocks that thread. If the JGroups thread pool becomes full, it cannot process any more messages, not even responses from the remote site.
> JGroups creates temporary threads to process internal messages when its thread pool is full, but not even that can help when the other nodes' thread pools are also full:
> {noformat}
> "jgroups-temp-thread-5728,_ma267mlvjdg015:dal_mcom_perf" #11443 prio=5 os_prio=0 tid=0x000000000906f800 nid=0x26cb waiting on condition [0x00007fb0b7b0a000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x00000005f3bce048> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
>     at org.jgroups.protocols.TransferQueueBundler.send(TransferQueueBundler.java:97)
>     at org.jgroups.protocols.TP.send(TP.java:1441)
>     at org.jgroups.protocols.TP._send(TP.java:1195)
>     at org.jgroups.protocols.TP.down(TP.java:1111)
>     ...
>     at org.jgroups.protocols.FlowControl.sendCredit(FlowControl.java:480)
>     at org.jgroups.protocols.FlowControl.handleCreditRequest(FlowControl.java:469)
>     at org.jgroups.protocols.FlowControl.handleUpEvent(FlowControl.java:379)
>     at org.jgroups.protocols.FlowControl.up(FlowControl.java:350)
> {noformat}
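
The problem described above, a message-processing thread parked inside a blocking send, can be illustrated with a minimal Java sketch. It uses only java.util.concurrent. BlockingTransport, sendBackupAsync and the two pools are hypothetical stand-ins chosen for illustration, not Infinispan or JGroups APIs, and the actual change made for ISPN-11373 may look quite different. The idea is simply that the processing thread hands the send to a separate blocking executor and stays free to handle responses.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BlockingSendOffload {

    /** Hypothetical stand-in for a transport whose send may block, e.g. while waiting for credits. */
    interface BlockingTransport {
        void send(String message) throws InterruptedException;
    }

    /**
     * Runs the potentially blocking send on blockingExecutor and returns immediately,
     * so the calling thread is never parked inside send().
     */
    static CompletableFuture<Void> sendBackupAsync(BlockingTransport transport,
                                                   String message,
                                                   ExecutorService blockingExecutor) {
        return CompletableFuture.runAsync(() -> {
            try {
                transport.send(message);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("send interrupted", e);
            }
        }, blockingExecutor);
    }

    /**
     * Demo: a single-thread pool (assumed stand-in for an exhausted message-processing pool)
     * first issues a backup send on a stalled transport, then handles a response.
     * Returns true if the response was handled while the send was still blocked.
     */
    static boolean responseHandledWhileSendBlocked() throws Exception {
        ExecutorService processingPool = Executors.newSingleThreadExecutor();
        ExecutorService blockingPool = Executors.newCachedThreadPool();
        CountDownLatch credits = new CountDownLatch(1); // the send blocks until this opens
        BlockingTransport stalled = message -> credits.await();
        try {
            // Task 1: the processing thread starts the send and gets a pending future back.
            Callable<CompletableFuture<Void>> sendTask =
                    () -> sendBackupAsync(stalled, "backup", blockingPool);
            Future<CompletableFuture<Void>> sendSubmitted = processingPool.submit(sendTask);

            // Task 2: the same thread must still be able to process a response.
            Callable<String> responseTask = () -> "response handled";
            String result = processingPool.submit(responseTask).get(5, TimeUnit.SECONDS);

            CompletableFuture<Void> pendingSend = sendSubmitted.get(5, TimeUnit.SECONDS);
            boolean sendStillBlocked = !pendingSend.isDone();

            credits.countDown();                   // "credits" arrive, the send can finish
            pendingSend.get(5, TimeUnit.SECONDS);  // wait for the offloaded send to complete
            return sendStillBlocked && "response handled".equals(result);
        } finally {
            processingPool.shutdownNow();
            blockingPool.shutdownNow();
        }
    }
}
```

If sendBackupAsync called transport.send directly instead of going through the blocking executor, the single processing thread in this sketch would stay parked in send() and the response task would never run, which mirrors the full thread pool scenario in the report.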

