Dan Berindei commented on ISPN-11373:
-------------------------------------
I also opened a master PR (
), but I
closed it without merging because IRAC uses OOB messages even for asynchronous backups, so
the sender is much less likely to run out of credits.
Obviously {{JChannel.send()}} can still block when the TCP/UDP send buffer is full.
XSite backup commands should be sent from a blocking thread
-----------------------------------------------------------
Key: ISPN-11373
URL: https://issues.redhat.com/browse/ISPN-11373
Project: Infinispan
Issue Type: Enhancement
Components: Core
Affects Versions: 9.4.18.Final, 10.1.2.Final, 11.0.0.Alpha1
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Major
Fix For: 9.4.19.Final
XSite backup commands usually need more processing on the receiving site than local
cluster commands do on the receiving node, which means there's a much higher chance that
{{channel.send(message)}} will block.
{{UFC}}, {{UFC_NB}}, {{MFC}} and {{MFC_NB}} all block when there are not enough credits.
The _NB variants have an additional queue as a safety net, but that only delays the
blocking: it's the same as increasing {{max_credits}} by {{max_queue_size}}, except
with less work for {{UNICAST3}}/{{NAKACK2}}.
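The equivalence above can be illustrated with a toy model. This is only a sketch of the arithmetic, not the JGroups implementation: the class {{CreditModel}} and its method are hypothetical, credits and queue capacity are both counted in bytes, and the receiver is assumed never to send credits back.

```java
// Toy model of credit-based flow control with an optional overflow queue.
// Hypothetical illustration only; it does not use or mirror JGroups code.
final class CreditModel {

    /**
     * Returns how many bytes a sender can hand over before it would block,
     * assuming the receiver never replenishes credits.
     */
    static long bytesBeforeBlocking(long maxCredits, long maxQueueSize, long messageSize) {
        if (messageSize <= 0) {
            throw new IllegalArgumentException("messageSize must be positive");
        }
        long credits = maxCredits;
        long queued = 0;
        long accepted = 0;
        while (true) {
            if (credits >= messageSize) {
                // Enough credits: the message is sent right away.
                credits -= messageSize;
            } else if (queued + messageSize <= maxQueueSize) {
                // No credits left: the non-blocking variant queues the message.
                queued += messageSize;
            } else {
                // Neither credits nor queue space: the sender blocks here.
                return accepted;
            }
            accepted += messageSize;
        }
    }

    public static void main(String[] args) {
        // Message size chosen so that it divides both limits evenly.
        long withQueue = bytesBeforeBlocking(2_000, 2_000, 1_000);
        long creditsOnly = bytesBeforeBlocking(4_000, 0, 1_000);
        System.out.println("credits + queue: " + withQueue);
        System.out.println("credits only:    " + creditsOnly);
    }
}
```

In this model the queue only moves the point at which the sender blocks; it does not remove it.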
{{TCP}} and {{UDP}} also block if their send buffer is full. Using a bundler like
{{transfer-queue}} instead of the default {{no-bundler}} will only delay the blocking
until the bundler's queue is also full.
The biggest problem is when XSite backup commands are sent from a JGroups thread, and
{{channel.send(message)}} blocks the thread. If the JGroups thread pool becomes full, it
cannot process more messages, not even responses from the remote site.
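A minimal sketch of the idea in the issue title, handing the potentially blocking send to a separate executor so the calling thread returns immediately. This is not the actual Infinispan change: {{OffloadedSend}}, {{BlockingSend}} and {{sendAsync}} are hypothetical names, and a latch stands in for a send that is waiting for credits.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these types exist in Infinispan or JGroups.
final class OffloadedSend {

    /** Stand-in for a send call that may block, e.g. while waiting for credits. */
    interface BlockingSend {
        void send(String message) throws Exception;
    }

    private final ExecutorService blockingExecutor;
    private final BlockingSend transport;

    OffloadedSend(ExecutorService blockingExecutor, BlockingSend transport) {
        this.blockingExecutor = blockingExecutor;
        this.transport = transport;
    }

    /** Runs the send on the blocking executor; the caller is never blocked by it. */
    CompletableFuture<Void> sendAsync(String message) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        blockingExecutor.execute(() -> {
            try {
                transport.send(message);
                result.complete(null);
            } catch (Throwable t) {
                result.completeExceptionally(t);
            }
        });
        return result;
    }

    /** Returns true if sendAsync() returned while the send was still blocked. */
    static boolean callerReturnsWhileSendBlocked() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch credits = new CountDownLatch(1);
        try {
            // The simulated transport blocks until "credits" arrive.
            OffloadedSend sender = new OffloadedSend(executor, msg -> credits.await());
            CompletableFuture<Void> pending = sender.sendAsync("backup command");
            boolean returnedEarly = !pending.isDone();
            credits.countDown();                 // simulate credits arriving
            pending.get(5, TimeUnit.SECONDS);    // the send now completes
            return returnedEarly && pending.isDone();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("caller returned while send was blocked: "
                + callerReturnsWhileSendBlocked());
    }
}
```

The blocking executor can still fill up, so in this sketch the blocking is moved off the message-processing threads rather than eliminated.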
JGroups creates temporary threads to process internal messages when its thread pool is
full, but not even that can help when the other nodes' thread pools are also full:
{noformat}
"jgroups-temp-thread-5728,_ma267mlvjdg015:dal_mcom_perf" #11443 prio=5
os_prio=0 tid=0x000000000906f800 nid=0x26cb waiting on condition [0x00007fb0b7b0a000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000005f3bce048> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
at org.jgroups.protocols.TransferQueueBundler.send(TransferQueueBundler.java:97)
at org.jgroups.protocols.TP.send(TP.java:1441)
at org.jgroups.protocols.TP._send(TP.java:1195)
at org.jgroups.protocols.TP.down(TP.java:1111)
...
at org.jgroups.protocols.FlowControl.sendCredit(FlowControl.java:480)
at org.jgroups.protocols.FlowControl.handleCreditRequest(FlowControl.java:469)
at org.jgroups.protocols.FlowControl.handleUpEvent(FlowControl.java:379)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:350)
{noformat}