<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jul 30, 2014 at 12:00 PM, Pedro Ruivo <span dir="ltr">&lt;<a href="mailto:pedro@infinispan.org" target="_blank">pedro@infinispan.org</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><br>

<br>

On 07/30/2014 09:02 AM, Dan Berindei wrote:<br>

&gt;<br>

<br>

&gt;<br>

&gt; if your proposal is only meant to apply to non-tx caches, you are right<br>

&gt; you don&#39;t have to worry about multiple primary owners... most of the<br>

&gt; time. But when the primary owner changes, then you do have 2 primary<br>

&gt; owners (if the new primary owner installs the new topology first), and<br>

&gt; you do need to coordinate between the 2.<br>

&gt;<br>

<br>

</div>I think it is the same for transactional cache. I.e. the commands wait<br>

for the transaction data from the new topology to be installed. In the<br>

non-tx caches, the old primary owner will send the next &quot;sequence<br>

number&quot; to the new primary owner and only after that, the new primary<br>

owner starts to give the orders.<br></blockquote><div><br></div><div>I&#39;m not sure that&#39;s related, commands that wait for a newer topology do not block a thread since the ISPN-3527 fix.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

Otherwise, I can implement a total order version for non-tx caches and<br>

all the write serialization would be done in JGroups and Infinispan only<br>

has to apply the updates as soon as they are delivered.<br></blockquote><div><br></div><div>Right, that sounds quite interesting. But you&#39;d also need a less-blocking state transfer ;)</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div><div class="h5"><br>

&gt; Slightly related: we also considered generating a version number on the<br>

&gt; client for consistency when the HotRod client retries after a primary<br>

&gt; owner failure [1]. But the clients can&#39;t create a monotonic sequence<br>

&gt; number, so we couldn&#39;t use that version number for this.<br>

&gt;<br>

&gt; [1] <a href="https://issues.jboss.org/browse/ISPN-2956" target="_blank">https://issues.jboss.org/browse/ISPN-2956</a><br>

&gt;<br>

&gt;<br>

&gt;     Also I don&#39;t see it as an alternative to TOA, I rather expect it to<br>

&gt;     work nicely together: when TOA is enabled you could trust the<br>

&gt;     originating sequence source rather than generate a per-entry sequence,<br>

&gt;     and in neither case you need to actually use a Lock.<br>

&gt;     I haven&#39;t thought how the sequences would need to interact (if they<br>

&gt;     need), but they seem complementary to resolve different aspects, and<br>

&gt;     also both benefit from the same cleanup and basic structure.<br>

&gt;<br>

&gt;<br>

&gt; We don&#39;t acquire locks at all on the backup owners - either in tx or<br>

&gt; non-tx caches. If state transfer is in progress, we use<br>

&gt; ConcurrentHashMap.compute() to store tracking information, which uses a<br>

&gt; synchronized block, so I suppose we do acquire locks. I assume your<br>

&gt; proposal would require a DataContainer.compute() or something similar on<br>

&gt; the backups, to ensure that the version check and the replacement are<br>

&gt; atomic.<br>

&gt;<br>

&gt; I still think TOA does what you want for tx caches. Your proposal would<br>

&gt; only work for non-tx caches, so you couldn&#39;t use them together.<br>

&gt;<br>

&gt;<br>

&gt;      &gt;&gt; Another aspect is that the &quot;user thread&quot; on the primary owner<br>

&gt;     needs to<br>

&gt;      &gt;&gt; wait (at least until we improve further) and only proceed after ACK<br>

&gt;      &gt;&gt; from backup nodes, but this is better modelled through a state<br>

&gt;      &gt;&gt; machine. (Also discussed in Farnborough).<br>

&gt;      &gt;<br>

&gt;      &gt;<br>

&gt;      &gt; To be clear, I don&#39;t think keeping the user thread on the<br>

&gt;     originator blocked<br>

&gt;      &gt; until we have the write confirmations from all the backups is a<br>

&gt;     problem - a<br>

&gt;      &gt; sync operation has to block, and it also serves to rate-limit user<br>

&gt;      &gt; operations.<br>

&gt;<br>

&gt;<br>

&gt;     There are better ways to rate-limit than to make all operations slow;<br>

&gt;     we don&#39;t need to block a thread, we need to react on the reply from<br>

&gt;     the backup owners.<br>

&gt;     You still have an inherent rate-limit in the outgoing packet queues:<br>

&gt;     if these fill up, then and only then it&#39;s nice to introduce some back<br>

&gt;     pressure.<br>

&gt;<br>

&gt;<br>

&gt; Sorry, you got me confused when you called the thread on the primary<br>

&gt; owner a &quot;user thread&quot;. I agree that internal stuff can and should be<br>

&gt; asynchronous, callback based, but the user still has to see a<br>

&gt; synchronous blocking operation.<br>

&gt;<br>

&gt;<br>

&gt;      &gt; The problem appears when the originator is not the primary owner,<br>

&gt;     and the<br>

&gt;      &gt; thread blocking for backup ACKs is from the remote-executor pool<br>

&gt;     (or OOB,<br>

&gt;      &gt; when the remote-executor pool is exhausted).<br>

&gt;<br>

&gt;     Not following. I guess this is out of scope now that I clarified the<br>

&gt;     proposed solution is only to be applied between primary and backups?<br>

&gt;<br>

&gt;<br>

&gt; Yeah, I was just trying to clarify that there is no danger of exhausting<br>

&gt; the remote executor/OOB thread pools when the originator of the write<br>

&gt; command is the primary owner (as it happens in the HotRod server).<br>

&gt;<br>

&gt;<br>

&gt;      &gt;&gt;<br>

&gt;      &gt;&gt; It&#39;s also conceptually linked to:<br>

&gt;      &gt;&gt;  - <a href="https://issues.jboss.org/browse/ISPN-1599" target="_blank">https://issues.jboss.org/browse/ISPN-1599</a><br>

&gt;      &gt;&gt; As you need to separate the locks of entries from the effective user<br>

&gt;      &gt;&gt; facing lock, at least to implement transactions on top of this<br>

&gt;     model.<br>

&gt;      &gt;<br>

&gt;      &gt;<br>

&gt;      &gt; I think we fixed ISPN-1599 when we changed passivation to use<br>

&gt;      &gt; DataContainer.compute(). WDYT Pedro, is there anything else you&#39;d<br>

&gt;     like to do<br>

&gt;      &gt; in the scope of ISPN-1599?<br>

&gt;      &gt;<br>

&gt;      &gt;&gt;<br>

&gt;      &gt;&gt; I expect this to improve performance in a very significant way, but<br>

&gt;      &gt;&gt; it&#39;s getting embarrassing that it&#39;s still not done; at the next face<br>

&gt;      &gt;&gt; to face meeting we should also reserve some time for retrospective<br>

&gt;      &gt;&gt; sessions.<br>

&gt;      &gt;<br>

&gt;      &gt;<br>

&gt;      &gt; Implementing the state machine-based interceptor stack may give us a<br>

&gt;      &gt; performance boost, but I&#39;m much more certain that it&#39;s a very<br>

&gt;     complex, high<br>

&gt;      &gt; risk task... and we don&#39;t have a stable test suite yet :)<br>

&gt;<br>

&gt;     Cleaning up and removing some complexity such as<br>

&gt;     TooManyExecutorsException might help to get it stable, and keep it<br>

&gt;     there :)<br>

&gt;     BTW it was quite stable for me until you changed the JGroups UDP<br>

&gt;     default configuration.<br>

&gt;<br>

&gt;<br>

&gt; Do you really use UDP to run the tests? The default is TCP, but maybe<br>

&gt; the some tests doesn&#39;t use TestCacheManagerFactory...<br>

&gt;<br>

&gt; I was just aligning our configs with Bela&#39;s recommandations: MERGE3<br>

&gt; instead of MERGE2 and the removal of UFC in TCP stacks. If they cause<br>

&gt; problems on your machine, you should make more noise :)<br>

&gt;<br>

&gt; Dan<br>

&gt;<br>

&gt;     Sanne<br>

&gt;<br>

&gt;      &gt;<br>

&gt;      &gt;<br>

&gt;      &gt;&gt;<br>

&gt;      &gt;&gt;<br>

&gt;      &gt;&gt; Sanne<br>

&gt;      &gt;&gt;<br>

&gt;      &gt;&gt; On 29 July 2014 15:50, Bela Ban &lt;<a href="mailto:bban@redhat.com">bban@redhat.com</a><br>

</div></div><div class="">&gt;     &lt;mailto:<a href="mailto:bban@redhat.com">bban@redhat.com</a>&gt;&gt; wrote:<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt; On 29/07/14 16:42, Dan Berindei wrote:<br>

&gt;      &gt;&gt; &gt;&gt; Have you tried regular optimistic/pessimistic transactions as<br>

&gt;     well?<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt; Yes, in my first impl. but since I&#39;m making only 1 change per<br>

&gt;     request, I<br>

&gt;      &gt;&gt; &gt; thought a TX is overkill.<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt;&gt; They *should* have less issues with the OOB thread pool than<br>

&gt;     non-tx<br>

&gt;      &gt;&gt; &gt;&gt; mode, and<br>

&gt;      &gt;&gt; &gt;&gt; I&#39;m quite curious how they stack against TO in such a large<br>

&gt;     cluster.<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt; Why would they have fewer issues with the thread pools ? AIUI,<br>

&gt;     a TX<br>

&gt;      &gt;&gt; &gt; involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when<br>

&gt;     not using<br>

&gt;      &gt;&gt; &gt; TXs. And we&#39;re sync anyway...<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt;&gt; On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban &lt;<a href="mailto:bban@redhat.com">bban@redhat.com</a><br>

&gt;     &lt;mailto:<a href="mailto:bban@redhat.com">bban@redhat.com</a>&gt;<br>

</div><div><div class="h5">&gt;      &gt;&gt; &gt;&gt; &lt;mailto:<a href="mailto:bban@redhat.com">bban@redhat.com</a> &lt;mailto:<a href="mailto:bban@redhat.com">bban@redhat.com</a>&gt;&gt;&gt; wrote:<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     Following up on my own email, I changed the config to use<br>

&gt;     Pedro&#39;s<br>

&gt;      &gt;&gt; &gt;&gt;     excellent total order implementation:<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     &lt;transaction transactionMode=&quot;TRANSACTIONAL&quot;<br>

&gt;      &gt;&gt; &gt;&gt;     transactionProtocol=&quot;TOTAL_ORDER&quot; lockingMode=&quot;OPTIMISTIC&quot;<br>

&gt;      &gt;&gt; &gt;&gt;     useEagerLocking=&quot;true&quot; eagerLockSingleNode=&quot;true&quot;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;                   &lt;recovery enabled=&quot;false&quot;/&gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     With 100 nodes and 25 requester threads/node, I did NOT<br>

&gt;     run into<br>

&gt;      &gt;&gt; &gt;&gt; any<br>

&gt;      &gt;&gt; &gt;&gt;     locking issues !<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     I could even go up to 200 requester threads/node and the<br>

&gt;     perf was ~<br>

&gt;      &gt;&gt; &gt;&gt;     7&#39;000-8&#39;000 requests/sec/node. Not too bad !<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     This really validates the concept of lockless total-order<br>

&gt;      &gt;&gt; &gt;&gt; dissemination<br>

&gt;      &gt;&gt; &gt;&gt;     of TXs; for the first time, this has been tested on a<br>

&gt;     large(r)<br>

&gt;      &gt;&gt; &gt;&gt; scale<br>

&gt;      &gt;&gt; &gt;&gt;     (previously only on 25 nodes) and IT WORKS ! :-)<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     I still believe we should implement my suggested solution for<br>

&gt;      &gt;&gt; &gt;&gt; non-TO<br>

&gt;      &gt;&gt; &gt;&gt;     configs, but short of configuring thread pools of 1000<br>

&gt;     threads or<br>

&gt;      &gt;&gt; &gt;&gt;     higher, I hope TO will allow me to finally test a 500 node<br>

&gt;      &gt;&gt; &gt;&gt; Infinispan<br>

&gt;      &gt;&gt; &gt;&gt;     cluster !<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     On 29/07/14 15:56, Bela Ban wrote:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Hi guys,<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; sorry for the long post, but I do think I ran into an<br>

&gt;     important<br>

&gt;      &gt;&gt; &gt;&gt;     problem<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; and we need to fix it ... :-)<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; I&#39;ve spent the last couple of days running the<br>

&gt;     IspnPerfTest [1]<br>

&gt;      &gt;&gt; &gt;&gt;     perftest<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; on Google Compute Engine (GCE), and I&#39;ve run into a<br>

&gt;     problem with<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Infinispan. It is a design problem and can be mitigated by<br>

&gt;      &gt;&gt; &gt;&gt; sizing<br>

&gt;      &gt;&gt; &gt;&gt;     thread<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; pools correctly, but cannot be eliminated entirely.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Symptom:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; --------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; IspnPerfTest has every node in a cluster perform<br>

&gt;     20&#39;000 requests<br>

&gt;      &gt;&gt; &gt;&gt;     on keys<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; in range [1..20000].<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; 80% of the requests are reads and 20% writes.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; By default, we have 25 requester threads per node and<br>

&gt;     100 nodes<br>

&gt;      &gt;&gt; &gt;&gt; in a<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; cluster, so a total of 2500 requester threads.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The cache used is NON-TRANSACTIONAL / dist-sync / 2<br>

&gt;     owners:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; &lt;namedCache name=&quot;clusteredCache&quot;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        &lt;clustering mode=&quot;distribution&quot;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;            &lt;stateTransfer awaitInitialTransfer=&quot;true&quot;/&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;            &lt;hash numOwners=&quot;2&quot;/&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;            &lt;sync replTimeout=&quot;20000&quot;/&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        &lt;/clustering&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        &lt;transaction transactionMode=&quot;NON_TRANSACTIONAL&quot;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; useEagerLocking=&quot;true&quot;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;             eagerLockSingleNode=&quot;true&quot;  /&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        &lt;locking lockAcquisitionTimeout=&quot;5000&quot;<br>

&gt;      &gt;&gt; &gt;&gt; concurrencyLevel=&quot;1000&quot;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;                 isolationLevel=&quot;READ_COMMITTED&quot;<br>

&gt;      &gt;&gt; &gt;&gt;     useLockStriping=&quot;false&quot; /&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; &lt;/namedCache&gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; It has 2 owners, a lock acquisition timeout of 5s and<br>

&gt;     a repl<br>

&gt;      &gt;&gt; &gt;&gt;     timeout of<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; 20s. Lock stripting is off, so we have 1 lock per key.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; When I run the test, I always get errors like those below:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; org.infinispan.util.concurrent.TimeoutException: Unable to<br>

&gt;      &gt;&gt; &gt;&gt;     acquire lock<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; after [10 seconds] on key [19386] for requestor<br>

&gt;      &gt;&gt; &gt;&gt;     [Thread[invoker-3,5,main]]!<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; and<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; org.infinispan.util.concurrent.TimeoutException: Node<br>

&gt;     m8.1 timed<br>

&gt;      &gt;&gt; &gt;&gt; out<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Investigation:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ------------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; When I looked at UNICAST3, I saw a lot of missing<br>

&gt;     messages on<br>

&gt;      &gt;&gt; &gt;&gt; the<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; receive side and unacked messages on the send side.<br>

&gt;     This caused<br>

&gt;      &gt;&gt; &gt;&gt; me to<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; look into the (mainly OOB) thread pools and - voila -<br>

&gt;     maxed out<br>

&gt;      &gt;&gt; &gt;&gt; !<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; I learned from Pedro that the Infinispan internal<br>

&gt;     thread pool<br>

&gt;      &gt;&gt; &gt;&gt; (with a<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; default of 32 threads) can be configured, so I<br>

&gt;     increased it to<br>

&gt;      &gt;&gt; &gt;&gt;     300 and<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; increased the OOB pools as well.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; This mitigated the problem somewhat, but when I<br>

&gt;     increased the<br>

&gt;      &gt;&gt; &gt;&gt;     requester<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; threads to 100, I had the same problem again.<br>

&gt;     Apparently, the<br>

&gt;      &gt;&gt; &gt;&gt;     Infinispan<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; internal thread pool uses a rejection policy of &quot;run&quot;<br>

&gt;     and thus<br>

&gt;      &gt;&gt; &gt;&gt;     uses the<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; JGroups (OOB) thread when exhausted.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; I learned (from Pedro and Mircea) that GETs and PUTs<br>

&gt;     work as<br>

&gt;      &gt;&gt; &gt;&gt;     follows in<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; dist-sync / 2 owners:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - GETs are sent to the primary and backup owners and<br>

&gt;     the first<br>

&gt;      &gt;&gt; &gt;&gt;     response<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; received is returned to the caller. No locks are<br>

&gt;     acquired, so<br>

&gt;      &gt;&gt; &gt;&gt; GETs<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; shouldn&#39;t cause problems.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - A PUT(K) is sent to the primary owner of K<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - The primary owner<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        (1) locks K<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        (2) updates the backup owner synchronously<br>

&gt;     *while holding<br>

&gt;      &gt;&gt; &gt;&gt;     the lock*<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;        (3) releases the lock<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Hypothesis<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ----------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; (2) above is done while holding the lock. The sync<br>

&gt;     update of the<br>

&gt;      &gt;&gt; &gt;&gt;     backup<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; owner is done with the lock held to guarantee that the<br>

&gt;     primary<br>

&gt;      &gt;&gt; &gt;&gt; and<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; backup owner of K have the same values for K.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; However, the sync update *inside the lock scope* slows<br>

&gt;     things<br>

&gt;      &gt;&gt; &gt;&gt;     down (can<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; it also lead to deadlocks?); there&#39;s the risk that the<br>

&gt;     request<br>

&gt;      &gt;&gt; &gt;&gt; is<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; dropped due to a full incoming thread pool, or that<br>

&gt;     the response<br>

&gt;      &gt;&gt; &gt;&gt;     is not<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; received because of the same, or that the locking at<br>

&gt;     the backup<br>

&gt;      &gt;&gt; &gt;&gt; owner<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; blocks for some time.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; If we have many threads modifying the same key, then<br>

&gt;     we have a<br>

&gt;      &gt;&gt; &gt;&gt;     backlog<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; of locking work against that key. Say we have 100<br>

&gt;     requester<br>

&gt;      &gt;&gt; &gt;&gt;     threads and<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; a 100 node cluster. This means that we have 10&#39;000 threads<br>

&gt;      &gt;&gt; &gt;&gt; accessing<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; keys; with 2&#39;000 writers there&#39;s a big chance that<br>

&gt;     some writers<br>

&gt;      &gt;&gt; &gt;&gt;     pick the<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; same key at the same time.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; For example, if we have 100 threads accessing key K<br>

&gt;     and it takes<br>

&gt;      &gt;&gt; &gt;&gt;     3ms to<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; replicate K to the backup owner, then the last of the 100<br>

&gt;      &gt;&gt; &gt;&gt; threads<br>

&gt;      &gt;&gt; &gt;&gt;     waits<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ~300ms before it gets a chance to lock K on the<br>

&gt;     primary owner<br>

&gt;      &gt;&gt; &gt;&gt; and<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; replicate it as well.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Just a small hiccup in sending the PUT to the primary<br>

&gt;     owner,<br>

&gt;      &gt;&gt; &gt;&gt;     sending the<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; modification to the backup owner, waitting for the<br>

&gt;     response, or<br>

&gt;      &gt;&gt; &gt;&gt;     GC, and<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; the delay will quickly become bigger.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Verification<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ----------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; To verify the above, I set numOwners to 1. This means<br>

&gt;     that the<br>

&gt;      &gt;&gt; &gt;&gt;     primary<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; owner of K does *not* send the modification to the<br>

&gt;     backup owner,<br>

&gt;      &gt;&gt; &gt;&gt;     it only<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; locks K, modifies K and unlocks K again.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; I ran the IspnPerfTest again on 100 nodes, with 25<br>

&gt;     requesters,<br>

&gt;      &gt;&gt; &gt;&gt; and NO<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; PROBLEM !<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; I then increased the requesters to 100, 150 and 200<br>

&gt;     and the test<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; completed flawlessly ! Performance was around *40&#39;000<br>

&gt;     requests<br>

&gt;      &gt;&gt; &gt;&gt;     per node<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; per sec* on 4-core boxes !<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Root cause<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ---------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; *******************<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The root cause is the sync RPC of K to the backup<br>

&gt;     owner(s) of K<br>

&gt;      &gt;&gt; &gt;&gt; while<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; the primary owner holds the lock for K.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; *******************<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; This causes a backlog of threads waiting for the lock<br>

&gt;     and that<br>

&gt;      &gt;&gt; &gt;&gt;     backlog<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; can grow to exhaust the thread pools. First the Infinispan<br>

&gt;      &gt;&gt; &gt;&gt; internal<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; thread pool, then the JGroups OOB thread pool. The<br>

&gt;     latter causes<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; retransmissions to get dropped, which compounds the<br>

&gt;     problem...<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Goal<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ----<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The goal is to make sure that primary and backup<br>

&gt;     owner(s) of K<br>

&gt;      &gt;&gt; &gt;&gt;     have the<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; same value for K.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Simply sending the modification to the backup owner(s)<br>

&gt;      &gt;&gt; &gt;&gt; asynchronously<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; won&#39;t guarantee this, as modification messages might get<br>

&gt;      &gt;&gt; &gt;&gt;     processed out<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; of order as they&#39;re OOB !<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Suggested solution<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ----------------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The modification RPC needs to be invoked *outside of<br>

&gt;     the lock<br>

&gt;      &gt;&gt; &gt;&gt; scope*:<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - lock K<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - modify K<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - unlock K<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - send modification to backup owner(s) // outside the<br>

&gt;     lock scope<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The primary owner puts the modification of K into a<br>

&gt;     queue from<br>

&gt;      &gt;&gt; &gt;&gt;     where a<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; separate thread/task removes it. The thread then<br>

&gt;     invokes the<br>

&gt;      &gt;&gt; &gt;&gt;     PUT(K) on<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; the backup owner(s).<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The queue has the modified keys in FIFO order, so the<br>

&gt;      &gt;&gt; &gt;&gt; modifications<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; arrive at the backup owner(s) in the right order.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; This requires that the way GET is implemented changes<br>

&gt;     slightly:<br>

&gt;      &gt;&gt; &gt;&gt;     instead<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; of invoking a GET on all owners of K, we only invoke<br>

&gt;     it on the<br>

&gt;      &gt;&gt; &gt;&gt;     primary<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; owner, then the next-in-line etc.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The reason for this is that the backup owner(s) may<br>

&gt;     not yet have<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; received the modification of K.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; This is a better impl anyway (we discussed this<br>

&gt;     before) becuse<br>

&gt;      &gt;&gt; &gt;&gt; it<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; generates less traffic; in the normal case, all but 1 GET<br>

&gt;      &gt;&gt; &gt;&gt;     requests are<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; unnecessary.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Improvement<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; -----------<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; The above solution can be simplified and even made more<br>

&gt;      &gt;&gt; &gt;&gt; efficient.<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Re-using concepts from IRAC [2], we can simply store the<br>

&gt;      &gt;&gt; &gt;&gt; modified<br>

&gt;      &gt;&gt; &gt;&gt;     *keys*<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; in the modification queue. The modification<br>

&gt;     replication thread<br>

&gt;      &gt;&gt; &gt;&gt;     removes<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; the key, gets the current value and invokes a<br>

&gt;     PUT/REMOVE on the<br>

&gt;      &gt;&gt; &gt;&gt;     backup<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; owner(s).<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Even better: a key is only ever added *once*, so if we<br>

&gt;     have<br>

&gt;      &gt;&gt; &gt;&gt;     [5,2,17,3],<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; adding key 2 is a no-op because the processing of key<br>

&gt;     2 (in<br>

&gt;      &gt;&gt; &gt;&gt; second<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; position in the queue) will fetch the up-to-date value<br>

&gt;     anyway !<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Misc<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; ----<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; - Could we possibly use total order to send the<br>

&gt;     updates in TO ?<br>

&gt;      &gt;&gt; &gt;&gt;     TBD (Pedro?)<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; Thoughts ?<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;      &gt; [1] <a href="https://github.com/belaban/IspnPerfTest" target="_blank">https://github.com/belaban/IspnPerfTest</a><br>

&gt;      &gt;&gt; &gt;&gt;      &gt; [2]<br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;     <a href="https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering" target="_blank">https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering</a><br>

&gt;      &gt;&gt; &gt;&gt;      &gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;     --<br>

&gt;      &gt;&gt; &gt;&gt;     Bela Ban, JGroups lead (<a href="http://www.jgroups.org" target="_blank">http://www.jgroups.org</a>)<br>

&gt;      &gt;&gt; &gt;&gt;     _______________________________________________<br>

&gt;      &gt;&gt; &gt;&gt;     infinispan-dev mailing list<br>

&gt;      &gt;&gt; &gt;&gt; <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

&gt;     &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;<br>

</div></div>&gt;      &gt;&gt; &gt;&gt; &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

<div class="">&gt;     &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt; <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;&gt; _______________________________________________<br>

&gt;      &gt;&gt; &gt;&gt; infinispan-dev mailing list<br>

&gt;      &gt;&gt; &gt;&gt; <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

</div>&gt;     &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;<br>

<div class="">&gt;      &gt;&gt; &gt;&gt; <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;      &gt;&gt; &gt;&gt;<br>

&gt;      &gt;&gt; &gt;<br>

&gt;      &gt;&gt; &gt; --<br>

&gt;      &gt;&gt; &gt; Bela Ban, JGroups lead (<a href="http://www.jgroups.org" target="_blank">http://www.jgroups.org</a>)<br>

&gt;      &gt;&gt; &gt; _______________________________________________<br>

&gt;      &gt;&gt; &gt; infinispan-dev mailing list<br>

&gt;      &gt;&gt; &gt; <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

&gt;     &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;<br>

&gt;      &gt;&gt; &gt; <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;      &gt;&gt; _______________________________________________<br>

&gt;      &gt;&gt; infinispan-dev mailing list<br>

&gt;      &gt;&gt; <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

&gt;     &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;<br>

&gt;      &gt;&gt; <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;      &gt;<br>

&gt;      &gt;<br>

&gt;      &gt;<br>

&gt;      &gt; _______________________________________________<br>

&gt;      &gt; infinispan-dev mailing list<br>

&gt;      &gt; <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

&gt;     &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;<br>

&gt;      &gt; <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;     _______________________________________________<br>

&gt;     infinispan-dev mailing list<br>

</div>&gt;     <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a> &lt;mailto:<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>&gt;<br>

<div class="HOEnZb"><div class="h5">&gt;     <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; _______________________________________________<br>

&gt; infinispan-dev mailing list<br>

&gt; <a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

&gt; <a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

&gt;<br>

_______________________________________________<br>

infinispan-dev mailing list<br>

<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

<a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

</div></div></blockquote></div><br></div></div>