<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jul 29, 2014 at 4:56 PM, Bela Ban <span dir="ltr">&lt;<a href="mailto:bban@redhat.com" target="_blank">bban@redhat.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi guys,<br>

<br>

sorry for the long post, but I do think I ran into an important problem<br>

and we need to fix it ... :-)<br>

<br>

I&#39;ve spent the last couple of days running the IspnPerfTest [1] perftest<br>

on Google Compute Engine (GCE), and I&#39;ve run into a problem with<br>

Infinispan. It is a design problem and can be mitigated by sizing thread<br>

pools correctly, but cannot be eliminated entirely.<br>

<br>

<br>

Symptom:<br>

--------<br>

IspnPerfTest has every node in a cluster perform 20&#39;000 requests on keys<br>

in range [1..20000].<br>

<br>

80% of the requests are reads and 20% writes.<br>

<br>

By default, we have 25 requester threads per node and 100 nodes in a<br>

cluster, so a total of 2500 requester threads.<br>

<br>

The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:<br>

<br>

&lt;namedCache name=&quot;clusteredCache&quot;&gt;<br>

      &lt;clustering mode=&quot;distribution&quot;&gt;<br>

          &lt;stateTransfer awaitInitialTransfer=&quot;true&quot;/&gt;<br>

          &lt;hash numOwners=&quot;2&quot;/&gt;<br>

          &lt;sync replTimeout=&quot;20000&quot;/&gt;<br>

      &lt;/clustering&gt;<br>

<br>

      &lt;transaction transactionMode=&quot;NON_TRANSACTIONAL&quot;<br>

useEagerLocking=&quot;true&quot;<br>

           eagerLockSingleNode=&quot;true&quot;  /&gt;<br>

      &lt;locking lockAcquisitionTimeout=&quot;5000&quot; concurrencyLevel=&quot;1000&quot;<br>

               isolationLevel=&quot;READ_COMMITTED&quot; useLockStriping=&quot;false&quot; /&gt;<br>

&lt;/namedCache&gt;<br>

<br>

It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of<br>

20s. Lock stripting is off, so we have 1 lock per key.<br>

<br>

When I run the test, I always get errors like those below:<br>

<br>

org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock<br>

after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]!<br>

Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]<br>

<br>

and<br>

<br>

org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out<br>

<br>

<br>

Investigation:<br>

------------<br>

When I looked at UNICAST3, I saw a lot of missing messages on the<br>

receive side and unacked messages on the send side. This caused me to<br>

look into the (mainly OOB) thread pools and - voila - maxed out !<br>

<br>

I learned from Pedro that the Infinispan internal thread pool (with a<br>

default of 32 threads) can be configured, so I increased it to 300 and<br>

increased the OOB pools as well.<br>

<br>

This mitigated the problem somewhat, but when I increased the requester<br>

threads to 100, I had the same problem again. Apparently, the Infinispan<br>

internal thread pool uses a rejection policy of &quot;run&quot; and thus uses the<br>

JGroups (OOB) thread when exhausted.<br></blockquote><div><br></div><div>We can&#39;t use another rejection policy in the remote executor because the message won&#39;t be re-delivered by JGroups, and we can&#39;t use a queue either.<br>


</div><div><br></div><div>Pedro is working on ISPN-2849, which should help with the remote/OOB thread pool exhaustion. It is a bit tricky, though, because our interceptors assume they will be able to access stack variables after replication.</div>


<div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

I learned (from Pedro and Mircea) that GETs and PUTs work as follows in<br>

dist-sync / 2 owners:<br>

- GETs are sent to the primary and backup owners and the first response<br>

received is returned to the caller. No locks are acquired, so GETs<br>

shouldn&#39;t cause problems.<br>

<br>

- A PUT(K) is sent to the primary owner of K<br>

- The primary owner<br>

      (1) locks K<br>

      (2) updates the backup owner synchronously *while holding the lock*<br>

      (3) releases the lock<br>

<br>

<br>

Hypothesis<br>

----------<br>

(2) above is done while holding the lock. The sync update of the backup<br>

owner is done with the lock held to guarantee that the primary and<br>

backup owner of K have the same values for K.<br></blockquote><div><br></div><div>And something else: if the primary owner reports that a write was successful and then dies, a read should find the updated value on the backup owner(s).</div>


<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

However, the sync update *inside the lock scope* slows things down (can<br>

it also lead to deadlocks?); there&#39;s the risk that the request is<br>

dropped due to a full incoming thread pool, or that the response is not<br>

received because of the same, or that the locking at the backup owner<br>

blocks for some time.<br></blockquote><div><br></div><div>There is no locking on the backup owner, so there are no deadlocks.</div><div>There is indeed a risk of the OOB/remote thread pools being full.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<br>

If we have many threads modifying the same key, then we have a backlog<br>

of locking work against that key. Say we have 100 requester threads and<br>

a 100 node cluster. This means that we have 10&#39;000 threads accessing<br>

keys; with 2&#39;000 writers there&#39;s a big chance that some writers pick the<br>

same key at the same time.<br>

<br>

For example, if we have 100 threads accessing key K and it takes 3ms to<br>

replicate K to the backup owner, then the last of the 100 threads waits<br>

~300ms before it gets a chance to lock K on the primary owner and<br>

replicate it as well.<br>

<br>

Just a small hiccup in sending the PUT to the primary owner, sending the<br>

modification to the backup owner, waitting for the response, or GC, and<br>

the delay will quickly become bigger.<br>

<br>

<br>

Verification<br>

----------<br>

To verify the above, I set numOwners to 1. This means that the primary<br>

owner of K does *not* send the modification to the backup owner, it only<br>

locks K, modifies K and unlocks K again.<br>

<br>

I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO<br>

PROBLEM !<br>

<br>

I then increased the requesters to 100, 150 and 200 and the test<br>

completed flawlessly ! Performance was around *40&#39;000 requests per node<br>

per sec* on 4-core boxes !<br>

<br>

<br>

Root cause<br>

---------<br>

*******************<br>

The root cause is the sync RPC of K to the backup owner(s) of K while<br>

the primary owner holds the lock for K.<br>

*******************<br>

<br>

This causes a backlog of threads waiting for the lock and that backlog<br>

can grow to exhaust the thread pools. First the Infinispan internal<br>

thread pool, then the JGroups OOB thread pool. The latter causes<br>

retransmissions to get dropped, which compounds the problem...<br>

<br>

<br>

Goal<br>

----<br>

The goal is to make sure that primary and backup owner(s) of K have the<br>

same value for K.<br>

<br>

Simply sending the modification to the backup owner(s) asynchronously<br>

won&#39;t guarantee this, as modification messages might get processed out<br>

of order as they&#39;re OOB !<br>

<br>

<br>

Suggested solution<br>

----------------<br>

The modification RPC needs to be invoked *outside of the lock scope*:<br>

- lock K<br>

- modify K<br>

- unlock K<br>

- send modification to backup owner(s) // outside the lock scope<br>

<br>

The primary owner puts the modification of K into a queue from where a<br>

separate thread/task removes it. The thread then invokes the PUT(K) on<br>

the backup owner(s).<br></blockquote><div><br></div><div>Does the replication thread execute the PUT(k) synchronously, or asynchronously? I assume asynchronously, otherwise the replication thread wouldn&#39;t be able to keep up with the writers.</div>


<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

The queue has the modified keys in FIFO order, so the modifications<br>

arrive at the backup owner(s) in the right order.<br></blockquote><div><br></div><div>Sending the RPC to the backup owners asynchronously, while holding the key lock, would do the same thing.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<br>

This requires that the way GET is implemented changes slightly: instead<br>

of invoking a GET on all owners of K, we only invoke it on the primary<br>

owner, then the next-in-line etc.<br></blockquote><div><br></div><div>What&#39;s the next-in-line owner? A backup won&#39;t have the last version of the data.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<br>

The reason for this is that the backup owner(s) may not yet have<br>

received the modification of K.<br>

<br></blockquote><div><br></div><div>OTOH, if the primary owner dies, we have to ask a backup, and we can lose the modifications not yet replicated by the primary.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


This is a better impl anyway (we discussed this before) becuse it<br>

generates less traffic; in the normal case, all but 1 GET requests are<br>

unnecessary.<br>

<br></blockquote><div><br></div><div>I have a WIP branch for this and it seemed to work fine. Test suite speed seemed about the same, but I didn&#39;t get to do a real performance test.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<br>

<br>

Improvement<br>

-----------<br>

The above solution can be simplified and even made more efficient.<br>

Re-using concepts from IRAC [2], we can simply store the modified *keys*<br>

in the modification queue. The modification replication thread removes<br>

the key, gets the current value and invokes a PUT/REMOVE on the backup<br>

owner(s).<br>

<br>

Even better: a key is only ever added *once*, so if we have [5,2,17,3],<br>

adding key 2 is a no-op because the processing of key 2 (in second<br>

position in the queue) will fetch the up-to-date value anyway !<br>

<br>

<br>

Misc<br>

----<br>

- Could we possibly use total order to send the updates in TO ? TBD (Pedro?)<br>

<br>

<br>

Thoughts ?<br>

<br>

<br>

[1] <a href="https://github.com/belaban/IspnPerfTest" target="_blank">https://github.com/belaban/IspnPerfTest</a><br>

[2]<br>

<a href="https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering" target="_blank">https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering</a><br>

<span class=""><font color="#888888"><br>

--<br>

Bela Ban, JGroups lead (<a href="http://www.jgroups.org" target="_blank">http://www.jgroups.org</a>)<br>

_______________________________________________<br>

infinispan-dev mailing list<br>

<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

<a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

</font></span></blockquote></div><br></div></div>