<div dir="ltr">Have you tried regular optimistic/pessimistic transactions as well? They *should* have fewer issues with the OOB thread pool than non-tx mode, and I'm quite curious how they stack up against TO in such a large cluster.</div>
<div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <span dir="ltr"><<a href="mailto:bban@redhat.com" target="_blank">bban@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Following up on my own email, I changed the config to use Pedro's<br>
excellent total order implementation:<br>
<br>
<transaction transactionMode="TRANSACTIONAL"<br>
transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"<br>
useEagerLocking="true" eagerLockSingleNode="true"><br>
<recovery enabled="false"/><br>
<br>
With 100 nodes and 25 requester threads/node, I did NOT run into any<br>
locking issues !<br>
<br>
I could even go up to 200 requester threads/node and the perf was ~<br>
7'000-8'000 requests/sec/node. Not too bad !<br>
<br>
This really validates the concept of lockless total-order dissemination<br>
of TXs; for the first time, this has been tested on a large(r) scale<br>
(previously only on 25 nodes) and IT WORKS ! :-)<br>
<br>
I still believe we should implement my suggested solution for non-TO<br>
configs, but short of configuring thread pools of 1000 threads or<br>
higher, I hope TO will allow me to finally test a 500 node Infinispan<br>
cluster !<br>
<div class="HOEnZb"><div class="h5"><br>
<br>
On 29/07/14 15:56, Bela Ban wrote:<br>
> Hi guys,<br>
><br>
> sorry for the long post, but I do think I ran into an important problem<br>
> and we need to fix it ... :-)<br>
><br>
> I've spent the last couple of days running the IspnPerfTest [1] perftest<br>
> on Google Compute Engine (GCE), and I've run into a problem with<br>
> Infinispan. It is a design problem and can be mitigated by sizing thread<br>
> pools correctly, but cannot be eliminated entirely.<br>
><br>
><br>
> Symptom:<br>
> --------<br>
> IspnPerfTest has every node in a cluster perform 20'000 requests on keys<br>
> in range [1..20000].<br>
><br>
> 80% of the requests are reads and 20% writes.<br>
><br>
> By default, we have 25 requester threads per node and 100 nodes in a<br>
> cluster, so a total of 2500 requester threads.<br>
><br>
> The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:<br>
><br>
> <namedCache name="clusteredCache"><br>
> <clustering mode="distribution"><br>
> <stateTransfer awaitInitialTransfer="true"/><br>
> <hash numOwners="2"/><br>
> <sync replTimeout="20000"/><br>
> </clustering><br>
><br>
> <transaction transactionMode="NON_TRANSACTIONAL"<br>
> useEagerLocking="true"<br>
> eagerLockSingleNode="true" /><br>
> <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"<br>
> isolationLevel="READ_COMMITTED" useLockStriping="false" /><br>
> </namedCache><br>
><br>
> It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of<br>
> 20s. Lock striping is off, so we have 1 lock per key.<br>
><br>
> When I run the test, I always get errors like those below:<br>
><br>
> org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock<br>
> after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]!<br>
> Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]<br>
><br>
> and<br>
><br>
> org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out<br>
><br>
><br>
> Investigation:<br>
> ------------<br>
> When I looked at UNICAST3, I saw a lot of missing messages on the<br>
> receive side and unacked messages on the send side. This caused me to<br>
> look into the (mainly OOB) thread pools and - voila - maxed out !<br>
><br>
> I learned from Pedro that the Infinispan internal thread pool (with a<br>
> default of 32 threads) can be configured, so I increased it to 300 and<br>
> increased the OOB pools as well.<br>
><br>
> This mitigated the problem somewhat, but when I increased the requester<br>
> threads to 100, I had the same problem again. Apparently, the Infinispan<br>
> internal thread pool uses a rejection policy of "run" and thus uses the<br>
> JGroups (OOB) thread when exhausted.<br>
><br>
> I learned (from Pedro and Mircea) that GETs and PUTs work as follows in<br>
> dist-sync / 2 owners:<br>
> - GETs are sent to the primary and backup owners and the first response<br>
> received is returned to the caller. No locks are acquired, so GETs<br>
> shouldn't cause problems.<br>
><br>
> - A PUT(K) is sent to the primary owner of K<br>
> - The primary owner<br>
> (1) locks K<br>
> (2) updates the backup owner synchronously *while holding the lock*<br>
> (3) releases the lock<br>
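>
> The PUT path above can be sketched as follows (a simplified, single-JVM simulation; the class and method names, and the map standing in for the backup owner, are illustrative only, not Infinispan's actual API):
>
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the primary-owner PUT path described above. The "backup" map
// stands in for the sync RPC to the backup owner; a single lock stands in
// for the per-key lock.
class PrimaryOwner {
    final Map<String, String> store  = new ConcurrentHashMap<>();
    final Map<String, String> backup = new ConcurrentHashMap<>();
    final ReentrantLock lock = new ReentrantLock();

    void put(String key, String value) {
        lock.lock();                       // (1) lock K
        try {
            store.put(key, value);
            replicateToBackup(key, value); // (2) sync update *while holding the lock*
        } finally {
            lock.unlock();                 // (3) release the lock
        }
    }

    void replicateToBackup(String key, String value) {
        backup.put(key, value); // real impl: blocking RPC that can take milliseconds
    }
}
```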
><br>
><br>
> Hypothesis<br>
> ----------<br>
> (2) above is done while holding the lock. The sync update of the backup<br>
> owner is done with the lock held to guarantee that the primary and<br>
> backup owner of K have the same values for K.<br>
><br>
> However, the sync update *inside the lock scope* slows things down (can<br>
> it also lead to deadlocks?); there's the risk that the request is<br>
> dropped due to a full incoming thread pool, or that the response is not<br>
> received for the same reason, or that the locking at the backup owner<br>
> blocks for some time.<br>
><br>
> If we have many threads modifying the same key, then we have a backlog<br>
> of locking work against that key. Say we have 100 requester threads and<br>
> a 100 node cluster. This means that we have 10'000 threads accessing<br>
> keys; with 2'000 writers there's a big chance that some writers pick the<br>
> same key at the same time.<br>
><br>
> For example, if we have 100 threads accessing key K and it takes 3ms to<br>
> replicate K to the backup owner, then the last of the 100 threads waits<br>
> ~300ms before it gets a chance to lock K on the primary owner and<br>
> replicate it as well.<br>
><br>
> Just a small hiccup in sending the PUT to the primary owner, sending the<br>
> modification to the backup owner, waiting for the response, or a GC<br>
> pause, and the delay quickly grows.<br>
><br>
><br>
> Verification<br>
> ----------<br>
> To verify the above, I set numOwners to 1. This means that the primary<br>
> owner of K does *not* send the modification to the backup owner, it only<br>
> locks K, modifies K and unlocks K again.<br>
><br>
> I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO<br>
> PROBLEM !<br>
><br>
> I then increased the requesters to 100, 150 and 200 and the test<br>
> completed flawlessly ! Performance was around *40'000 requests per node<br>
> per sec* on 4-core boxes !<br>
><br>
><br>
> Root cause<br>
> ---------<br>
> *******************<br>
> The root cause is the sync RPC of K to the backup owner(s) of K while<br>
> the primary owner holds the lock for K.<br>
> *******************<br>
><br>
> This causes a backlog of threads waiting for the lock and that backlog<br>
> can grow to exhaust the thread pools. First the Infinispan internal<br>
> thread pool, then the JGroups OOB thread pool. The latter causes<br>
> retransmissions to get dropped, which compounds the problem...<br>
><br>
><br>
> Goal<br>
> ----<br>
> The goal is to make sure that primary and backup owner(s) of K have the<br>
> same value for K.<br>
><br>
> Simply sending the modification to the backup owner(s) asynchronously<br>
> won't guarantee this, as modification messages might get processed out<br>
> of order as they're OOB !<br>
><br>
><br>
> Suggested solution<br>
> ----------------<br>
> The modification RPC needs to be invoked *outside of the lock scope*:<br>
> - lock K<br>
> - modify K<br>
> - unlock K<br>
> - send modification to backup owner(s) // outside the lock scope<br>
><br>
> The primary owner puts the modification of K into a queue from where a<br>
> separate thread/task removes it. The thread then invokes the PUT(K) on<br>
> the backup owner(s).<br>
><br>
> The queue has the modified keys in FIFO order, so the modifications<br>
> arrive at the backup owner(s) in the right order.<br>
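>
> In code, the queue side of this could look roughly as follows (a sketch only; the names and the list standing in for the backup RPC are made up, and the real design would have a dedicated thread looping on the drain step):
>
```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of the suggested solution: the primary owner modifies K inside the
// lock scope but replicates *outside* it, via a FIFO queue drained by a
// separate replication thread.
class BackupReplicationQueue {
    record Modification(int key, String value) {}

    private final Queue<Modification> queue = new ArrayDeque<>();
    final List<Modification> sentToBackup = new ArrayList<>(); // stands in for the backup RPC

    // Called by the primary owner *after* unlocking K.
    synchronized void enqueue(int key, String value) {
        queue.add(new Modification(key, value)); // FIFO preserves order
    }

    // One drain step; the real design loops on this in a dedicated thread.
    synchronized boolean drainOnce() {
        Modification m = queue.poll();
        if (m == null) return false;
        sentToBackup.add(m); // real impl: PUT(K) RPC to backup owner(s), no lock held
        return true;
    }
}
```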
><br>
> This requires that the way GET is implemented changes slightly: instead<br>
> of invoking a GET on all owners of K, we only invoke it on the primary<br>
> owner, then the next-in-line etc.<br>
><br>
> The reason for this is that the backup owner(s) may not yet have<br>
> received the modification of K.<br>
><br>
> This is a better impl anyway (we discussed this before) because it<br>
> generates less traffic; in the normal case, all but one of the GET<br>
> requests are unnecessary.<br>
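>
> A minimal sketch of the changed GET (the Owner interface and the empty-Optional-as-timeout convention are stand-ins for illustration, not the real RPC layer):
>
```java
import java.util.List;
import java.util.Optional;

// Ask the primary owner first; fall back to the next-in-line owner only if
// the primary doesn't answer, instead of broadcasting to all owners.
interface Owner {
    Optional<String> get(int key); // empty = no reply (e.g. timeout)
}

class OrderedGet {
    static String get(int key, List<Owner> ownersInOrder) {
        for (Owner o : ownersInOrder) {        // primary first, then backup(s)
            Optional<String> v = o.get(key);
            if (v.isPresent()) return v.get(); // first answer wins; later owners never asked
        }
        return null;                           // no owner answered
    }
}
```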
><br>
><br>
><br>
> Improvement<br>
> -----------<br>
> The above solution can be simplified and even made more efficient.<br>
> Re-using concepts from IRAC [2], we can simply store the modified *keys*<br>
> in the modification queue. The modification replication thread removes<br>
> the key, gets the current value and invokes a PUT/REMOVE on the backup<br>
> owner(s).<br>
><br>
> Even better: a key is only ever added *once*, so if we have [5,2,17,3],<br>
> adding key 2 is a no-op because the processing of key 2 (in second<br>
> position in the queue) will fetch the up-to-date value anyway !<br>
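>
> A LinkedHashSet gives exactly these semantics, FIFO order plus add-is-a-no-op-for-duplicates (sketch only; the class name is made up):
>
```java
import java.util.LinkedHashSet;
import java.util.Set;

// Improved queue from above: stores modified *keys* only; adding a key that
// is already queued is a no-op, because the replication thread fetches the
// up-to-date value when it processes that key anyway.
class KeyModificationQueue {
    private final Set<Integer> keys = new LinkedHashSet<>(); // FIFO + uniqueness

    synchronized void add(int key) {
        keys.add(key); // no-op if key is already present
    }

    synchronized Integer poll() { // next key to replicate, or null if empty
        var it = keys.iterator();
        if (!it.hasNext()) return null;
        Integer k = it.next();
        it.remove();
        return k;
    }

    synchronized int size() { return keys.size(); }
}
```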
><br>
><br>
> Misc<br>
> ----<br>
> - Could we possibly use total order to send the updates in TO ? TBD (Pedro?)<br>
><br>
><br>
> Thoughts ?<br>
><br>
><br>
> [1] <a href="https://github.com/belaban/IspnPerfTest" target="_blank">https://github.com/belaban/IspnPerfTest</a><br>
> [2]<br>
> <a href="https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering" target="_blank">https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering</a><br>
><br>
<br>
--<br>
Bela Ban, JGroups lead (<a href="http://www.jgroups.org" target="_blank">http://www.jgroups.org</a>)<br>
_______________________________________________<br>
infinispan-dev mailing list<br>
<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>
<a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>
</div></div></blockquote></div><br></div>