[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Bela Ban bban at redhat.com
Tue Jul 29 09:56:06 EDT 2014


Hi guys,

sorry for the long post, but I do think I ran into an important problem 
and we need to fix it ... :-)

I've spent the last couple of days running the IspnPerfTest [1] perftest 
on Google Compute Engine (GCE), and I've run into a problem with 
Infinispan. It is a design problem and can be mitigated by sizing thread 
pools correctly, but cannot be eliminated entirely.


Symptom:
--------
IspnPerfTest has every node in a cluster perform 20'000 requests on keys 
in range [1..20000].

80% of the requests are reads and 20% writes.

By default, we have 25 requester threads per node and 100 nodes in a 
cluster, so a total of 2500 requester threads.

The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:

<namedCache name="clusteredCache">
      <clustering mode="distribution">
          <stateTransfer awaitInitialTransfer="true"/>
          <hash numOwners="2"/>
          <sync replTimeout="20000"/>
      </clustering>

      <transaction transactionMode="NON_TRANSACTIONAL"
                   useEagerLocking="true"
                   eagerLockSingleNode="true"/>
      <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
               isolationLevel="READ_COMMITTED" useLockStriping="false"/>
</namedCache>

It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of 
20s. Lock striping is off, so we have 1 lock per key.

When I run the test, I always get errors like those below:

org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock 
after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]!
Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]

and

org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out


Investigation:
------------
When I looked at UNICAST3, I saw a lot of missing messages on the 
receive side and unacked messages on the send side. This caused me to 
look into the (mainly OOB) thread pools and - voilà - they were maxed out !

I learned from Pedro that the Infinispan internal thread pool (with a 
default of 32 threads) can be configured, so I increased it to 300 and 
increased the OOB pools as well.

This mitigated the problem somewhat, but when I increased the requester 
threads to 100, I had the same problem again. Apparently, the Infinispan 
internal thread pool uses a rejection policy of "run" and thus uses the 
JGroups (OOB) thread when exhausted.
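
For illustration, here's what a "run" (caller-runs) rejection policy 
looks like with a plain java.util.concurrent pool; the sizes and queue 
choice are illustrative, not Infinispan's actual configuration:

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CallerRunsDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            32, 32,                       // fixed size, like the internal pool default
            0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<>(),     // hand-off: reject when all workers are busy
            new ThreadPoolExecutor.CallerRunsPolicy());  // the "run" policy

        // Once all 32 workers are busy, execute() runs the task on the
        // *submitting* thread - in Infinispan's case, a JGroups OOB thread.
        pool.execute(() -> System.out.println(
            "ran on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}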

I learned (from Pedro and Mircea) that GETs and PUTs work as follows in 
dist-sync / 2 owners:
- GETs are sent to the primary and backup owners and the first response 
received is returned to the caller. No locks are acquired, so GETs 
shouldn't cause problems.

- A PUT(K) is sent to the primary owner of K
- The primary owner
      (1) locks K
      (2) updates the backup owner synchronously *while holding the lock*
      (3) releases the lock
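
To make the lock scope concrete, here's a minimal sketch of that PUT 
path in plain Java; the class, store and replicateToBackupSync() are 
hypothetical stand-ins, not Infinispan code. Every other writer of K 
queues up on the lock for the full duration of the RPC:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class PrimaryOwnerPutSketch {
    private final Map<Integer, byte[]> store    = new ConcurrentHashMap<>();
    private final Map<Integer, Lock>   keyLocks = new ConcurrentHashMap<>();

    void put(int key, byte[] value) throws InterruptedException {
        Lock lock = keyLocks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();                            // (1) lock K
        try {
            store.put(key, value);
            replicateToBackupSync(key, value);  // (2) sync RPC *while holding the lock*
        } finally {
            lock.unlock();                      // (3) release the lock
        }
    }

    private void replicateToBackupSync(int key, byte[] value)
            throws InterruptedException {
        Thread.sleep(3);  // stand-in for the blocking RPC to the backup owner
    }
}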


Hypothesis
----------
(2) above is done while holding the lock. The sync update of the backup 
owner is done with the lock held to guarantee that the primary and 
backup owner of K have the same values for K.

However, the sync update *inside the lock scope* slows things down (can 
it also lead to deadlocks?); there's the risk that the request is 
dropped due to a full incoming thread pool, that the response is lost 
for the same reason, or that the locking at the backup owner blocks for 
some time.

If we have many threads modifying the same key, then we have a backlog 
of locking work against that key. Say we have 100 requester threads and 
a 100 node cluster. This means that we have 10'000 threads accessing 
keys; with 2'000 writers (20% of 10'000) there's a big chance that some 
writers pick the same key at the same time.

For example, if we have 100 threads accessing key K and it takes 3ms to 
replicate K to the backup owner, then the last of the 100 threads waits 
~300ms before it gets a chance to lock K on the primary owner and 
replicate it as well.

Just a small hiccup in sending the PUT to the primary owner, sending the 
modification to the backup owner, waiting for the response, or GC, and 
the delay will quickly become bigger.


Verification
----------
To verify the above, I set numOwners to 1. This means that the primary 
owner of K does *not* send the modification to the backup owner, it only 
locks K, modifies K and unlocks K again.

I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO 
PROBLEM !

I then increased the requesters to 100, 150 and 200 and the test 
completed flawlessly ! Performance was around *40'000 requests per node 
per sec* on 4-core boxes !


Root cause
---------
*******************
The root cause is the sync RPC of K to the backup owner(s) of K while 
the primary owner holds the lock for K.
*******************

This causes a backlog of threads waiting for the lock and that backlog 
can grow to exhaust the thread pools. First the Infinispan internal 
thread pool, then the JGroups OOB thread pool. The latter causes 
retransmissions to get dropped, which compounds the problem...


Goal
----
The goal is to make sure that primary and backup owner(s) of K have the 
same value for K.

Simply sending the modification to the backup owner(s) asynchronously 
won't guarantee this: since the modification messages are sent as OOB, 
they might get processed out of order !


Suggested solution
----------------
The modification RPC needs to be invoked *outside of the lock scope*:
- lock K
- modify K
- unlock K
- send modification to backup owner(s) // outside the lock scope

The primary owner puts the modification of K into a queue from where a 
separate thread/task removes it. The thread then invokes the PUT(K) on 
the backup owner(s).

The queue has the modified keys in FIFO order, so the modifications 
arrive at the backup owner(s) in the right order.
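
A minimal sketch of that queue plus replication thread, assuming a 
hypothetical Modification record and a putOnBackupOwners() stand-in for 
the RPC:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BackupReplicator implements Runnable {
    record Modification(int key, byte[] value) {}

    private final BlockingQueue<Modification> queue = new LinkedBlockingQueue<>();

    // Called by the primary owner *after* unlocking K; the writer is
    // never blocked on the backup RPC.
    void enqueue(int key, byte[] value) {
        queue.add(new Modification(key, value));
    }

    // Single replication thread: taking from the FIFO queue guarantees
    // the backup owner(s) see the modifications in the order they were
    // applied at the primary.
    @Override
    public void run() {
        try {
            for (;;) {
                Modification mod = queue.take();
                putOnBackupOwners(mod.key(), mod.value());  // sync RPC, outside any lock scope
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void putOnBackupOwners(int key, byte[] value) {
        // stand-in for the RPC to the backup owner(s)
    }
}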

This requires that the way GET is implemented changes slightly: instead 
of invoking a GET on all owners of K, we only invoke it on the primary 
owner, then the next-in-line etc.

The reason for this is that the backup owner(s) may not yet have 
received the modification of K.

This is a better impl anyway (we discussed this before) because it 
generates less traffic; in the normal case, all but one of the GET 
requests are unnecessary.
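
A sketch of such a staggered GET, with the owner list and rpcGet() as 
hypothetical placeholders:

import java.util.List;
import java.util.Optional;

class StaggeredGetSketch {
    // ownersInOrder: primary owner first, then the backup owner(s)
    byte[] get(int key, List<String> ownersInOrder) {
        for (String owner : ownersInOrder) {
            Optional<byte[]> val = rpcGet(owner, key);  // RPC with a bounded timeout
            if (val.isPresent())
                return val.get();
            // timeout / no reply: try the next-in-line owner
        }
        throw new IllegalStateException("no owner answered for key " + key);
    }

    private Optional<byte[]> rpcGet(String owner, int key) {
        return Optional.empty();  // stand-in for the unicast GET RPC
    }
}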



Improvement
-----------
The above solution can be simplified and even made more efficient. 
Re-using concepts from IRAC [2], we can simply store the modified *keys* 
in the modification queue. The modification replication thread removes 
the key, gets the current value and invokes a PUT/REMOVE on the backup 
owner(s).

Even better: a key is only ever added *once*, so if we have [5,2,17,3], 
adding key 2 is a no-op because the processing of key 2 (in second 
position in the queue) will fetch the up-to-date value anyway !
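
A minimal sketch of such a key queue; a LinkedHashSet gives FIFO order 
plus O(1) duplicate suppression (names are illustrative, not taken from 
the IRAC code):

import java.util.Iterator;
import java.util.LinkedHashSet;

class KeyQueue {
    private final LinkedHashSet<Integer> keys = new LinkedHashSet<>();

    // Adding key 2 to [5,2,17,3] is a no-op: re-inserting an element
    // into a LinkedHashSet affects neither membership nor order.
    synchronized void add(int key) {
        keys.add(key);
        notifyAll();
    }

    // Replication thread: take the oldest key, then read its *current*
    // value from the local store and PUT/REMOVE it on the backup owner(s).
    synchronized int take() throws InterruptedException {
        while (keys.isEmpty())
            wait();
        Iterator<Integer> it = keys.iterator();
        int key = it.next();
        it.remove();
        return key;
    }
}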


Misc
----
- Could we possibly use total order (TO) to send the updates ? TBD (Pedro?)


Thoughts ?


[1] https://github.com/belaban/IspnPerfTest
[2] 
https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

