[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Dan Berindei dan.berindei at gmail.com
Tue Jul 29 10:39:04 EDT 2014


On Tue, Jul 29, 2014 at 4:56 PM, Bela Ban <bban at redhat.com> wrote:

> Hi guys,
>
> sorry for the long post, but I do think I ran into an important problem
> and we need to fix it ... :-)
>
> I've spent the last couple of days running the IspnPerfTest [1] perftest
> on Google Compute Engine (GCE), and I've run into a problem with
> Infinispan. It is a design problem and can be mitigated by sizing thread
> pools correctly, but cannot be eliminated entirely.
>
>
> Symptom:
> --------
> IspnPerfTest has every node in a cluster perform 20'000 requests on keys
> in range [1..20000].
>
> 80% of the requests are reads and 20% writes.
>
> By default, we have 25 requester threads per node and 100 nodes in a
> cluster, so a total of 2500 requester threads.
>
> The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
>
> <namedCache name="clusteredCache">
>       <clustering mode="distribution">
>           <stateTransfer awaitInitialTransfer="true"/>
>           <hash numOwners="2"/>
>           <sync replTimeout="20000"/>
>       </clustering>
>
>       <transaction transactionMode="NON_TRANSACTIONAL"
> useEagerLocking="true"
>            eagerLockSingleNode="true"  />
>       <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
>                isolationLevel="READ_COMMITTED" useLockStriping="false" />
> </namedCache>
>
> It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of
> 20s. Lock striping is off, so we have 1 lock per key.
>
> When I run the test, I always get errors like those below:
>
> org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock
> after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]!
> Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
>
> and
>
> org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
>
>
> Investigation:
> ------------
> When I looked at UNICAST3, I saw a lot of missing messages on the
> receive side and unacked messages on the send side. This caused me to
> look into the (mainly OOB) thread pools and - voila - maxed out !
>
> I learned from Pedro that the Infinispan internal thread pool (with a
> default of 32 threads) can be configured, so I increased it to 300 and
> increased the OOB pools as well.
>
> This mitigated the problem somewhat, but when I increased the requester
> threads to 100, I had the same problem again. Apparently, the Infinispan
> internal thread pool uses a rejection policy of "run" and thus uses the
> JGroups (OOB) thread when exhausted.
>
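For anyone unfamiliar with a "run" rejection policy: this is the standard java.util.concurrent CallerRunsPolicy behaviour, where a task submitted to an exhausted pool is executed on the submitting thread itself. A minimal, self-contained JDK illustration (this is just the generic mechanism, not Infinispan's actual executor wiring):

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    public static void main(String[] args) {
        // A small bounded pool with no queue and a "caller runs" rejection policy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 60, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Submit more tasks than the pool can hold: the overflow tasks run on the
        // submitting thread (in Infinispan's case that caller is a JGroups OOB thread).
        for (int i = 0; i < 5; i++) {
            pool.execute(() -> {
                System.out.println("running on " + Thread.currentThread().getName());
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        pool.shutdown();
    }
}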

We can't use another rejection policy in the remote executor because the
message won't be re-delivered by JGroups, and we can't use a queue either.

Pedro is working on ISPN-2849, which should help with the remote/OOB thread
pool exhaustion. It is a bit tricky, though, because our interceptors
assume they will be able to access stack variables after replication.


>
> I learned (from Pedro and Mircea) that GETs and PUTs work as follows in
> dist-sync / 2 owners:
> - GETs are sent to the primary and backup owners and the first response
> received is returned to the caller. No locks are acquired, so GETs
> shouldn't cause problems.
>
> - A PUT(K) is sent to the primary owner of K
> - The primary owner
>       (1) locks K
>       (2) updates the backup owner synchronously *while holding the lock*
>       (3) releases the lock
>
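To make the steps above concrete, here is a minimal sketch of that primary-owner write path. The class and field names are made up for illustration; this is not the actual Infinispan interceptor code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class PrimaryOwnerWriteSketch {
    interface BackupReplicator { void replicateSync(Object key, Object value); }

    // 1 lock per key, matching useLockStriping="false" in the config above.
    private final Map<Object, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<Object, Object> data = new ConcurrentHashMap<>();
    private final BackupReplicator backup;

    PrimaryOwnerWriteSketch(BackupReplicator backup) { this.backup = backup; }

    void put(Object key, Object value) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();                          // (1) lock K
        try {
            data.put(key, value);             // update the local copy
            backup.replicateSync(key, value); // (2) sync RPC to the backup *while holding the lock*
        } finally {
            lock.unlock();                    // (3) release the lock
        }
    }
}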
>
> Hypothesis
> ----------
> (2) above is done while holding the lock. The sync update of the backup
> owner is done with the lock held to guarantee that the primary and
> backup owner of K have the same values for K.
>

And something else: if the primary owner reports that a write was
successful and then dies, a read should find the updated value on the
backup owner(s).


> However, the sync update *inside the lock scope* slows things down (can
> it also lead to deadlocks?). There's a risk that the request is dropped
> because the incoming thread pool is full, that the response is lost for
> the same reason, or that locking at the backup owner blocks for some
> time.
>

There is no locking on the backup owner, so there are no deadlocks.
There is indeed a risk of the OOB/remote thread pools being full.


>
> If we have many threads modifying the same key, then we have a backlog
> of locking work against that key. Say we have 100 requester threads and
> a 100 node cluster. This means that we have 10'000 threads accessing
> keys; with 2'000 writers there's a big chance that some writers pick the
> same key at the same time.
>
> For example, if we have 100 threads accessing key K and it takes 3ms to
> replicate K to the backup owner, then the last of the 100 threads waits
> ~300ms before it gets a chance to lock K on the primary owner and
> replicate it as well.
>
> Just a small hiccup in sending the PUT to the primary owner, sending the
> modification to the backup owner, waiting for the response, or a GC
> pause, and the delay quickly grows.
>
>
> Verification
> ----------
> To verify the above, I set numOwners to 1. This means that the primary
> owner of K does *not* send the modification to the backup owner, it only
> locks K, modifies K and unlocks K again.
>
> I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO
> PROBLEM !
>
> I then increased the requesters to 100, 150 and 200 and the test
> completed flawlessly ! Performance was around *40'000 requests per node
> per sec* on 4-core boxes !
>
>
> Root cause
> ---------
> *******************
> The root cause is the sync RPC of K to the backup owner(s) of K while
> the primary owner holds the lock for K.
> *******************
>
> This causes a backlog of threads waiting for the lock and that backlog
> can grow to exhaust the thread pools. First the Infinispan internal
> thread pool, then the JGroups OOB thread pool. The latter causes
> retransmissions to get dropped, which compounds the problem...
>
>
> Goal
> ----
> The goal is to make sure that primary and backup owner(s) of K have the
> same value for K.
>
> Simply sending the modification to the backup owner(s) asynchronously
> won't guarantee this, as modification messages might get processed out
> of order as they're OOB !
>
>
> Suggested solution
> ----------------
> The modification RPC needs to be invoked *outside of the lock scope*:
> - lock K
> - modify K
> - unlock K
> - send modification to backup owner(s) // outside the lock scope
>
> The primary owner puts the modification of K into a queue from where a
> separate thread/task removes it. The thread then invokes the PUT(K) on
> the backup owner(s).
>
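A rough sketch of what such a queue plus replication thread could look like. The names and the single-thread design are assumptions for illustration only, and whether the backup RPC itself is invoked synchronously or asynchronously is left open here:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

class QueuedBackupReplication {
    static final class Update {
        final Object key;
        final Object value;
        Update(Object key, Object value) { this.key = key; this.value = value; }
    }

    private final BlockingQueue<Update> queue = new LinkedBlockingQueue<>();

    QueuedBackupReplication(BiConsumer<Object, Object> backupPut) {
        // A single replication thread drains the queue, so updates reach the
        // backup owner(s) in the same FIFO order in which they were enqueued.
        Thread replicator = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Update u = queue.take();
                    backupPut.accept(u.key, u.value); // PUT(K) on the backup owner(s)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "backup-replicator");
        replicator.setDaemon(true);
        replicator.start();
    }

    // Called by the primary owner after "lock K, modify K, unlock K",
    // i.e. outside of the lock scope.
    void enqueue(Object key, Object value) {
        queue.add(new Update(key, value));
    }
}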

Does the replication thread execute the PUT(k) synchronously, or
asynchronously? I assume asynchronously, otherwise the replication thread
wouldn't be able to keep up with the writers.


>
> The queue has the modified keys in FIFO order, so the modifications
> arrive at the backup owner(s) in the right order.
>

Sending the RPC to the backup owners asynchronously, while holding the key
lock, would do the same thing.


>
> This requires a slight change to the way GET is implemented: instead of
> invoking a GET on all owners of K, we only invoke it on the primary
> owner, then the next-in-line, etc.
>
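For reference, a sketch of such a "staggered" GET: ask only the primary owner first and fall back to the next owner in the list only on failure or timeout. The per-owner timeout and the remoteGet function are assumptions for illustration:

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

class StaggeredGet {
    // Ask the primary owner first; only if it fails or times out, ask the
    // next-in-line owner, and so on.
    static Object get(List<String> ownersInOrder,
                      Function<String, Object> remoteGet,
                      long perOwnerTimeoutMillis) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (String owner : ownersInOrder) {
                Future<Object> response = exec.submit(() -> remoteGet.apply(owner));
                try {
                    return response.get(perOwnerTimeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    response.cancel(true);
                    last = e; // fall through to the next owner
                }
            }
            throw last != null ? last : new TimeoutException("no owner responded");
        } finally {
            exec.shutdownNow();
        }
    }
}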

What's the next-in-line owner? A backup won't have the last version of the
data.


> The reason for this is that the backup owner(s) may not yet have
> received the modification of K.
>
>
OTOH, if the primary owner dies, we have to ask a backup, and we can lose
the modifications not yet replicated by the primary.


> This is a better impl anyway (we discussed this before) because it
> generates less traffic; in the normal case, all but one of the GET
> requests are unnecessary.
>
>
I have a WIP branch for this and it seemed to work fine. Test suite speed
seemed about the same, but I didn't get to do a real performance test.


>
>
> Improvement
> -----------
> The above solution can be simplified and even made more efficient.
> Re-using concepts from IRAC [2], we can simply store the modified *keys*
> in the modification queue. The modification replication thread removes
> the key, gets the current value and invokes a PUT/REMOVE on the backup
> owner(s).
>
> Even better: a key is only ever added *once*, so if we have [5,2,17,3],
> adding key 2 is a no-op because the processing of key 2 (in second
> position in the queue) will fetch the up-to-date value anyway !
>
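A sketch of such a keys-only queue, where re-adding an already-queued key is a no-op. A LinkedHashSet keeps both FIFO order and uniqueness; this is purely illustrative and IRAC's real data structures may differ:

import java.util.Iterator;
import java.util.LinkedHashSet;

class ModifiedKeyQueue {
    private final LinkedHashSet<Object> keys = new LinkedHashSet<>();

    // Called by the primary owner after the local update. If the key is already
    // queued (e.g. key 2 in [5, 2, 17, 3]), this is a no-op: the pending pass
    // will pick up the latest value anyway.
    synchronized void add(Object key) {
        if (keys.add(key)) {
            notifyAll();
        }
    }

    // Called by the replication thread: remove the oldest key, then look up its
    // current value and PUT/REMOVE it on the backup owner(s).
    synchronized Object takeOldest() throws InterruptedException {
        while (keys.isEmpty()) {
            wait();
        }
        Iterator<Object> it = keys.iterator();
        Object oldest = it.next();
        it.remove();
        return oldest;
    }
}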
>
> Misc
> ----
> - Could we possibly use total order (TO) to send the updates? TBD
> (Pedro?)
>
>
> Thoughts ?
>
>
> [1] https://github.com/belaban/IspnPerfTest
> [2]
>
> https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>