[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Bela Ban bban at redhat.com
Tue Jul 29 10:50:27 EDT 2014



On 29/07/14 16:42, Dan Berindei wrote:
> Have you tried regular optimistic/pessimistic transactions as well?

Yes, in my first impl., but since I'm making only 1 change per request, I
thought a TX was overkill.

> They *should* have less issues with the OOB thread pool than non-tx mode, and
> I'm quite curious how they stack against TO in such a large cluster.

Why would they have fewer issues with the thread pools ? AIUI, a TX 
involves 2 RPCs (PREPARE plus COMMIT/ROLLBACK) compared to one when not 
using TXs. And we're sync anyway...


> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban at redhat.com> wrote:
>
>     Following up on my own email, I changed the config to use Pedro's
>     excellent total order implementation:
>
>     <transaction transactionMode="TRANSACTIONAL"
>     transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
>     useEagerLocking="true" eagerLockSingleNode="true">
>                   <recovery enabled="false"/>
>
>     With 100 nodes and 25 requester threads/node, I did NOT run into any
>     locking issues !
>
>     I could even go up to 200 requester threads/node and the perf was ~
>     7'000-8'000 requests/sec/node. Not too bad !
>
>     This really validates the concept of lockless total-order dissemination
>     of TXs; for the first time, this has been tested on a large(r) scale
>     (previously only on 25 nodes) and IT WORKS ! :-)
>
>     I still believe we should implement my suggested solution for non-TO
>     configs, but short of configuring thread pools of 1000 threads or
>     higher, I hope TO will allow me to finally test a 500 node Infinispan
>     cluster !
>
>
>     On 29/07/14 15:56, Bela Ban wrote:
>      > Hi guys,
>      >
>      > sorry for the long post, but I do think I ran into an important
>     problem
>      > and we need to fix it ... :-)
>      >
>      > I've spent the last couple of days running the IspnPerfTest [1]
>     perftest
>      > on Google Compute Engine (GCE), and I've run into a problem with
>      > Infinispan. It is a design problem and can be mitigated by sizing
>     thread
>      > pools correctly, but cannot be eliminated entirely.
>      >
>      >
>      > Symptom:
>      > --------
>      > IspnPerfTest has every node in a cluster perform 20'000 requests
>     on keys
>      > in range [1..20000].
>      >
>      > 80% of the requests are reads and 20% writes.
>      >
>      > By default, we have 25 requester threads per node and 100 nodes in a
>      > cluster, so a total of 2500 requester threads.
>      >
>      > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
>      >
>      > <namedCache name="clusteredCache">
>      >        <clustering mode="distribution">
>      >            <stateTransfer awaitInitialTransfer="true"/>
>      >            <hash numOwners="2"/>
>      >            <sync replTimeout="20000"/>
>      >        </clustering>
>      >
>      >        <transaction transactionMode="NON_TRANSACTIONAL"
>      > useEagerLocking="true"
>      >             eagerLockSingleNode="true"  />
>      >        <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
>      >                 isolationLevel="READ_COMMITTED"
>     useLockStriping="false" />
>      > </namedCache>
>      >
>      > It has 2 owners, a lock acquisition timeout of 5s and a repl
>     timeout of
>      > 20s. Lock striping is off, so we have 1 lock per key.
>      >
>      > When I run the test, I always get errors like those below:
>      >
>      > org.infinispan.util.concurrent.TimeoutException: Unable to
>     acquire lock
>      > after [10 seconds] on key [19386] for requestor
>     [Thread[invoker-3,5,main]]!
>      > Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
>      >
>      > and
>      >
>      > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
>      >
>      >
>      > Investigation:
>      > ------------
>      > When I looked at UNICAST3, I saw a lot of missing messages on the
>      > receive side and unacked messages on the send side. This caused me to
>      > look into the (mainly OOB) thread pools and - voila - maxed out !
>      >
>      > I learned from Pedro that the Infinispan internal thread pool (with a
>      > default of 32 threads) can be configured, so I increased it to
>     300 and
>      > increased the OOB pools as well.
>      >
>      > This mitigated the problem somewhat, but when I increased the
>     requester
>      > threads to 100, I had the same problem again. Apparently, the
>     Infinispan
>      > internal thread pool uses a rejection policy of "run" and thus
>     uses the
>      > JGroups (OOB) thread when exhausted.
>      >
>      > I learned (from Pedro and Mircea) that GETs and PUTs work as
>     follows in
>      > dist-sync / 2 owners:
>      > - GETs are sent to the primary and backup owners and the first
>     response
>      > received is returned to the caller. No locks are acquired, so GETs
>      > shouldn't cause problems.
>      >
>      > - A PUT(K) is sent to the primary owner of K
>      > - The primary owner
>      >        (1) locks K
>      >        (2) updates the backup owner synchronously *while holding
>     the lock*
>      >        (3) releases the lock
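>      >
>      > In (pseudo-)Java, the primary owner's side of a PUT thus looks roughly
>      > like this (a simplified sketch; the names are made up and are not the
>      > actual Infinispan classes):
>      >
>      >     // current behavior: the sync RPC to the backup(s) is done *inside*
>      >     // the lock scope, so this thread blocks while holding the lock on K
>      >     void handlePut(K key, V value) {
>      >         lock(key);                           // (1) lock K
>      >         try {
>      >             dataContainer.put(key, value);   // apply the change locally
>      >             syncRpcToBackups(key, value);    // (2) update backup owner(s),
>      >                                              //     lock still held
>      >         } finally {
>      >             unlock(key);                     // (3) release the lock
>      >         }
>      >     }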
>      >
>      >
>      > Hypothesis
>      > ----------
>      > (2) above is done while holding the lock. The sync update of the
>     backup
>      > owner is done with the lock held to guarantee that the primary and
>      > backup owner of K have the same values for K.
>      >
>      > However, the sync update *inside the lock scope* slows things down (can
>      > it also lead to deadlocks?); there's a risk that the request is dropped
>      > because the incoming thread pool is full, that the response is lost for
>      > the same reason, or that the locking at the backup owner blocks for some time.
>      >
>      > If we have many threads modifying the same key, then we have a
>     backlog
>      > of locking work against that key. Say we have 100 requester
>     threads and
>      > a 100 node cluster. This means that we have 10'000 threads accessing
>      > keys; with 2'000 writers there's a big chance that some writers
>     pick the
>      > same key at the same time.
>      >
>      > For example, if we have 100 threads accessing key K and it takes
>     3ms to
>      > replicate K to the backup owner, then the last of the 100 threads
>     waits
>      > ~300ms before it gets a chance to lock K on the primary owner and
>      > replicate it as well.
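>      >
>      > (In general: if one sync replication takes t_repl and N threads contend
>      > for the same key, the last writer waits roughly (N-1) * t_repl before it
>      > even gets to lock K; 99 * 3ms gives the ~300ms above.)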
>      >
>      > Just a small hiccup in sending the PUT to the primary owner, sending the
>      > modification to the backup owner, waiting for the response, or GC, and the
>      > delay quickly becomes bigger.
>      >
>      >
>      > Verification
>      > ----------
>      > To verify the above, I set numOwners to 1. This means that the
>     primary
>      > owner of K does *not* send the modification to the backup owner,
>     it only
>      > locks K, modifies K and unlocks K again.
>      >
>      > I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO
>      > PROBLEM !
>      >
>      > I then increased the requesters to 100, 150 and 200 and the test
>      > completed flawlessly ! Performance was around *40'000 requests
>     per node
>      > per sec* on 4-core boxes !
>      >
>      >
>      > Root cause
>      > ---------
>      > *******************
>      > The root cause is the sync RPC of K to the backup owner(s) of K while
>      > the primary owner holds the lock for K.
>      > *******************
>      >
>      > This causes a backlog of threads waiting for the lock and that
>     backlog
>      > can grow to exhaust the thread pools. First the Infinispan internal
>      > thread pool, then the JGroups OOB thread pool. The latter causes
>      > retransmissions to get dropped, which compounds the problem...
>      >
>      >
>      > Goal
>      > ----
>      > The goal is to make sure that primary and backup owner(s) of K
>     have the
>      > same value for K.
>      >
>      > Simply sending the modification to the backup owner(s) asynchronously
>      > won't guarantee this, as modification messages might get
>     processed out
>      > of order as they're OOB !
>      >
>      >
>      > Suggested solution
>      > ----------------
>      > The modification RPC needs to be invoked *outside of the lock scope*:
>      > - lock K
>      > - modify K
>      > - unlock K
>      > - send modification to backup owner(s) // outside the lock scope
>      >
>      > The primary owner puts the modification of K into a queue from
>     where a
>      > separate thread/task removes it. The thread then invokes the
>     PUT(K) on
>      > the backup owner(s).
>      >
>      > The queue has the modified keys in FIFO order, so the modifications
>      > arrive at the backup owner(s) in the right order.
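>      >
>      > Sketched in (pseudo-)Java (again simplified, made-up names, no error
>      > handling or shutdown logic):
>      >
>      >     // the modification RPC now happens *outside* the lock scope
>      >     BlockingQueue<Modification> replQueue = new LinkedBlockingQueue<>();
>      >
>      >     void handlePut(K key, V value) {
>      >         lock(key);
>      >         try {
>      >             dataContainer.put(key, value);   // modify K locally
>      >         } finally {
>      >             unlock(key);                     // unlock *before* replicating
>      >         }
>      >         replQueue.add(new Modification(key, value)); // FIFO order preserved
>      >     }
>      >
>      >     // separate replication thread/task on the primary owner
>      >     void replicationLoop() throws InterruptedException {
>      >         for (;;) {
>      >             Modification mod = replQueue.take();
>      >             // invoke the PUT on the backup owner(s); no lock is held here
>      >             syncRpcToBackups(mod.key, mod.value);
>      >         }
>      >     }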
>      >
>      > This requires that the way GET is implemented changes slightly:
>     instead
>      > of invoking a GET on all owners of K, we only invoke it on the
>     primary
>      > owner, then the next-in-line etc.
>      >
>      > The reason for this is that the backup owner(s) may not yet have
>      > received the modification of K.
>      >
>      > This is a better impl anyway (we discussed this before) because it
>      > generates less traffic; in the normal case, all but 1 of the GET requests
>      > are unnecessary.
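>      >
>      > E.g. (sketch, same caveats as above):
>      >
>      >     // GET: ask the primary owner first; only fall back to the
>      >     // next-in-line owner if the primary doesn't answer in time
>      >     V handleGet(K key) {
>      >         for (Address owner : ownersInOrder(key)) { // primary, then backup(s)
>      >             try {
>      >                 return remoteGet(owner, key);      // one RPC at a time
>      >             } catch (TimeoutException e) {
>      >                 // owner didn't reply in time, try the next one
>      >             }
>      >         }
>      >         throw new TimeoutException("no owner of " + key + " replied");
>      >     }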
>      >
>      >
>      >
>      > Improvement
>      > -----------
>      > The above solution can be simplified and even made more efficient.
>      > Re-using concepts from IRAC [2], we can simply store the modified
>     *keys*
>      > in the modification queue. The modification replication thread
>     removes
>      > the key, gets the current value and invokes a PUT/REMOVE on the
>     backup
>      > owner(s).
>      >
>      > Even better: a key is only ever added *once*, so if we have
>     [5,2,17,3],
>      > adding key 2 is a no-op because the processing of key 2 (in second
>      > position in the queue) will fetch the up-to-date value anyway !
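>      >
>      > A sketch of that variant (made-up names again; a real impl would have to
>      > deal with the races this ignores):
>      >
>      >     // keep only the modified *keys*, each key queued at most once
>      >     Set<K> queued = ConcurrentHashMap.newKeySet();
>      >     BlockingQueue<K> order = new LinkedBlockingQueue<>();
>      >
>      >     void afterLocalModification(K key) {
>      >         if (queued.add(key))         // no-op if the key is already queued
>      >             order.add(key);
>      >     }
>      >
>      >     void replicationLoop() throws InterruptedException {
>      >         for (;;) {
>      >             K key = order.take();
>      >             queued.remove(key);
>      >             V current = dataContainer.get(key);  // read the *current* value
>      >             if (current == null)
>      >                 removeOnBackups(key);            // key was removed meanwhile
>      >             else
>      >                 putOnBackups(key, current);
>      >         }
>      >     }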
>      >
>      >
>      > Misc
>      > ----
>      > - Could we possibly use total order to send the updates in TO ?
>     TBD (Pedro?)
>      >
>      >
>      > Thoughts ?
>      >
>      >
>      > [1] https://github.com/belaban/IspnPerfTest
>      > [2]
>      >
>     https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
>      >
>
>     --
>     Bela Ban, JGroups lead (http://www.jgroups.org)
>     _______________________________________________
>     infinispan-dev mailing list
>     infinispan-dev at lists.jboss.org
>     https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

