[infinispan-dev] DIST-SYNC, put(), a problem and a solution
Dan Berindei
dan.berindei at gmail.com
Tue Jul 29 16:29:29 EDT 2014
On Tue, Jul 29, 2014 at 5:50 PM, Bela Ban <bban at redhat.com> wrote:
>
>
> On 29/07/14 16:42, Dan Berindei wrote:
> > Have you tried regular optimistic/pessimistic transactions as well?
>
> Yes, in my first impl., but since I'm making only 1 change per request,
> I thought a TX was overkill.
>
You are using txs with TO, right?
>
> > They *should* have fewer issues with the OOB thread pool than non-tx
> > mode, and I'm quite curious how they stack up against TO in such a
> > large cluster.
>
> Why would they have fewer issues with the thread pools ? AIUI, a TX
> involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not using
> TXs. And we're sync anyway...
>
>
Actually, it's 2 sync RPCs (prepare + commit) and 1 async RPC (tx
completion notification). But we only keep the user thread busy across
RPCs (unless L1 is enabled), so we actually need fewer OOB/remote threads.
>
> > On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban at redhat.com> wrote:
> >
> > Following up on my own email, I changed the config to use Pedro's
> > excellent total order implementation:
> >
> > <transaction transactionMode="TRANSACTIONAL"
> > transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
> > useEagerLocking="true" eagerLockSingleNode="true">
> > <recovery enabled="false"/>
> >
> > With 100 nodes and 25 requester threads/node, I did NOT run into any
> > locking issues !
> >
> > I could even go up to 200 requester threads/node and the perf was
> > ~7'000-8'000 requests/sec/node. Not too bad !
> >
> > This really validates the concept of lockless total-order
> > dissemination of TXs; for the first time, this has been tested on a
> > large(r) scale (previously only on 25 nodes) and IT WORKS ! :-)
> >
> > I still believe we should implement my suggested solution for non-TO
> > configs, but short of configuring thread pools of 1000 threads or
> > higher, I hope TO will allow me to finally test a 500 node Infinispan
> > cluster !
> >
> >
> > On 29/07/14 15:56, Bela Ban wrote:
> > > Hi guys,
> > >
> > > sorry for the long post, but I do think I ran into an important
> > > problem and we need to fix it ... :-)
> > >
> > > I've spent the last couple of days running the IspnPerfTest [1]
> > > perftest on Google Compute Engine (GCE), and I've run into a problem
> > > with Infinispan. It is a design problem and can be mitigated by
> > > sizing thread pools correctly, but cannot be eliminated entirely.
> > >
> > >
> > > Symptom:
> > > --------
> > > IspnPerfTest has every node in a cluster perform 20'000 requests on
> > > keys in range [1..20000].
> > >
> > > 80% of the requests are reads and 20% writes.
> > >
> > > By default, we have 25 requester threads per node and 100 nodes in
> > > a cluster, so a total of 2500 requester threads.
> > >
> > > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> > >
> > >     <namedCache name="clusteredCache">
> > >         <clustering mode="distribution">
> > >             <stateTransfer awaitInitialTransfer="true"/>
> > >             <hash numOwners="2"/>
> > >             <sync replTimeout="20000"/>
> > >         </clustering>
> > >
> > >         <transaction transactionMode="NON_TRANSACTIONAL"
> > >                      useEagerLocking="true"
> > >                      eagerLockSingleNode="true"/>
> > >         <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
> > >                  isolationLevel="READ_COMMITTED"
> > >                  useLockStriping="false"/>
> > >     </namedCache>
> > >
> > > It has 2 owners, a lock acquisition timeout of 5s and a repl
> > > timeout of 20s. Lock striping is off, so we have 1 lock per key.
> > >
> > > When I run the test, I always get errors like those below:
> > >
> > > org.infinispan.util.concurrent.TimeoutException: Unable to acquire
> > > lock after [10 seconds] on key [19386] for requestor
> > > [Thread[invoker-3,5,main]]! Lock held by
> > > [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> > >
> > > and
> > >
> > > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
> > >
> > >
> > > Investigation:
> > > ------------
> > > When I looked at UNICAST3, I saw a lot of missing messages on the
> > > receive side and unacked messages on the send side. This caused me
> > > to look into the (mainly OOB) thread pools and - voila - maxed out !
> > >
> > > I learned from Pedro that the Infinispan internal thread pool (with
> > > a default of 32 threads) can be configured, so I increased it to 300
> > > and increased the OOB pools as well.
> > >
> > > This mitigated the problem somewhat, but when I increased the
> > > requester threads to 100, I had the same problem again. Apparently,
> > > the Infinispan internal thread pool uses a rejection policy of "run"
> > > and thus uses the JGroups (OOB) thread when exhausted.
> > >
> > > I learned (from Pedro and Mircea) that GETs and PUTs work as
> > > follows in dist-sync / 2 owners:
> > > - GETs are sent to the primary and backup owners and the first
> > >   response received is returned to the caller. No locks are
> > >   acquired, so GETs shouldn't cause problems.
> > >
> > > - A PUT(K) is sent to the primary owner of K
> > > - The primary owner
> > >   (1) locks K
> > >   (2) updates the backup owner synchronously *while holding the lock*
> > >   (3) releases the lock
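> > >
> > > In (pseudo-)Java, the primary-owner path looks roughly like this;
> > > lockFor(), dataContainer, rpc, backupOwner() and PutCommand are
> > > hypothetical stand-ins, not the real Infinispan API:
> > >
> > >     // sketch of the primary-owner PUT path described above
> > >     void putOnPrimary(K key, V value) {
> > >         Lock lock = lockFor(key);          // 1 lock per key
> > >         lock.lock();                       // (1) lock K
> > >         try {
> > >             dataContainer.put(key, value);
> > >             // (2) sync RPC to the backup owner *while holding the lock*
> > >             rpc.invokeSync(backupOwner(key),
> > >                            new PutCommand<>(key, value));
> > >         } finally {
> > >             lock.unlock();                 // (3) release the lock
> > >         }
> > >     }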
> > >
> > >
> > > Hypothesis
> > > ----------
> > > (2) above is done while holding the lock. The sync update of the
> > > backup owner is done with the lock held to guarantee that the
> > > primary and backup owner of K have the same values for K.
> > >
> > > However, the sync update *inside the lock scope* slows things down
> > > (can it also lead to deadlocks?); there's the risk that the request
> > > is dropped due to a full incoming thread pool, or that the response
> > > is not received for the same reason, or that the locking at the
> > > backup owner blocks for some time.
> > >
> > > If we have many threads modifying the same key, then we have a
> > > backlog of locking work against that key. Say we have 100 requester
> > > threads and a 100 node cluster. This means that we have 10'000
> > > threads accessing keys; with 2'000 of them being writers, there's a
> > > big chance that some writers pick the same key at the same time.
> > >
> > > For example, if we have 100 threads accessing key K and it takes
> > > 3ms to replicate K to the backup owner, then the last of the 100
> > > threads waits ~300ms before it gets a chance to lock K on the
> > > primary owner and replicate it as well.
> > >
> > > Just a small hiccup in sending the PUT to the primary owner,
> > > sending the modification to the backup owner, waiting for the
> > > response, or GC, and the delay will quickly become bigger.
> > >
> > >
> > > Verification
> > > ----------
> > > To verify the above, I set numOwners to 1. This means that the
> > > primary owner of K does *not* send the modification to the backup
> > > owner; it only locks K, modifies K and unlocks K again.
> > >
> > > I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and
> > > NO PROBLEM !
> > >
> > > I then increased the requesters to 100, 150 and 200 and the test
> > > completed flawlessly ! Performance was around *40'000 requests per
> > > node per sec* on 4-core boxes !
> > >
> > >
> > > Root cause
> > > ---------
> > > *******************
> > > The root cause is the sync RPC of K to the backup owner(s) of K
> > > while the primary owner holds the lock for K.
> > > *******************
> > >
> > > This causes a backlog of threads waiting for the lock, and that
> > > backlog can grow to exhaust the thread pools: first the Infinispan
> > > internal thread pool, then the JGroups OOB thread pool. The latter
> > > causes retransmissions to get dropped, which compounds the problem...
> > >
> > >
> > > Goal
> > > ----
> > > The goal is to make sure that primary and backup owner(s) of K
> > > have the same value for K.
> > >
> > > Simply sending the modification to the backup owner(s)
> > > asynchronously won't guarantee this, as modification messages might
> > > get processed out of order because they're OOB !
> > >
> > >
> > > Suggested solution
> > > ----------------
> > > The modification RPC needs to be invoked *outside of the lock
> > > scope*:
> > > - lock K
> > > - modify K
> > > - unlock K
> > > - send modification to backup owner(s) // outside the lock scope
> > >
> > > The primary owner puts the modification of K into a queue, from
> > > which a separate thread/task removes it. The thread then invokes
> > > the PUT(K) on the backup owner(s).
> > >
> > > The queue has the modified keys in FIFO order, so the modifications
> > > arrive at the backup owner(s) in the right order.
> > >
> > > This requires that the way GET is implemented changes slightly:
> > > instead of invoking a GET on all owners of K, we only invoke it on
> > > the primary owner, then the next-in-line etc.
> > >
> > > The reason for this is that the backup owner(s) may not yet have
> > > received the modification of K.
> > >
> > > This is a better impl anyway (we discussed this before) because it
> > > generates less traffic; in the normal case, all but 1 of the GET
> > > requests are unnecessary.
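> > >
> > > A sketch of such a staggered GET (again with hypothetical names;
> > > GET_TIMEOUT is assumed):
> > >
> > >     V staggeredGet(K key) {
> > >         for (Address owner : ownersOf(key)) { // primary owner first
> > >             try {
> > >                 return rpc.invokeSync(owner, new GetCommand<>(key),
> > >                                       GET_TIMEOUT);
> > >             } catch (TimeoutException e) {
> > >                 // fall through to the next-in-line owner
> > >             }
> > >         }
> > >         throw new TimeoutException("no owner of " + key + " responded");
> > >     }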
> > >
> > >
> > >
> > > Improvement
> > > -----------
> > > The above solution can be simplified and even made more efficient.
> > > Re-using concepts from IRAC [2], we can simply store the modified
> > > *keys* in the modification queue. The modification replication
> > > thread removes the key, gets the current value and invokes a
> > > PUT/REMOVE on the backup owner(s).
> > >
> > > Even better: a key is only ever added *once*, so if the queue holds
> > > [5,2,17,3], adding key 2 is a no-op because the processing of key 2
> > > (in second position in the queue) will fetch the up-to-date value
> > > anyway !
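> > >
> > > A LinkedHashSet gives exactly this behavior - FIFO order plus no-op
> > > re-adds (a sketch, same hypothetical names as before):
> > >
> > >     final Set<K> pendingKeys = new LinkedHashSet<>();
> > >
> > >     void enqueue(K key) {                  // called after unlock
> > >         synchronized (pendingKeys) {
> > >             pendingKeys.add(key);          // no-op if already pending
> > >             pendingKeys.notifyAll();
> > >         }
> > >     }
> > >
> > >     void replicationLoop() throws InterruptedException {
> > >         while (true) {
> > >             K key;
> > >             synchronized (pendingKeys) {
> > >                 while (pendingKeys.isEmpty())
> > >                     pendingKeys.wait();
> > >                 Iterator<K> it = pendingKeys.iterator();
> > >                 key = it.next();           // oldest pending key
> > >                 it.remove();
> > >             }
> > >             V current = dataContainer.get(key); // up-to-date value
> > >             rpc.invokeSync(backupOwner(key), current != null
> > >                 ? new PutCommand<>(key, current)
> > >                 : new RemoveCommand<>(key)); // key removed meanwhile
> > >         }
> > >     }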
> > >
> > >
> > > Misc
> > > ----
> > > - Could we possibly use total order to send the updates in TO ?
> > >   TBD (Pedro?)
> > >
> > >
> > > Thoughts ?
> > >
> > >
> > > [1] https://github.com/belaban/IspnPerfTest
> > > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
> > >
> >
> > --
> > Bela Ban, JGroups lead (http://www.jgroups.org)
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>