[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Dan Berindei dan.berindei at gmail.com
Tue Jul 29 17:14:50 EDT 2014


On Tue, Jul 29, 2014 at 9:06 PM, Sanne Grinovero <sanne at infinispan.org>
wrote:

> This is a nasty problem and I also feel passionately we need to get
> rid of it ASAP.
> I did have the same problems many times, and we discussed this also in
> Farnborough; AFAIR Dan and Pedro had some excellent ideas to fix this.
>
> You don't need TO, and you don't need to lock at all, as long as you
> guarantee that the backup owners receive each update with a monotonic
> sequence number attached to it; all the backup owners need to do is
> ignore incoming commands which are outdated.
>

This is more or less what TOA does - assign a monotonic sequence number to
txs, and only apply them after all the previous txs in the sequence have
been applied. The problem is that getting that monotonic sequence when
there are multiple originators and multiple primary owners also requires
some extra RPCs.
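
Just to make the "ignore outdated commands" part concrete, the backup side
could look roughly like the sketch below. This is only an illustration:
SequencedBackupHandler/applySequenced are invented names, and it assumes the
primary owner assigns one monotonic sequence number per key while it holds
the lock (the multi-originator/multi-primary case is exactly the part that
still needs the extra RPCs).

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SequencedBackupHandler {

   // Highest sequence number applied so far, per key.
   private final ConcurrentMap<Object, Long> appliedSeq = new ConcurrentHashMap<>();
   // Simplified stand-in for the local data container.
   private final ConcurrentMap<Object, Object> store = new ConcurrentHashMap<>();

   // Applies a write from the primary owner, tagged with the sequence number
   // the primary assigned for this key. Stale commands are simply ignored,
   // so the backup needs no lock of its own.
   void applySequenced(Object key, Object value, long seq) {
      appliedSeq.compute(key, (k, last) -> {
         if (last != null && seq <= last) {
            return last;          // outdated command, ignore it
         }
         store.put(k, value);     // newer than anything applied so far
         return seq;
      });
   }
}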


> Another aspect is that the "user thread" on the primary owner needs to
> wait (at least until we improve further) and only proceed after ACK
> from backup nodes, but this is better modelled through a state
> machine. (Also discussed in Farnborough).
>

To be clear, I don't think keeping the user thread on the originator
blocked until we have the write confirmations from all the backups is a
problem - a sync operation has to block, and it also serves to rate-limit
user operations.
The problem appears when the originator is not the primary owner, and the
thread blocking for backup ACKs is from the remote-executor pool (or OOB,
when the remote-executor pool is exhausted).
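
To illustrate what "not blocking that thread" could look like, here is a
very rough callback-based sketch. All the names (BackupRpc,
handleForwardedWrite, etc.) are invented for the example; this is not how
the interceptor stack is structured today.

import java.util.concurrent.CompletableFuture;

class NonBlockingPrimarySketch {

   // Sends an update to the backup owners; the future completes when all ACK.
   interface BackupRpc {
      CompletableFuture<Void> replicate(Object key, Object value);
   }

   private final BackupRpc backupRpc;

   NonBlockingPrimarySketch(BackupRpc backupRpc) {
      this.backupRpc = backupRpc;
   }

   // Handles a write forwarded by a non-primary originator. The carrier
   // thread (remote-executor/OOB) returns to its pool right away; the
   // response to the originator is sent from the ACK callback instead.
   void handleForwardedWrite(Object key, Object value,
                             CompletableFuture<Object> responseToOriginator) {
      Object previous = applyLocally(key, value);   // lock, update, unlock
      backupRpc.replicate(key, value)
               .thenRun(() -> responseToOriginator.complete(previous));
   }

   private Object applyLocally(Object key, Object value) {
      return null;   // placeholder for the real locking + data container update
   }
}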


> It's also conceptually linked to:
>  - https://issues.jboss.org/browse/ISPN-1599
> As you need to separate the locks of entries from the effective
> user-facing lock, at least to implement transactions on top of this model.
>

I think we fixed ISPN-1599 when we changed passivation to use
DataContainer.compute(). WDYT Pedro, is there anything else you'd like to
do in the scope of ISPN-1599?
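
For anyone not following that change: the idea is that the data container
update becomes a single atomic compute() step, so it no longer needs to be
protected by the user-facing entry lock. A toy version of the pattern,
using ConcurrentHashMap as a stand-in for the real DataContainer (this is
not the actual Infinispan API):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class PassivationComputeSketch {

   // Stand-ins for the in-memory container and the cache store.
   private final ConcurrentMap<Object, Object> dataContainer = new ConcurrentHashMap<>();
   private final ConcurrentMap<Object, Object> cacheStore = new ConcurrentHashMap<>();

   // The store write and the in-memory removal happen inside one compute()
   // call, so they are atomic with respect to concurrent operations on the
   // same key - no separate entry lock is involved.
   void passivate(Object key) {
      dataContainer.compute(key, (k, value) -> {
         if (value != null) {
            cacheStore.put(k, value);   // persist the entry
         }
         return null;                   // returning null removes it from memory
      });
   }
}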


> I expect this to improve performance in a very significant way, but
> it's getting embarrassing that it's still not done; at the next face
> to face meeting we should also reserve some time for retrospective
> sessions.
>

Implementing the state machine-based interceptor stack may give us a
performance boost, but I'm much more certain that it's a very complex,
high-risk task... and we don't have a stable test suite yet :)



>
> Sanne
>
> On 29 July 2014 15:50, Bela Ban <bban at redhat.com> wrote:
> >
> >
> > On 29/07/14 16:42, Dan Berindei wrote:
> >> Have you tried regular optimistic/pessimistic transactions as well?
> >
> > Yes, in my first impl., but since I'm making only 1 change per request, I
> > thought a TX was overkill.
> >
> >> They *should* have fewer issues with the OOB thread pool than non-tx
> >> mode, and I'm quite curious how they stack up against TO in such a large
> >> cluster.
> >
> > Why would they have fewer issues with the thread pools ? AIUI, a TX
> > involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not using
> > TXs. And we're sync anyway...
> >
> >
> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban at redhat.com> wrote:
> >>
> >>     Following up on my own email, I changed the config to use Pedro's
> >>     excellent total order implementation:
> >>
> >>     <transaction transactionMode="TRANSACTIONAL"
> >>     transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
> >>     useEagerLocking="true" eagerLockSingleNode="true">
> >>                   <recovery enabled="false"/>
> >>
> >>     With 100 nodes and 25 requester threads/node, I did NOT run into any
> >>     locking issues !
> >>
> >>     I could even go up to 200 requester threads/node and the perf was ~
> >>     7'000-8'000 requests/sec/node. Not too bad !
> >>
> >>     This really validates the concept of lockless total-order
> >>     dissemination of TXs; for the first time, this has been tested on a
> >>     large(r) scale (previously only on 25 nodes) and IT WORKS ! :-)
> >>
> >>     I still believe we should implement my suggested solution for non-TO
> >>     configs, but short of configuring thread pools of 1000 threads or
> >>     higher, I hope TO will allow me to finally test a 500 node Infinispan
> >>     cluster !
> >>
> >>
> >>     On 29/07/14 15:56, Bela Ban wrote:
> >>      > Hi guys,
> >>      >
> >>      > sorry for the long post, but I do think I ran into an important
> >>      > problem and we need to fix it ... :-)
> >>      >
> >>      > I've spent the last couple of days running the IspnPerfTest [1]
> >>      > perftest on Google Compute Engine (GCE), and I've run into a
> >>      > problem with Infinispan. It is a design problem and can be
> >>      > mitigated by sizing thread pools correctly, but cannot be
> >>      > eliminated entirely.
> >>      >
> >>      >
> >>      > Symptom:
> >>      > --------
> >>      > IspnPerfTest has every node in a cluster perform 20'000 requests
> >>      > on keys in range [1..20000].
> >>      >
> >>      > 80% of the requests are reads and 20% writes.
> >>      >
> >>      > By default, we have 25 requester threads per node and 100 nodes
> >>      > in a cluster, so a total of 2500 requester threads.
> >>      >
> >>      > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> >>      >
> >>      > <namedCache name="clusteredCache">
> >>      >        <clustering mode="distribution">
> >>      >            <stateTransfer awaitInitialTransfer="true"/>
> >>      >            <hash numOwners="2"/>
> >>      >            <sync replTimeout="20000"/>
> >>      >        </clustering>
> >>      >
> >>      >        <transaction transactionMode="NON_TRANSACTIONAL"
> >>      >                     useEagerLocking="true"
> >>      >                     eagerLockSingleNode="true" />
> >>      >        <locking lockAcquisitionTimeout="5000"
> >>      >                 concurrencyLevel="1000"
> >>      >                 isolationLevel="READ_COMMITTED"
> >>      >                 useLockStriping="false" />
> >>      > </namedCache>
> >>      >
> >>      > It has 2 owners, a lock acquisition timeout of 5s and a repl
> >>      > timeout of 20s. Lock striping is off, so we have 1 lock per key.
> >>      >
> >>      > When I run the test, I always get errors like those below:
> >>      >
> >>      > org.infinispan.util.concurrent.TimeoutException: Unable to
> >>      > acquire lock after [10 seconds] on key [19386] for requestor
> >>      > [Thread[invoker-3,5,main]]!
> >>      > Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> >>      >
> >>      > and
> >>      >
> >>      > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
> >>      >
> >>      >
> >>      > Investigation:
> >>      > ------------
> >>      > When I looked at UNICAST3, I saw a lot of missing messages on the
> >>      > receive side and unacked messages on the send side. This caused
> >>      > me to look into the (mainly OOB) thread pools and - voila - maxed
> >>      > out !
> >>      >
> >>      > I learned from Pedro that the Infinispan internal thread pool
> >>      > (with a default of 32 threads) can be configured, so I increased
> >>      > it to 300 and increased the OOB pools as well.
> >>      >
> >>      > This mitigated the problem somewhat, but when I increased the
> >>      > requester threads to 100, I had the same problem again.
> >>      > Apparently, the Infinispan internal thread pool uses a rejection
> >>      > policy of "run" and thus uses the JGroups (OOB) thread when
> >>      > exhausted.
> >>      >
> >>      > I learned (from Pedro and Mircea) that GETs and PUTs work as
> >>      > follows in dist-sync / 2 owners:
> >>      > - GETs are sent to the primary and backup owners and the first
> >>      >   response received is returned to the caller. No locks are
> >>      >   acquired, so GETs shouldn't cause problems.
> >>      >
> >>      > - A PUT(K) is sent to the primary owner of K
> >>      > - The primary owner
> >>      >        (1) locks K
> >>      >        (2) updates the backup owner synchronously *while holding
> >>      >            the lock*
> >>      >        (3) releases the lock
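
Condensing that flow into pseudo-Java (illustrative names only, this is not
the actual interceptor code) - the important part is that step (2) waits for
a remote ACK while the key is still locked:

class PrimaryOwnerPutSketch {

   void handlePut(Object key, Object value) {
      lock(key);                        // (1) acquire the per-key lock
      try {
         updateLocally(key, value);
         syncRpcToBackups(key, value);  // (2) blocks on the backup ACK *inside* the lock scope
      } finally {
         unlock(key);                   // (3) release the lock
      }
   }

   private void lock(Object key) { /* per-key lock acquisition */ }
   private void unlock(Object key) { /* per-key lock release */ }
   private void updateLocally(Object key, Object value) { /* data container update */ }
   private void syncRpcToBackups(Object key, Object value) { /* sync RPC to backup owner(s) */ }
}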
> >>      >
> >>      >
> >>      > Hypothesis
> >>      > ----------
> >>      > (2) above is done while holding the lock. The sync update of the
> >>      > backup owner is done with the lock held to guarantee that the
> >>      > primary and backup owner of K have the same values for K.
> >>      >
> >>      > However, the sync update *inside the lock scope* slows things
> >>      > down (can it also lead to deadlocks?); there's the risk that the
> >>      > request is dropped due to a full incoming thread pool, or that
> >>      > the response is not received for the same reason, or that the
> >>      > locking at the backup owner blocks for some time.
> >>      >
> >>      > If we have many threads modifying the same key, then we have a
> >>      > backlog of locking work against that key. Say we have 100
> >>      > requester threads and a 100 node cluster. This means that we have
> >>      > 10'000 threads accessing keys; with 2'000 writers there's a big
> >>      > chance that some writers pick the same key at the same time.
> >>      >
> >>      > For example, if we have 100 threads accessing key K and it takes
> >>      > 3ms to replicate K to the backup owner, then the last of the 100
> >>      > threads waits ~300ms before it gets a chance to lock K on the
> >>      > primary owner and replicate it as well.
> >>      >
> >>      > Just a small hiccup in sending the PUT to the primary owner,
> >>      > sending the modification to the backup owner, waiting for the
> >>      > response, or GC, and the delay will quickly become bigger.
> >>      >
> >>      >
> >>      > Verification
> >>      > ----------
> >>      > To verify the above, I set numOwners to 1. This means that the
> >>      > primary owner of K does *not* send the modification to the backup
> >>      > owner, it only locks K, modifies K and unlocks K again.
> >>      >
> >>      > I ran the IspnPerfTest again on 100 nodes, with 25 requesters,
> >>      > and NO PROBLEM !
> >>      >
> >>      > I then increased the requesters to 100, 150 and 200 and the test
> >>      > completed flawlessly ! Performance was around *40'000 requests
> >>      > per node per sec* on 4-core boxes !
> >>      >
> >>      >
> >>      > Root cause
> >>      > ---------
> >>      > *******************
> >>      > The root cause is the sync RPC of K to the backup owner(s) of K
> >>      > while the primary owner holds the lock for K.
> >>      > *******************
> >>      >
> >>      > This causes a backlog of threads waiting for the lock and that
> >>      > backlog can grow to exhaust the thread pools. First the Infinispan
> >>      > internal thread pool, then the JGroups OOB thread pool. The latter
> >>      > causes retransmissions to get dropped, which compounds the
> >>      > problem...
> >>      >
> >>      >
> >>      > Goal
> >>      > ----
> >>      > The goal is to make sure that primary and backup owner(s) of K
> >>      > have the same value for K.
> >>      >
> >>      > Simply sending the modification to the backup owner(s)
> >>      > asynchronously won't guarantee this, as modification messages
> >>      > might get processed out of order as they're OOB !
> >>      >
> >>      >
> >>      > Suggested solution
> >>      > ----------------
> >>      > The modification RPC needs to be invoked *outside of the lock
> >>      > scope*:
> >>      > - lock K
> >>      > - modify K
> >>      > - unlock K
> >>      > - send modification to backup owner(s) // outside the lock scope
> >>      >
> >>      > The primary owner puts the modification of K into a queue from
> >>      > where a separate thread/task removes it. The thread then invokes
> >>      > the PUT(K) on the backup owner(s).
> >>      >
> >>      > The queue has the modified keys in FIFO order, so the
> >>      > modifications arrive at the backup owner(s) in the right order.
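
A minimal sketch of that queue/replication-thread split (invented names; it
assumes the enqueue itself still happens inside the lock scope, so the FIFO
order matches the lock order and only the RPC moves out):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueuedBackupReplicationSketch {

   static final class Modification {
      final Object key;
      final Object value;
      Modification(Object key, Object value) { this.key = key; this.value = value; }
   }

   private final BlockingQueue<Modification> queue = new LinkedBlockingQueue<>();

   void handlePut(Object key, Object value) {
      lock(key);
      try {
         updateLocally(key, value);
         queue.add(new Modification(key, value));  // cheap; keeps per-key FIFO = lock order
      } finally {
         unlock(key);
      }
      // no RPC here - the lock is already released when the backup update is sent
   }

   // Runs in a dedicated thread; drains the queue in FIFO order so the
   // backup owner(s) see the modifications in the order they were applied.
   void replicationLoop() throws InterruptedException {
      while (true) {
         Modification mod = queue.take();
         syncRpcToBackups(mod.key, mod.value);
      }
   }

   private void lock(Object key) { }
   private void unlock(Object key) { }
   private void updateLocally(Object key, Object value) { }
   private void syncRpcToBackups(Object key, Object value) { }
}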
> >>      >
> >>      > This requires that the way GET is implemented changes slightly:
> >>      > instead of invoking a GET on all owners of K, we only invoke it
> >>      > on the primary owner, then the next-in-line etc.
> >>      >
> >>      > The reason for this is that the backup owner(s) may not yet have
> >>      > received the modification of K.
> >>      >
> >>      > This is a better impl anyway (we discussed this before) because
> >>      > it generates less traffic; in the normal case, all but 1 of the
> >>      > GET requests are unnecessary.
> >>      >
> >>      >
> >>      >
> >>      > Improvement
> >>      > -----------
> >>      > The above solution can be simplified and even made more
> >>      > efficient. Re-using concepts from IRAC [2], we can simply store
> >>      > the modified *keys* in the modification queue. The modification
> >>      > replication thread removes the key, gets the current value and
> >>      > invokes a PUT/REMOVE on the backup owner(s).
> >>      >
> >>      > Even better: a key is only ever added *once*, so if we have
> >>      > [5,2,17,3], adding key 2 is a no-op because the processing of
> >>      > key 2 (in second position in the queue) will fetch the up-to-date
> >>      > value anyway !
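
The de-duplication falls out naturally if the queue is an insertion-ordered
set of keys; the replication task reads the *current* value only when it
dequeues the key and then does the PUT/REMOVE on the backup owner(s). Again
just a sketch with invented names:

import java.util.Iterator;
import java.util.LinkedHashSet;

class DirtyKeyQueueSketch {

   // Insertion-ordered set: FIFO order plus free de-duplication.
   private final LinkedHashSet<Object> dirtyKeys = new LinkedHashSet<>();

   // Called by the primary owner after it has applied the write locally.
   synchronized void markDirty(Object key) {
      dirtyKeys.add(key);   // no-op if the key is already queued
      notifyAll();
   }

   // Called by the replication thread, which then reads the current value
   // of the key and replicates it to the backup owner(s).
   synchronized Object takeNext() throws InterruptedException {
      while (dirtyKeys.isEmpty()) {
         wait();
      }
      Iterator<Object> it = dirtyKeys.iterator();
      Object key = it.next();
      it.remove();
      return key;
   }
}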
> >>      >
> >>      >
> >>      > Misc
> >>      > ----
> >>      > - Could we possibly use total order to send the updates in TO ?
> >>      >   TBD (Pedro?)
> >>      >
> >>      >
> >>      > Thoughts ?
> >>      >
> >>      >
> >>      > [1] https://github.com/belaban/IspnPerfTest
> >>      > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
> >>      >
> >>
> >>     --
> >>     Bela Ban, JGroups lead (http://www.jgroups.org)