On Wed, Jul 30, 2014 at 12:00 PM, Pedro Ruivo <pedro@infinispan.org> wrote:
On 07/30/2014 09:02 AM, Dan Berindei wrote:
>
>
> if your proposal is only meant to apply to non-tx caches, you are right
> that you don't have to worry about multiple primary owners... most of
> the time. But when the primary owner changes, then you do have 2 primary
> owners (if the new primary owner installs the new topology first), and
> you do need to coordinate between the 2.
>
I think it is the same for transactional caches, i.e. the commands wait
for the transaction data from the new topology to be installed. In
non-tx caches, the old primary owner would send the next "sequence
number" to the new primary owner, and only after that would the new
primary owner start to give the orders.
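A minimal sketch of that handoff (all names here are hypothetical, not
Infinispan's actual API): the old primary stops ordering writes and
exports its next sequence number, and the new primary only starts
giving orders once the handoff arrives.

    import java.util.concurrent.atomic.AtomicLong;

    class PrimarySequencer {
        private final AtomicLong nextSeq = new AtomicLong();
        private volatile boolean active;

        // called by the current primary owner for every write it orders
        long order() {
            if (!active) throw new IllegalStateException("not (yet) the primary owner");
            return nextSeq.getAndIncrement();
        }

        // old primary: stop ordering and export the next sequence number
        long handOff() {
            active = false;
            return nextSeq.get();
        }

        // new primary: import the sequence number, then start ordering
        void takeOver(long seq) {
            nextSeq.set(seq);
            active = true;
        }
    }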
I'm not sure that's related; commands that wait for a newer topology
have not blocked a thread since the ISPN-3527 fix.
Otherwise, I can implement a total order version for non-tx caches: all
the write serialization would be done in JGroups, and Infinispan would
only have to apply the updates as soon as they are delivered.
Right, that sounds quite interesting. But you'd also need a less-blocking
state transfer ;)
> Slightly related: we also considered generating a version number on
> the client for consistency when the HotRod client retries after a
> primary owner failure [1]. But the clients can't create a monotonic
> sequence number, so we couldn't use that version number for this.
>
> [1] https://issues.jboss.org/browse/ISPN-2956
>
>
> Also I don't see it as an alternative to TOA; rather, I expect it to
> work nicely together: when TOA is enabled you could trust the
> originating sequence source rather than generate a per-entry sequence,
> and in neither case do you need to actually use a Lock.
> I haven't thought about how the sequences would need to interact (if
> they need to), but they seem complementary, resolving different
> aspects, and both benefit from the same cleanup and basic structure.
>
>
> We don't acquire locks at all on the backup owners - either in tx or
> non-tx caches. If state transfer is in progress, we use
> ConcurrentHashMap.compute() to store tracking information, which uses a
> synchronized block, so I suppose we do acquire locks. I assume your
> proposal would require a DataContainer.compute() or something similar on
> the backups, to ensure that the version check and the replacement are
> atomic.
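A minimal sketch of that atomic check on a backup owner, using a plain
ConcurrentHashMap as a stand-in for the DataContainer (VersionedValue
is a hypothetical holder type):

    import java.util.concurrent.ConcurrentHashMap;

    class BackupApplier {
        static final class VersionedValue {
            final long version;
            final Object value;
            VersionedValue(long version, Object value) { this.version = version; this.value = value; }
        }

        final ConcurrentHashMap<Object, VersionedValue> container = new ConcurrentHashMap<>();

        void apply(Object key, long version, Object value) {
            // compute() runs atomically per key, so the version check and
            // the replacement cannot interleave with another update;
            // stale (out-of-order, OOB) deliveries are simply dropped
            container.compute(key, (k, cur) ->
                (cur == null || version > cur.version)
                    ? new VersionedValue(version, value) : cur);
        }
    }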
>
> I still think TOA does what you want for tx caches. Your proposal would
> only work for non-tx caches, so you couldn't use them together.
>
>
> >> Another aspect is that the "user thread" on the primary owner needs
> >> to wait (at least until we improve further) and only proceed after
> >> ACK from backup nodes, but this is better modelled through a state
> >> machine. (Also discussed in Farnborough).
> >
> >
> > To be clear, I don't think keeping the user thread on the originator
> > blocked until we have the write confirmations from all the backups is
> > a problem - a sync operation has to block, and it also serves to
> > rate-limit user operations.
>
>
> There are better ways to rate-limit than to make all operations slow;
> we don't need to block a thread, we need to react to the reply from
> the backup owners.
> You still have an inherent rate limit in the outgoing packet queues:
> if these fill up, then and only then is it nice to introduce some back
> pressure.
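A minimal sketch of that kind of back pressure (a bounded queue
standing in for the outgoing packet queue; the capacity and timeout are
made up):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    class OutgoingQueue {
        private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);

        void send(byte[] packet) throws InterruptedException {
            if (queue.offer(packet))
                return;                     // fast path: no blocking at all
            // queue full: only now is the caller made to wait (back pressure)
            if (!queue.offer(packet, 20, TimeUnit.SECONDS))
                throw new IllegalStateException("send queue still full after 20s");
        }
    }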
>
>
> Sorry, you got me confused when you called the thread on the primary
> owner a "user thread". I agree that internal stuff can and should be
> asynchronous, callback based, but the user still has to see a
> synchronous blocking operation.
>
>
> > The problem appears when the originator is not the primary owner, and
> > the thread blocking for backup ACKs is from the remote-executor pool
> > (or OOB, when the remote-executor pool is exhausted).
>
> Not following. I guess this is out of scope now that I clarified that
> the proposed solution is only to be applied between primary and backups?
>
>
> Yeah, I was just trying to clarify that there is no danger of exhausting
> the remote executor/OOB thread pools when the originator of the write
> command is the primary owner (as it happens in the HotRod server).
>
>
> >>
> >> It's also conceptually linked to:
> >> - https://issues.jboss.org/browse/ISPN-1599
> >> As you need to separate the locks of entries from the effective
> >> user-facing lock, at least to implement transactions on top of this
> >> model.
> >
> >
> > I think we fixed ISPN-1599 when we changed passivation to use
> > DataContainer.compute(). WDYT Pedro, is there anything else you'd
> > like to do in the scope of ISPN-1599?
> >
> >>
> >> I expect this to improve performance in a very significant way, but
> >> it's getting embarrassing that it's still not done; at the next
> >> face-to-face meeting we should also reserve some time for
> >> retrospective sessions.
> >
> >
> > Implementing the state machine-based interceptor stack may give us a
> > performance boost, but I'm much more certain that it's a very complex,
> > high-risk task... and we don't have a stable test suite yet :)
>
> Cleaning up and removing some complexity such as
> TooManyExecutorsException might help to get it stable, and keep it
> there :)
> BTW it was quite stable for me until you changed the JGroups UDP
> default configuration.
>
>
> Do you really use UDP to run the tests? The default is TCP, but maybe
> some tests don't use TestCacheManagerFactory...
>
> I was just aligning our configs with Bela's recommendations: MERGE3
> instead of MERGE2 and the removal of UFC in TCP stacks. If they cause
> problems on your machine, you should make more noise :)
>
> Dan
>
> Sanne
>
> >
> >
> >>
> >>
> >> Sanne
> >>
> >> On 29 July 2014 15:50, Bela Ban <bban@redhat.com> wrote:
> >> >
> >> >
> >> > On 29/07/14 16:42, Dan Berindei wrote:
> >> >> Have you tried regular optimistic/pessimistic transactions as well?
> >> >
> >> > Yes, in my first impl., but since I'm making only 1 change per
> >> > request, I thought a TX was overkill.
> >> >
> >> >> They *should* have fewer issues with the OOB thread pool than
> >> >> non-tx mode, and I'm quite curious how they stack up against TO
> >> >> in such a large cluster.
> >> >
> >> > Why would they have fewer issues with the thread pools ? AIUI, a TX
> >> > involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not
> >> > using TXs. And we're sync anyway...
> >> >
> >> >
> >> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban@redhat.com> wrote:
> >> >>
> >> >> Following up on my own email, I changed the config to use
> >> >> Pedro's excellent total order implementation:
> >> >>
> >> >> <transaction transactionMode="TRANSACTIONAL"
> >> >>              transactionProtocol="TOTAL_ORDER"
> >> >>              lockingMode="OPTIMISTIC"
> >> >>              useEagerLocking="true"
> >> >>              eagerLockSingleNode="true">
> >> >>     <recovery enabled="false"/>
> >> >> </transaction>
> >> >>
> >> >> With 100 nodes and 25 requester threads/node, I did NOT run into
> >> >> any locking issues !
> >> >>
> >> >> I could even go up to 200 requester threads/node and the perf was
> >> >> ~7'000-8'000 requests/sec/node. Not too bad !
> >> >>
> >> >> This really validates the concept of lockless total-order
> >> >> dissemination of TXs; for the first time, this has been tested on
> >> >> a large(r) scale (previously only on 25 nodes) and IT WORKS ! :-)
> >> >>
> >> >> I still believe we should implement my suggested solution for
> >> >> non-TO configs, but short of configuring thread pools of 1000
> >> >> threads or higher, I hope TO will allow me to finally test a 500
> >> >> node Infinispan cluster !
> >> >>
> >> >>
> >> >> On 29/07/14 15:56, Bela Ban wrote:
> >> >> > Hi guys,
> >> >> >
> >> >> > sorry for the long post, but I do think I ran into an important
> >> >> > problem and we need to fix it ... :-)
> >> >> >
> >> >> > I've spent the last couple of days running the IspnPerfTest [1]
> >> >> > perftest on Google Compute Engine (GCE), and I've run into a
> >> >> > problem with Infinispan. It is a design problem and can be
> >> >> > mitigated by sizing thread pools correctly, but cannot be
> >> >> > eliminated entirely.
> >> >> >
> >> >> >
> >> >> > Symptom:
> >> >> > --------
> >> >> > IspnPerfTest has every node in a cluster perform 20'000
> >> >> > requests on keys in the range [1..20000].
> >> >> >
> >> >> > 80% of the requests are reads and 20% writes.
> >> >> >
> >> >> > By default, we have 25 requester threads per node and 100 nodes
> >> >> > in a cluster, so a total of 2500 requester threads.
> >> >> >
> >> >> > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> >> >> >
> >> >> > <namedCache name="clusteredCache">
> >> >> >     <clustering mode="distribution">
> >> >> >         <stateTransfer awaitInitialTransfer="true"/>
> >> >> >         <hash numOwners="2"/>
> >> >> >         <sync replTimeout="20000"/>
> >> >> >     </clustering>
> >> >> >
> >> >> >     <transaction transactionMode="NON_TRANSACTIONAL"
> >> >> >                  useEagerLocking="true"
> >> >> >                  eagerLockSingleNode="true" />
> >> >> >     <locking lockAcquisitionTimeout="5000"
> >> >> >              concurrencyLevel="1000"
> >> >> >              isolationLevel="READ_COMMITTED"
> >> >> >              useLockStriping="false" />
> >> >> > </namedCache>
> >> >> >
> >> >> > It has 2 owners, a lock acquisition timeout of 5s and a repl
> >> >> > timeout of 20s. Lock striping is off, so we have 1 lock per key.
> >> >> >
> >> >> > When I run the test, I always get errors like those below:
> >> >> >
> >> >> > org.infinispan.util.concurrent.TimeoutException: Unable to
> >> >> > acquire lock after [10 seconds] on key [19386] for requestor
> >> >> > [Thread[invoker-3,5,main]]! Lock held by
> >> >> > [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> >> >> >
> >> >> > and
> >> >> >
> >> >> > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
> >> >> >
> >> >> >
> >> >> > Investigation:
> >> >> > --------------
> >> >> > When I looked at UNICAST3, I saw a lot of missing messages on
> >> >> > the receive side and unacked messages on the send side. This
> >> >> > caused me to look into the (mainly OOB) thread pools and - voila
> >> >> > - maxed out !
> >> >> >
> >> >> > I learned from Pedro that the Infinispan internal thread pool
> >> >> > (with a default of 32 threads) can be configured, so I increased
> >> >> > it to 300 and increased the OOB pools as well.
> >> >> >
> >> >> > This mitigated the problem somewhat, but when I increased the
> >> >> > requester threads to 100, I had the same problem again.
> >> >> > Apparently, the Infinispan internal thread pool uses a rejection
> >> >> > policy of "run" and thus uses the JGroups (OOB) thread when
> >> >> > exhausted.
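For illustration, the standard java.util.concurrent equivalent of such
a "run" policy (the Infinispan pool's actual wiring may differ): with
no task queue and all 32 threads busy, CallerRunsPolicy executes the
task on the submitting thread - here that would be the JGroups (OOB)
thread.

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    ThreadPoolExecutor internalPool = new ThreadPoolExecutor(
            32, 32,                                 // default of 32 threads, as above
            60, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>(),       // no queueing: reject when all busy
            new ThreadPoolExecutor.CallerRunsPolicy()); // "run" on the caller's thread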
> >> >> >
> >> >> > I learned (from Pedro and Mircea) that GETs and PUTs work as
> >> >> > follows in dist-sync / 2 owners:
> >> >> > - GETs are sent to the primary and backup owners, and the first
> >> >> >   response received is returned to the caller. No locks are
> >> >> >   acquired, so GETs shouldn't cause problems.
> >> >> >
> >> >> > - A PUT(K) is sent to the primary owner of K
> >> >> > - The primary owner
> >> >> >   (1) locks K
> >> >> >   (2) updates the backup owner synchronously *while holding the lock*
> >> >> >   (3) releases the lock
> >> >> >
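A minimal sketch of that write path (KeyLocks and BackupRpc are
hypothetical stand-ins for Infinispan's lock manager and RPC layer; the
point is only where the sync RPC sits):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.Lock;

    class PrimaryOwner {
        interface KeyLocks { Lock acquire(Object key, long time, TimeUnit unit); }
        interface BackupRpc { void invokeSync(Object key, Object value); } // blocks for the ACK

        final KeyLocks locks;
        final BackupRpc backup;
        final ConcurrentHashMap<Object, Object> dataContainer = new ConcurrentHashMap<>();

        PrimaryOwner(KeyLocks locks, BackupRpc backup) { this.locks = locks; this.backup = backup; }

        void put(Object key, Object value) {
            Lock lock = locks.acquire(key, 5, TimeUnit.SECONDS); // (1) lock K
            try {
                dataContainer.put(key, value);                   //     modify K
                backup.invokeSync(key, value);                   // (2) sync update of the backup
                                                                 //     *while holding the lock*
            } finally {
                lock.unlock();                                   // (3) release the lock
            }
        }
    }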
> >> >> >
> >> >> > Hypothesis
> >> >> > ----------
> >> >> > (2) above is done while holding the lock. The sync update of
> >> >> > the backup owner is done with the lock held to guarantee that
> >> >> > the primary and backup owner of K have the same values for K.
> >> >> >
> >> >> > However, the sync update *inside the lock scope* slows things
> >> >> > down (can it also lead to deadlocks?); there's the risk that the
> >> >> > request is dropped due to a full incoming thread pool, or that
> >> >> > the response is not received because of the same, or that the
> >> >> > locking at the backup owner blocks for some time.
> >> >> >
> >> >> > If we have many threads modifying the same key, then we have a
> >> >> > backlog of locking work against that key. Say we have 100
> >> >> > requester threads and a 100 node cluster. This means that we
> >> >> > have 10'000 threads accessing keys; with 2'000 writers there's a
> >> >> > big chance that some writers pick the same key at the same time.
> >> >> >
> >> >> > For example, if we have 100 threads accessing key K and it
> >> >> > takes 3ms to replicate K to the backup owner, then the last of
> >> >> > the 100 threads waits ~300ms before it gets a chance to lock K
> >> >> > on the primary owner and replicate it as well.
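As a back-of-the-envelope check (assuming perfectly serialized lock
handoffs at the 3ms replication time above):

    $ t_{wait} \approx (n - 1) \cdot t_{repl} = 99 \times 3\,\mathrm{ms} \approx 300\,\mathrm{ms} $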
> >> >> >
> >> >> > Just a small hiccup in sending the PUT to the primary owner,
> >> >> > sending the modification to the backup owner, waiting for the
> >> >> > response, or GC, and the delay will quickly become bigger.
> >> >> >
> >> >> >
> >> >> > Verification
> >> >> > ------------
> >> >> > To verify the above, I set numOwners to 1. This means that the
> >> >> > primary owner of K does *not* send the modification to the
> >> >> > backup owner; it only locks K, modifies K and unlocks K again.
> >> >> >
> >> >> > I ran the IspnPerfTest again on 100 nodes, with 25 requesters,
> >> >> > and NO PROBLEM !
> >> >> >
> >> >> > I then increased the requesters to 100, 150 and 200 and the
> >> >> > test completed flawlessly ! Performance was around *40'000
> >> >> > requests per node per sec* on 4-core boxes !
> >> >> >
> >> >> >
> >> >> > Root cause
> >> >> > ----------
> >> >> > *******************
> >> >> > The root cause is the sync RPC of K to the backup owner(s) of K
> >> >> > while the primary owner holds the lock for K.
> >> >> > *******************
> >> >> >
> >> >> > This causes a backlog of threads waiting for the lock, and that
> >> >> > backlog can grow to exhaust the thread pools: first the
> >> >> > Infinispan internal thread pool, then the JGroups OOB thread
> >> >> > pool. The latter causes retransmissions to get dropped, which
> >> >> > compounds the problem...
> >> >> >
> >> >> >
> >> >> > Goal
> >> >> > ----
> >> >> > The goal is to make sure that the primary and backup owner(s)
> >> >> > of K have the same value for K.
> >> >> >
> >> >> > Simply sending the modification to the backup owner(s)
> >> >> > asynchronously won't guarantee this, as modification messages
> >> >> > might get processed out of order as they're OOB !
> >> >> >
> >> >> >
> >> >> > Suggested solution
> >> >> > ------------------
> >> >> > The modification RPC needs to be invoked *outside of the lock
> >> >> > scope*:
> >> >> > - lock K
> >> >> > - modify K
> >> >> > - unlock K
> >> >> > - send modification to backup owner(s) // outside the lock scope
> >> >> >
> >> >> > The primary owner puts the modification of K into a queue, from
> >> >> > where a separate thread/task removes it. The thread then invokes
> >> >> > the PUT(K) on the backup owner(s).
> >> >> >
> >> >> > The queue has the modified keys in FIFO order, so the
> >> >> > modifications arrive at the backup owner(s) in the right order.
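A minimal sketch of the queue plus replication thread (BackupRpc is
again a hypothetical stand-in for the RPC layer):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class AsyncBackupReplicator implements Runnable {
        interface BackupRpc { void invokeSync(Object key, Object value); } // hypothetical

        static final class Modification {
            final Object key, value;
            Modification(Object key, Object value) { this.key = key; this.value = value; }
        }

        private final BlockingQueue<Modification> queue = new LinkedBlockingQueue<>();
        private final BackupRpc backup;

        AsyncBackupReplicator(BackupRpc backup) { this.backup = backup; }

        // called by the primary owner *after* unlocking K
        void enqueue(Object key, Object value) {
            queue.add(new Modification(key, value));
        }

        // a single replication thread drains the queue in FIFO order, so
        // updates reach the backup owner(s) in the order the primary applied them
        @Override public void run() {
            try {
                while (true) {
                    Modification m = queue.take();
                    backup.invokeSync(m.key, m.value);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }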
> >> >> >
> >> >> > This requires that the way GET is implemented changes slightly:
> >> >> > instead of invoking a GET on all owners of K, we only invoke it
> >> >> > on the primary owner, then the next-in-line etc.
> >> >> >
> >> >> > The reason for this is that the backup owner(s) may not yet
> >> >> > have received the modification of K.
> >> >> >
> >> >> > This is a better impl anyway (we discussed this before) because
> >> >> > it generates less traffic; in the normal case, all but one of
> >> >> > the GET requests are unnecessary.
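A minimal sketch of the staggered GET (Rpc and Address are hypothetical
stand-ins; the 500ms stagger timeout is made up):

    import java.util.List;
    import java.util.concurrent.TimeoutException;

    class StaggeredGet {
        interface Address {}
        interface Rpc { Object invokeSync(Address target, Object key, long timeoutMs)
                throws TimeoutException; }

        private final Rpc rpc;
        StaggeredGet(Rpc rpc) { this.rpc = rpc; }

        // try the primary owner first; only on timeout fall back to the
        // next-in-line owner, instead of multicasting the GET to all owners
        Object get(Object key, List<Address> ownersInOrder) throws TimeoutException {
            for (Address owner : ownersInOrder) {
                try {
                    return rpc.invokeSync(owner, key, 500);
                } catch (TimeoutException e) {
                    // this owner didn't answer in time - try the next one
                }
            }
            throw new TimeoutException("no owner answered for " + key);
        }
    }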
> >> >> >
> >> >> >
> >> >> >
> >> >> > Improvement
> >> >> > -----------
> >> >> > The above solution can be simplified and even made more
> >> >> > efficient. Re-using concepts from IRAC [2], we can simply store
> >> >> > the modified *keys* in the modification queue. The modification
> >> >> > replication thread removes the key, gets the current value and
> >> >> > invokes a PUT/REMOVE on the backup owner(s).
> >> >> >
> >> >> > Even better: a key is only ever added *once*, so if we have
> >> >> > [5,2,17,3], adding key 2 is a no-op because the processing of
> >> >> > key 2 (in second position in the queue) will fetch the
> >> >> > up-to-date value anyway !
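A minimal sketch of such a key-only, deduplicating FIFO queue (a
LinkedHashSet gives both insertion order and the add-once property):

    import java.util.LinkedHashSet;

    class KeyReplicationQueue {
        private final LinkedHashSet<Object> keys = new LinkedHashSet<>();

        synchronized void add(Object key) {
            keys.add(key);          // no-op if the key is already queued
            notifyAll();
        }

        // the replication thread takes the oldest key, reads the *current*
        // value from the data container and PUTs/REMOVEs it on the backup(s)
        synchronized Object take() throws InterruptedException {
            while (keys.isEmpty())
                wait();
            Object key = keys.iterator().next();  // FIFO: oldest queued key
            keys.remove(key);
            return key;
        }
    }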
> >> >> >
> >> >> >
> >> >> > Misc
> >> >> > ----
> >> >> > - Could we possibly use total order to send the updates in TO ?
> >> >> >   TBD (Pedro?)
> >> >> >
> >> >> >
> >> >> > Thoughts ?
> >> >> >
> >> >> >
> >> >> > [1] https://github.com/belaban/IspnPerfTest
> >> >> > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-...
> >> >> >
> >> >>
> >> >> --
> >> >> Bela Ban, JGroups lead (http://www.jgroups.org)
> >> >>
> >> >
> >> > --
> >> > Bela Ban, JGroups lead (http://www.jgroups.org)
> >
> >
> >
>
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev