[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Wed Jul 30 05:00:45 EDT 2014

On 07/30/2014 09:02 AM, Dan Berindei wrote:
>

>
> if your proposal is only meant to apply to non-tx caches, you are right
> you don't have to worry about multiple primary owners... most of the
> time. But when the primary owner changes, then you do have 2 primary
> owners (if the new primary owner installs the new topology first), and
> you do need to coordinate between the 2.
>

I think it is the same for transactional cache. I.e. the commands wait 
for the transaction data from the new topology to be installed. In the 
non-tx caches, the old primary owner will send the next "sequence 
number" to the new primary owner and only after that, the new primary 
owner starts to give the orders.

Otherwise, I can implement a total order version for non-tx caches and 
all the write serialization would be done in JGroups and Infinispan only 
has to apply the updates as soon as they are delivered.

> Slightly related: we also considered generating a version number on the
> client for consistency when the HotRod client retries after a primary
> owner failure [1]. But the clients can't create a monotonic sequence
> number, so we couldn't use that version number for this.
>
> [1] https://issues.jboss.org/browse/ISPN-2956
>
>
>     Also I don't see it as an alternative to TOA, I rather expect it to
>     work nicely together: when TOA is enabled you could trust the
>     originating sequence source rather than generate a per-entry sequence,
>     and in neither case you need to actually use a Lock.
>     I haven't thought how the sequences would need to interact (if they
>     need), but they seem complementary to resolve different aspects, and
>     also both benefit from the same cleanup and basic structure.
>
>
> We don't acquire locks at all on the backup owners - either in tx or
> non-tx caches. If state transfer is in progress, we use
> ConcurrentHashMap.compute() to store tracking information, which uses a
> synchronized block, so I suppose we do acquire locks. I assume your
> proposal would require a DataContainer.compute() or something similar on
> the backups, to ensure that the version check and the replacement are
> atomic.
>
> I still think TOA does what you want for tx caches. Your proposal would
> only work for non-tx caches, so you couldn't use them together.
>
>
>      >> Another aspect is that the "user thread" on the primary owner
>     needs to
>      >> wait (at least until we improve further) and only proceed after ACK
>      >> from backup nodes, but this is better modelled through a state
>      >> machine. (Also discussed in Farnborough).
>      >
>      >
>      > To be clear, I don't think keeping the user thread on the
>     originator blocked
>      > until we have the write confirmations from all the backups is a
>     problem - a
>      > sync operation has to block, and it also serves to rate-limit user
>      > operations.
>
>
>     There are better ways to rate-limit than to make all operations slow;
>     we don't need to block a thread, we need to react on the reply from
>     the backup owners.
>     You still have an inherent rate-limit in the outgoing packet queues:
>     if these fill up, then and only then it's nice to introduce some back
>     pressure.
>
>
> Sorry, you got me confused when you called the thread on the primary
> owner a "user thread". I agree that internal stuff can and should be
> asynchronous, callback based, but the user still has to see a
> synchronous blocking operation.
>
>
>      > The problem appears when the originator is not the primary owner,
>     and the
>      > thread blocking for backup ACKs is from the remote-executor pool
>     (or OOB,
>      > when the remote-executor pool is exhausted).
>
>     Not following. I guess this is out of scope now that I clarified the
>     proposed solution is only to be applied between primary and backups?
>
>
> Yeah, I was just trying to clarify that there is no danger of exhausting
> the remote executor/OOB thread pools when the originator of the write
> command is the primary owner (as it happens in the HotRod server).
>
>
>      >>
>      >> It's also conceptually linked to:
>      >>  - https://issues.jboss.org/browse/ISPN-1599
>      >> As you need to separate the locks of entries from the effective user
>      >> facing lock, at least to implement transactions on top of this
>     model.
>      >
>      >
>      > I think we fixed ISPN-1599 when we changed passivation to use
>      > DataContainer.compute(). WDYT Pedro, is there anything else you'd
>     like to do
>      > in the scope of ISPN-1599?
>      >
>      >>
>      >> I expect this to improve performance in a very significant way, but
>      >> it's getting embarrassing that it's still not done; at the next face
>      >> to face meeting we should also reserve some time for retrospective
>      >> sessions.
>      >
>      >
>      > Implementing the state machine-based interceptor stack may give us a
>      > performance boost, but I'm much more certain that it's a very
>     complex, high
>      > risk task... and we don't have a stable test suite yet :)
>
>     Cleaning up and removing some complexity such as
>     TooManyExecutorsException might help to get it stable, and keep it
>     there :)
>     BTW it was quite stable for me until you changed the JGroups UDP
>     default configuration.
>
>
> Do you really use UDP to run the tests? The default is TCP, but maybe
> the some tests doesn't use TestCacheManagerFactory...
>
> I was just aligning our configs with Bela's recommandations: MERGE3
> instead of MERGE2 and the removal of UFC in TCP stacks. If they cause
> problems on your machine, you should make more noise :)
>
> Dan
>
>     Sanne
>
>      >
>      >
>      >>
>      >>
>      >> Sanne
>      >>
>      >> On 29 July 2014 15:50, Bela Ban <bban at redhat.com
>     <mailto:bban at redhat.com>> wrote:
>      >> >
>      >> >
>      >> > On 29/07/14 16:42, Dan Berindei wrote:
>      >> >> Have you tried regular optimistic/pessimistic transactions as
>     well?
>      >> >
>      >> > Yes, in my first impl. but since I'm making only 1 change per
>     request, I
>      >> > thought a TX is overkill.
>      >> >
>      >> >> They *should* have less issues with the OOB thread pool than
>     non-tx
>      >> >> mode, and
>      >> >> I'm quite curious how they stack against TO in such a large
>     cluster.
>      >> >
>      >> > Why would they have fewer issues with the thread pools ? AIUI,
>     a TX
>      >> > involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when
>     not using
>      >> > TXs. And we're sync anyway...
>      >> >
>      >> >
>      >> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban at redhat.com
>     <mailto:bban at redhat.com>
>      >> >> <mailto:bban at redhat.com <mailto:bban at redhat.com>>> wrote:
>      >> >>
>      >> >>     Following up on my own email, I changed the config to use
>     Pedro's
>      >> >>     excellent total order implementation:
>      >> >>
>      >> >>     <transaction transactionMode="TRANSACTIONAL"
>      >> >>     transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
>      >> >>     useEagerLocking="true" eagerLockSingleNode="true">
>      >> >>                   <recovery enabled="false"/>
>      >> >>
>      >> >>     With 100 nodes and 25 requester threads/node, I did NOT
>     run into
>      >> >> any
>      >> >>     locking issues !
>      >> >>
>      >> >>     I could even go up to 200 requester threads/node and the
>     perf was ~
>      >> >>     7'000-8'000 requests/sec/node. Not too bad !
>      >> >>
>      >> >>     This really validates the concept of lockless total-order
>      >> >> dissemination
>      >> >>     of TXs; for the first time, this has been tested on a
>     large(r)
>      >> >> scale
>      >> >>     (previously only on 25 nodes) and IT WORKS ! :-)
>      >> >>
>      >> >>     I still believe we should implement my suggested solution for
>      >> >> non-TO
>      >> >>     configs, but short of configuring thread pools of 1000
>     threads or
>      >> >>     higher, I hope TO will allow me to finally test a 500 node
>      >> >> Infinispan
>      >> >>     cluster !
>      >> >>
>      >> >>
>      >> >>     On 29/07/14 15:56, Bela Ban wrote:
>      >> >>      > Hi guys,
>      >> >>      >
>      >> >>      > sorry for the long post, but I do think I ran into an
>     important
>      >> >>     problem
>      >> >>      > and we need to fix it ... :-)
>      >> >>      >
>      >> >>      > I've spent the last couple of days running the
>     IspnPerfTest [1]
>      >> >>     perftest
>      >> >>      > on Google Compute Engine (GCE), and I've run into a
>     problem with
>      >> >>      > Infinispan. It is a design problem and can be mitigated by
>      >> >> sizing
>      >> >>     thread
>      >> >>      > pools correctly, but cannot be eliminated entirely.
>      >> >>      >
>      >> >>      >
>      >> >>      > Symptom:
>      >> >>      > --------
>      >> >>      > IspnPerfTest has every node in a cluster perform
>     20'000 requests
>      >> >>     on keys
>      >> >>      > in range [1..20000].
>      >> >>      >
>      >> >>      > 80% of the requests are reads and 20% writes.
>      >> >>      >
>      >> >>      > By default, we have 25 requester threads per node and
>     100 nodes
>      >> >> in a
>      >> >>      > cluster, so a total of 2500 requester threads.
>      >> >>      >
>      >> >>      > The cache used is NON-TRANSACTIONAL / dist-sync / 2
>     owners:
>      >> >>      >
>      >> >>      > <namedCache name="clusteredCache">
>      >> >>      >        <clustering mode="distribution">
>      >> >>      >            <stateTransfer awaitInitialTransfer="true"/>
>      >> >>      >            <hash numOwners="2"/>
>      >> >>      >            <sync replTimeout="20000"/>
>      >> >>      >        </clustering>
>      >> >>      >
>      >> >>      >        <transaction transactionMode="NON_TRANSACTIONAL"
>      >> >>      > useEagerLocking="true"
>      >> >>      >             eagerLockSingleNode="true"  />
>      >> >>      >        <locking lockAcquisitionTimeout="5000"
>      >> >> concurrencyLevel="1000"
>      >> >>      >                 isolationLevel="READ_COMMITTED"
>      >> >>     useLockStriping="false" />
>      >> >>      > </namedCache>
>      >> >>      >
>      >> >>      > It has 2 owners, a lock acquisition timeout of 5s and
>     a repl
>      >> >>     timeout of
>      >> >>      > 20s. Lock stripting is off, so we have 1 lock per key.
>      >> >>      >
>      >> >>      > When I run the test, I always get errors like those below:
>      >> >>      >
>      >> >>      > org.infinispan.util.concurrent.TimeoutException: Unable to
>      >> >>     acquire lock
>      >> >>      > after [10 seconds] on key [19386] for requestor
>      >> >>     [Thread[invoker-3,5,main]]!
>      >> >>      > Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
>      >> >>      >
>      >> >>      > and
>      >> >>      >
>      >> >>      > org.infinispan.util.concurrent.TimeoutException: Node
>     m8.1 timed
>      >> >> out
>      >> >>      >
>      >> >>      >
>      >> >>      > Investigation:
>      >> >>      > ------------
>      >> >>      > When I looked at UNICAST3, I saw a lot of missing
>     messages on
>      >> >> the
>      >> >>      > receive side and unacked messages on the send side.
>     This caused
>      >> >> me to
>      >> >>      > look into the (mainly OOB) thread pools and - voila -
>     maxed out
>      >> >> !
>      >> >>      >
>      >> >>      > I learned from Pedro that the Infinispan internal
>     thread pool
>      >> >> (with a
>      >> >>      > default of 32 threads) can be configured, so I
>     increased it to
>      >> >>     300 and
>      >> >>      > increased the OOB pools as well.
>      >> >>      >
>      >> >>      > This mitigated the problem somewhat, but when I
>     increased the
>      >> >>     requester
>      >> >>      > threads to 100, I had the same problem again.
>     Apparently, the
>      >> >>     Infinispan
>      >> >>      > internal thread pool uses a rejection policy of "run"
>     and thus
>      >> >>     uses the
>      >> >>      > JGroups (OOB) thread when exhausted.
>      >> >>      >
>      >> >>      > I learned (from Pedro and Mircea) that GETs and PUTs
>     work as
>      >> >>     follows in
>      >> >>      > dist-sync / 2 owners:
>      >> >>      > - GETs are sent to the primary and backup owners and
>     the first
>      >> >>     response
>      >> >>      > received is returned to the caller. No locks are
>     acquired, so
>      >> >> GETs
>      >> >>      > shouldn't cause problems.
>      >> >>      >
>      >> >>      > - A PUT(K) is sent to the primary owner of K
>      >> >>      > - The primary owner
>      >> >>      >        (1) locks K
>      >> >>      >        (2) updates the backup owner synchronously
>     *while holding
>      >> >>     the lock*
>      >> >>      >        (3) releases the lock
>      >> >>      >
>      >> >>      >
>      >> >>      > Hypothesis
>      >> >>      > ----------
>      >> >>      > (2) above is done while holding the lock. The sync
>     update of the
>      >> >>     backup
>      >> >>      > owner is done with the lock held to guarantee that the
>     primary
>      >> >> and
>      >> >>      > backup owner of K have the same values for K.
>      >> >>      >
>      >> >>      > However, the sync update *inside the lock scope* slows
>     things
>      >> >>     down (can
>      >> >>      > it also lead to deadlocks?); there's the risk that the
>     request
>      >> >> is
>      >> >>      > dropped due to a full incoming thread pool, or that
>     the response
>      >> >>     is not
>      >> >>      > received because of the same, or that the locking at
>     the backup
>      >> >> owner
>      >> >>      > blocks for some time.
>      >> >>      >
>      >> >>      > If we have many threads modifying the same key, then
>     we have a
>      >> >>     backlog
>      >> >>      > of locking work against that key. Say we have 100
>     requester
>      >> >>     threads and
>      >> >>      > a 100 node cluster. This means that we have 10'000 threads
>      >> >> accessing
>      >> >>      > keys; with 2'000 writers there's a big chance that
>     some writers
>      >> >>     pick the
>      >> >>      > same key at the same time.
>      >> >>      >
>      >> >>      > For example, if we have 100 threads accessing key K
>     and it takes
>      >> >>     3ms to
>      >> >>      > replicate K to the backup owner, then the last of the 100
>      >> >> threads
>      >> >>     waits
>      >> >>      > ~300ms before it gets a chance to lock K on the
>     primary owner
>      >> >> and
>      >> >>      > replicate it as well.
>      >> >>      >
>      >> >>      > Just a small hiccup in sending the PUT to the primary
>     owner,
>      >> >>     sending the
>      >> >>      > modification to the backup owner, waitting for the
>     response, or
>      >> >>     GC, and
>      >> >>      > the delay will quickly become bigger.
>      >> >>      >
>      >> >>      >
>      >> >>      > Verification
>      >> >>      > ----------
>      >> >>      > To verify the above, I set numOwners to 1. This means
>     that the
>      >> >>     primary
>      >> >>      > owner of K does *not* send the modification to the
>     backup owner,
>      >> >>     it only
>      >> >>      > locks K, modifies K and unlocks K again.
>      >> >>      >
>      >> >>      > I ran the IspnPerfTest again on 100 nodes, with 25
>     requesters,
>      >> >> and NO
>      >> >>      > PROBLEM !
>      >> >>      >
>      >> >>      > I then increased the requesters to 100, 150 and 200
>     and the test
>      >> >>      > completed flawlessly ! Performance was around *40'000
>     requests
>      >> >>     per node
>      >> >>      > per sec* on 4-core boxes !
>      >> >>      >
>      >> >>      >
>      >> >>      > Root cause
>      >> >>      > ---------
>      >> >>      > *******************
>      >> >>      > The root cause is the sync RPC of K to the backup
>     owner(s) of K
>      >> >> while
>      >> >>      > the primary owner holds the lock for K.
>      >> >>      > *******************
>      >> >>      >
>      >> >>      > This causes a backlog of threads waiting for the lock
>     and that
>      >> >>     backlog
>      >> >>      > can grow to exhaust the thread pools. First the Infinispan
>      >> >> internal
>      >> >>      > thread pool, then the JGroups OOB thread pool. The
>     latter causes
>      >> >>      > retransmissions to get dropped, which compounds the
>     problem...
>      >> >>      >
>      >> >>      >
>      >> >>      > Goal
>      >> >>      > ----
>      >> >>      > The goal is to make sure that primary and backup
>     owner(s) of K
>      >> >>     have the
>      >> >>      > same value for K.
>      >> >>      >
>      >> >>      > Simply sending the modification to the backup owner(s)
>      >> >> asynchronously
>      >> >>      > won't guarantee this, as modification messages might get
>      >> >>     processed out
>      >> >>      > of order as they're OOB !
>      >> >>      >
>      >> >>      >
>      >> >>      > Suggested solution
>      >> >>      > ----------------
>      >> >>      > The modification RPC needs to be invoked *outside of
>     the lock
>      >> >> scope*:
>      >> >>      > - lock K
>      >> >>      > - modify K
>      >> >>      > - unlock K
>      >> >>      > - send modification to backup owner(s) // outside the
>     lock scope
>      >> >>      >
>      >> >>      > The primary owner puts the modification of K into a
>     queue from
>      >> >>     where a
>      >> >>      > separate thread/task removes it. The thread then
>     invokes the
>      >> >>     PUT(K) on
>      >> >>      > the backup owner(s).
>      >> >>      >
>      >> >>      > The queue has the modified keys in FIFO order, so the
>      >> >> modifications
>      >> >>      > arrive at the backup owner(s) in the right order.
>      >> >>      >
>      >> >>      > This requires that the way GET is implemented changes
>     slightly:
>      >> >>     instead
>      >> >>      > of invoking a GET on all owners of K, we only invoke
>     it on the
>      >> >>     primary
>      >> >>      > owner, then the next-in-line etc.
>      >> >>      >
>      >> >>      > The reason for this is that the backup owner(s) may
>     not yet have
>      >> >>      > received the modification of K.
>      >> >>      >
>      >> >>      > This is a better impl anyway (we discussed this
>     before) becuse
>      >> >> it
>      >> >>      > generates less traffic; in the normal case, all but 1 GET
>      >> >>     requests are
>      >> >>      > unnecessary.
>      >> >>      >
>      >> >>      >
>      >> >>      >
>      >> >>      > Improvement
>      >> >>      > -----------
>      >> >>      > The above solution can be simplified and even made more
>      >> >> efficient.
>      >> >>      > Re-using concepts from IRAC [2], we can simply store the
>      >> >> modified
>      >> >>     *keys*
>      >> >>      > in the modification queue. The modification
>     replication thread
>      >> >>     removes
>      >> >>      > the key, gets the current value and invokes a
>     PUT/REMOVE on the
>      >> >>     backup
>      >> >>      > owner(s).
>      >> >>      >
>      >> >>      > Even better: a key is only ever added *once*, so if we
>     have
>      >> >>     [5,2,17,3],
>      >> >>      > adding key 2 is a no-op because the processing of key
>     2 (in
>      >> >> second
>      >> >>      > position in the queue) will fetch the up-to-date value
>     anyway !
>      >> >>      >
>      >> >>      >
>      >> >>      > Misc
>      >> >>      > ----
>      >> >>      > - Could we possibly use total order to send the
>     updates in TO ?
>      >> >>     TBD (Pedro?)
>      >> >>      >
>      >> >>      >
>      >> >>      > Thoughts ?
>      >> >>      >
>      >> >>      >
>      >> >>      > [1] https://github.com/belaban/IspnPerfTest
>      >> >>      > [2]
>      >> >>      >
>      >> >>
>      >> >>
>     https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
>      >> >>      >
>      >> >>
>      >> >>     --
>      >> >>     Bela Ban, JGroups lead (http://www.jgroups.org)
>      >> >>     _______________________________________________
>      >> >>     infinispan-dev mailing list
>      >> >> infinispan-dev at lists.jboss.org
>     <mailto:infinispan-dev at lists.jboss.org>
>      >> >> <mailto:infinispan-dev at lists.jboss.org
>     <mailto:infinispan-dev at lists.jboss.org>>
>      >> >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>      >> >>
>      >> >>
>      >> >>
>      >> >>
>      >> >> _______________________________________________
>      >> >> infinispan-dev mailing list
>      >> >> infinispan-dev at lists.jboss.org
>     <mailto:infinispan-dev at lists.jboss.org>
>      >> >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>      >> >>
>      >> >
>      >> > --
>      >> > Bela Ban, JGroups lead (http://www.jgroups.org)
>      >> > _______________________________________________
>      >> > infinispan-dev mailing list
>      >> > infinispan-dev at lists.jboss.org
>     <mailto:infinispan-dev at lists.jboss.org>
>      >> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>      >> _______________________________________________
>      >> infinispan-dev mailing list
>      >> infinispan-dev at lists.jboss.org
>     <mailto:infinispan-dev at lists.jboss.org>
>      >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>      >
>      >
>      >
>      > _______________________________________________
>      > infinispan-dev mailing list
>      > infinispan-dev at lists.jboss.org
>     <mailto:infinispan-dev at lists.jboss.org>
>      > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>     _______________________________________________
>     infinispan-dev mailing list
>     infinispan-dev at lists.jboss.org <mailto:infinispan-dev at lists.jboss.org>
>     https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>