[infinispan-dev] DIST-SYNC, put(), a problem and a solution
Dan Berindei
dan.berindei at gmail.com
Wed Jul 30 05:13:38 EDT 2014
On Wed, Jul 30, 2014 at 12:00 PM, Pedro Ruivo <pedro at infinispan.org> wrote:
>
>
> On 07/30/2014 09:02 AM, Dan Berindei wrote:
> >
>
> >
> > if your proposal is only meant to apply to non-tx caches, you are right
> > you don't have to worry about multiple primary owners... most of the
> > time. But when the primary owner changes, then you do have 2 primary
> > owners (if the new primary owner installs the new topology first), and
> > you do need to coordinate between the 2.
> >
>
> I think it is the same for transactional cache. I.e. the commands wait
> for the transaction data from the new topology to be installed. In the
> non-tx caches, the old primary owner will send the next "sequence
> number" to the new primary owner and only after that, the new primary
> owner starts to give the orders.
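>
> In code the handover could look something like this (pure sketch,
> every name below is made up):
>
>    // per-segment counter on the primary owner
>    class SegmentSequencer {
>       private final AtomicLong seq = new AtomicLong();
>
>       long next() { return seq.incrementAndGet(); }
>
>       // the old primary ships the next sequence number; only after
>       // receiving it does the new primary start giving the orders
>       void handOverTo(Address newPrimary) {
>          rpc.send(newPrimary, new SeqHandoverCommand(seq.get() + 1));
>       }
>    }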
>
I'm not sure that's related: commands that wait for a newer topology
have not blocked a thread since the ISPN-3527 fix.
>
> Otherwise, I can implement a total order version for non-tx caches:
> all the write serialization would be done in JGroups, and Infinispan
> would only have to apply the updates as soon as they are delivered.
>
Right, that sounds quite interesting. But you'd also need a less-blocking
state transfer ;)
> > Slightly related: we also considered generating a version number on the
> > client for consistency when the HotRod client retries after a primary
> > owner failure [1]. But the clients can't create a monotonic sequence
> > number, so we couldn't use that version number for this.
> >
> > [1] https://issues.jboss.org/browse/ISPN-2956
> >
> >
> > Also I don't see it as an alternative to TOA, I rather expect it to
> > work nicely together: when TOA is enabled you could trust the
> > originating sequence source rather than generate a per-entry
> > sequence, and in neither case do you need to actually use a Lock.
> > I haven't thought about how the sequences would need to interact (if
> > they need to), but they seem complementary, resolving different
> > aspects, and both benefit from the same cleanup and basic structure.
> >
> >
> > We don't acquire locks at all on the backup owners - either in tx or
> > non-tx caches. If state transfer is in progress, we use
> > ConcurrentHashMap.compute() to store tracking information, which uses a
> > synchronized block, so I suppose we do acquire locks. I assume your
> > proposal would require a DataContainer.compute() or something similar on
> > the backups, to ensure that the version check and the replacement are
> > atomic.
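> >
> > Something along these lines on the backup (a sketch only - a
> > compute() on the data container doesn't exist today, and the version
> > carried by the command is assumed):
> >
> >    // atomic per-key version check + replace on the backup owner
> >    dataContainer.compute(key, (k, oldEntry) ->
> >       (oldEntry != null && oldEntry.getVersion() >= cmd.getVersion())
> >          ? oldEntry   // incoming update is stale, keep current value
> >          : entryFactory.create(k, cmd.getValue(), cmd.getVersion()));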
> >
> > I still think TOA does what you want for tx caches. Your proposal would
> > only work for non-tx caches, so you couldn't use them together.
> >
> >
> > >> Another aspect is that the "user thread" on the primary owner
> > >> needs to wait (at least until we improve further) and only proceed
> > >> after ACK from backup nodes, but this is better modelled through a
> > >> state machine. (Also discussed in Farnborough.)
> > >
> > >
> > > To be clear, I don't think keeping the user thread on the
> > > originator blocked until we have the write confirmations from all
> > > the backups is a problem - a sync operation has to block, and it
> > > also serves to rate-limit user operations.
> >
> >
> > There are better ways to rate-limit than to make all operations slow;
> > we don't need to block a thread, we need to react on the reply from
> > the backup owners.
> > You still have an inherent rate-limit in the outgoing packet queues:
> > if these fill up, then and only then it's nice to introduce some back
> > pressure.
> >
> >
> > Sorry, you got me confused when you called the thread on the primary
> > owner a "user thread". I agree that internal stuff can and should be
> > asynchronous, callback based, but the user still has to see a
> > synchronous blocking operation.
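> >
> > I.e. internally something like (a sketch; these method names are
> > invented, not the actual Infinispan API):
> >
> >    // internal pipeline: callback-based, no thread parked waiting
> >    CompletableFuture<Object> f = invokeOnPrimaryAsync(cmd)
> >       .thenCompose(rv -> waitForBackupAcksAsync(cmd)
> >          .thenApply(acks -> rv));
> >
> >    // the user-facing sync operation is the only place that blocks
> >    Object previousValue = f.join();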
> >
> >
> > > The problem appears when the originator is not the primary owner,
> > > and the thread blocking for backup ACKs is from the remote-executor
> > > pool (or OOB, when the remote-executor pool is exhausted).
> >
> > Not following. I guess this is out of scope now that I clarified the
> > proposed solution is only to be applied between primary and backups?
> >
> >
> > Yeah, I was just trying to clarify that there is no danger of exhausting
> > the remote executor/OOB thread pools when the originator of the write
> > command is the primary owner (as it happens in the HotRod server).
> >
> >
> > >>
> > >> It's also conceptually linked to:
> > >> - https://issues.jboss.org/browse/ISPN-1599
> > >> As you need to separate the locks of entries from the effective
> > >> user-facing lock, at least to implement transactions on top of
> > >> this model.
> > >
> > >
> > > I think we fixed ISPN-1599 when we changed passivation to use
> > > DataContainer.compute(). WDYT Pedro, is there anything else you'd
> > > like to do in the scope of ISPN-1599?
> > >
> > >>
> > >> I expect this to improve performance in a very significant way,
> > >> but it's getting embarrassing that it's still not done; at the
> > >> next face-to-face meeting we should also reserve some time for
> > >> retrospective sessions.
> > >
> > >
> > > Implementing the state machine-based interceptor stack may give us
> > > a performance boost, but I'm much more certain that it's a very
> > > complex, high-risk task... and we don't have a stable test suite
> > > yet :)
> >
> > Cleaning up and removing some complexity such as
> > TooManyExecutorsException might help to get it stable, and keep it
> > there :)
> > BTW it was quite stable for me until you changed the JGroups UDP
> > default configuration.
> >
> >
> > Do you really use UDP to run the tests? The default is TCP, but
> > maybe some tests don't use TestCacheManagerFactory...
> >
> > I was just aligning our configs with Bela's recommendations: MERGE3
> > instead of MERGE2 and the removal of UFC in TCP stacks. If they cause
> > problems on your machine, you should make more noise :)
> >
> > Dan
> >
> > Sanne
> >
> > >
> > >
> > >>
> > >>
> > >> Sanne
> > >>
> > >> On 29 July 2014 15:50, Bela Ban <bban at redhat.com> wrote:
> > >> >
> > >> >
> > >> > On 29/07/14 16:42, Dan Berindei wrote:
> > >> >> Have you tried regular optimistic/pessimistic transactions as
> > >> >> well?
> > >> >
> > >> > Yes, in my first impl., but since I'm making only 1 change per
> > >> > request, I thought a TX is overkill.
> > >> >
> > >> >> They *should* have fewer issues with the OOB thread pool than
> > >> >> non-tx mode, and I'm quite curious how they stack up against TO
> > >> >> in such a large cluster.
> > >> >
> > >> > Why would they have fewer issues with the thread pools? AIUI, a
> > >> > TX involves 2 RPCs (PREPARE + COMMIT/ROLLBACK) compared to one
> > >> > when not using TXs. And we're sync anyway...
> > >> >
> > >> >
> > >> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban at redhat.com> wrote:
> > >> >>
> > >> >> Following up on my own email, I changed the config to use
> > >> >> Pedro's excellent total order implementation:
> > >> >>
> > >> >> <transaction transactionMode="TRANSACTIONAL"
> > >> >>       transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
> > >> >>       useEagerLocking="true" eagerLockSingleNode="true">
> > >> >>    <recovery enabled="false"/>
> > >> >> </transaction>
> > >> >>
> > >> >> With 100 nodes and 25 requester threads/node, I did NOT run
> > >> >> into any locking issues!
> > >> >>
> > >> >> I could even go up to 200 requester threads/node and the perf
> > >> >> was ~7'000-8'000 requests/sec/node. Not too bad!
> > >> >>
> > >> >> This really validates the concept of lockless total-order
> > >> >> dissemination of TXs; for the first time, this has been tested
> > >> >> on a large(r) scale (previously only on 25 nodes) and IT
> > >> >> WORKS! :-)
> > >> >>
> > >> >> I still believe we should implement my suggested solution for
> > >> >> non-TO configs, but short of configuring thread pools of 1000
> > >> >> threads or higher, I hope TO will allow me to finally test a
> > >> >> 500-node Infinispan cluster!
> > >> >>
> > >> >>
> > >> >> On 29/07/14 15:56, Bela Ban wrote:
> > >> >> > Hi guys,
> > >> >> >
> > >> >> > sorry for the long post, but I do think I ran into an
> > >> >> > important problem and we need to fix it ... :-)
> > >> >> >
> > >> >> > I've spent the last couple of days running the IspnPerfTest
> > >> >> > [1] perftest on Google Compute Engine (GCE), and I've run
> > >> >> > into a problem with Infinispan. It is a design problem and
> > >> >> > can be mitigated by sizing thread pools correctly, but cannot
> > >> >> > be eliminated entirely.
> > >> >> >
> > >> >> >
> > >> >> > Symptom:
> > >> >> > --------
> > >> >> > IspnPerfTest has every node in a cluster perform 20'000
> > >> >> > requests on keys in range [1..20000].
> > >> >> >
> > >> >> > 80% of the requests are reads and 20% writes.
> > >> >> >
> > >> >> > By default, we have 25 requester threads per node and 100
> > >> >> > nodes in a cluster, so a total of 2500 requester threads.
> > >> >> >
> > >> >> > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> > >> >> >
> > >> >> > <namedCache name="clusteredCache">
> > >> >> >    <clustering mode="distribution">
> > >> >> >       <stateTransfer awaitInitialTransfer="true"/>
> > >> >> >       <hash numOwners="2"/>
> > >> >> >       <sync replTimeout="20000"/>
> > >> >> >    </clustering>
> > >> >> >
> > >> >> >    <transaction transactionMode="NON_TRANSACTIONAL"
> > >> >> >       useEagerLocking="true" eagerLockSingleNode="true" />
> > >> >> >    <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
> > >> >> >       isolationLevel="READ_COMMITTED" useLockStriping="false" />
> > >> >> > </namedCache>
> > >> >> >
> > >> >> > It has 2 owners, a lock acquisition timeout of 5s and a repl
> > >> >> > timeout of 20s. Lock striping is off, so we have 1 lock per
> > >> >> > key.
> > >> >> >
> > >> >> > When I run the test, I always get errors like those below:
> > >> >> >
> > >> >> > org.infinispan.util.concurrent.TimeoutException: Unable to
> > >> >> > acquire lock after [10 seconds] on key [19386] for requestor
> > >> >> > [Thread[invoker-3,5,main]]! Lock held by
> > >> >> > [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> > >> >> >
> > >> >> > and
> > >> >> >
> > >> >> > org.infinispan.util.concurrent.TimeoutException: Node m8.1
> > >> >> > timed out
> > >> >> >
> > >> >> >
> > >> >> > Investigation:
> > >> >> > --------------
> > >> >> > When I looked at UNICAST3, I saw a lot of missing messages on
> > >> >> > the receive side and unacked messages on the send side. This
> > >> >> > caused me to look into the (mainly OOB) thread pools and -
> > >> >> > voila - maxed out!
> > >> >> >
> > >> >> > I learned from Pedro that the Infinispan internal thread
> > >> >> > pool (with a default of 32 threads) can be configured, so I
> > >> >> > increased it to 300 and increased the OOB pools as well.
> > >> >> >
> > >> >> > This mitigated the problem somewhat, but when I increased
> > >> >> > the requester threads to 100, I had the same problem again.
> > >> >> > Apparently, the Infinispan internal thread pool uses a
> > >> >> > rejection policy of "run" and thus uses the JGroups (OOB)
> > >> >> > thread when exhausted.
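> > >> >> >
> > >> >> > (That "run" policy behaves like the JDK's CallerRunsPolicy;
> > >> >> > a minimal java.util.concurrent illustration, sizes made up:)
> > >> >> >
> > >> >> >    // when all 32 threads are busy, the task runs on the
> > >> >> >    // *calling* thread - here, a JGroups OOB thread
> > >> >> >    ThreadPoolExecutor pool = new ThreadPoolExecutor(
> > >> >> >       32, 32, 60L, TimeUnit.SECONDS,
> > >> >> >       new SynchronousQueue<>(),
> > >> >> >       new ThreadPoolExecutor.CallerRunsPolicy());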
> > >> >> >
> > >> >> > I learned (from Pedro and Mircea) that GETs and PUTs work as
> > >> >> > follows in dist-sync / 2 owners:
> > >> >> > - GETs are sent to the primary and backup owners and the
> > >> >> >   first response received is returned to the caller. No locks
> > >> >> >   are acquired, so GETs shouldn't cause problems.
> > >> >> >
> > >> >> > - A PUT(K) is sent to the primary owner of K
> > >> >> > - The primary owner (see the sketch below)
> > >> >> >   (1) locks K
> > >> >> >   (2) updates the backup owner synchronously *while holding
> > >> >> >       the lock*
> > >> >> >   (3) releases the lock
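> > >> >> >
> > >> >> > In pseudo-Java (helper names invented for illustration):
> > >> >> >
> > >> >> >    lock(K);                     // (1)
> > >> >> >    try {
> > >> >> >       updateLocally(K, V);
> > >> >> >       syncRpcToBackups(K, V);   // (2) blocks, lock still held!
> > >> >> >    } finally {
> > >> >> >       unlock(K);                // (3)
> > >> >> >    }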
> > >> >> >
> > >> >> >
> > >> >> > Hypothesis
> > >> >> > ----------
> > >> >> > (2) above is done while holding the lock. The sync update of
> > >> >> > the backup owner is done with the lock held to guarantee that
> > >> >> > the primary and backup owner of K have the same values for K.
> > >> >> >
> > >> >> > However, the sync update *inside the lock scope* slows things
> > >> >> > down (can it also lead to deadlocks?); there's the risk that
> > >> >> > the request is dropped due to a full incoming thread pool, or
> > >> >> > that the response is not received because of the same, or
> > >> >> > that the locking at the backup owner blocks for some time.
> > >> >> >
> > >> >> > If we have many threads modifying the same key, then we have
> > >> >> > a backlog of locking work against that key. Say we have 100
> > >> >> > requester threads and a 100-node cluster. This means that we
> > >> >> > have 10'000 threads accessing keys; with 2'000 writers (the
> > >> >> > 20% of writes) there's a big chance that some writers pick
> > >> >> > the same key at the same time.
> > >> >> >
> > >> >> > For example, if we have 100 threads accessing key K and it
> > >> >> > takes 3ms to replicate K to the backup owner, then the last
> > >> >> > of the 100 threads waits ~300ms (~100 x 3ms, since the lock
> > >> >> > serializes the replications) before it gets a chance to lock
> > >> >> > K on the primary owner and replicate it as well.
> > >> >> >
> > >> >> > Just a small hiccup in sending the PUT to the primary owner,
> > >> >> > sending the modification to the backup owner, waiting for the
> > >> >> > response, or GC, and the delay will quickly become bigger.
> > >> >> >
> > >> >> >
> > >> >> > Verification
> > >> >> > ------------
> > >> >> > To verify the above, I set numOwners to 1. This means that
> > >> >> > the primary owner of K does *not* send the modification to
> > >> >> > the backup owner; it only locks K, modifies K and unlocks K
> > >> >> > again.
> > >> >> >
> > >> >> > I ran the IspnPerfTest again on 100 nodes, with 25
> > >> >> > requesters, and NO PROBLEM!
> > >> >> >
> > >> >> > I then increased the requesters to 100, 150 and 200 and the
> > >> >> > test completed flawlessly! Performance was around *40'000
> > >> >> > requests per node per sec* on 4-core boxes!
> > >> >> >
> > >> >> >
> > >> >> > Root cause
> > >> >> > ----------
> > >> >> > *******************
> > >> >> > The root cause is the sync RPC of K to the backup owner(s)
> > >> >> > of K while the primary owner holds the lock for K.
> > >> >> > *******************
> > >> >> >
> > >> >> > This causes a backlog of threads waiting for the lock, and
> > >> >> > that backlog can grow to exhaust the thread pools: first the
> > >> >> > Infinispan internal thread pool, then the JGroups OOB thread
> > >> >> > pool. The latter causes retransmissions to get dropped, which
> > >> >> > compounds the problem...
> > >> >> >
> > >> >> >
> > >> >> > Goal
> > >> >> > ----
> > >> >> > The goal is to make sure that primary and backup owner(s) of
> > >> >> > K have the same value for K.
> > >> >> >
> > >> >> > Simply sending the modification to the backup owner(s)
> > >> >> > asynchronously won't guarantee this, as modification messages
> > >> >> > might get processed out of order because they're OOB!
> > >> >> >
> > >> >> >
> > >> >> > Suggested solution
> > >> >> > ------------------
> > >> >> > The modification RPC needs to be invoked *outside of the lock
> > >> >> > scope*:
> > >> >> > - lock K
> > >> >> > - modify K
> > >> >> > - unlock K
> > >> >> > - send modification to backup owner(s) // outside the lock scope
> > >> >> >
> > >> >> > The primary owner puts the modification of K into a queue,
> > >> >> > from where a separate thread/task removes it. The thread then
> > >> >> > invokes the PUT(K) on the backup owner(s).
> > >> >> >
> > >> >> > The queue has the modified keys in FIFO order, so the
> > >> >> > modifications arrive at the backup owner(s) in the right
> > >> >> > order.
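> > >> >> >
> > >> >> > A sketch of that queue (all names made up):
> > >> >> >
> > >> >> >    BlockingQueue<Modification> backupQueue =
> > >> >> >       new LinkedBlockingQueue<>();
> > >> >> >
> > >> >> >    // producer: the primary owner, after unlocking K
> > >> >> >    backupQueue.add(new Modification(key, value));
> > >> >> >
> > >> >> >    // consumer: a single replication thread, so the FIFO
> > >> >> >    // order of the modifications is preserved
> > >> >> >    for (;;) {
> > >> >> >       Modification mod = backupQueue.take(); // blocks if empty
> > >> >> >       syncRpcToBackups(mod.key, mod.value);  // PUT on backup(s)
> > >> >> >    }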
> > >> >> >
> > >> >> > This requires that the way GET is implemented changes
> > >> >> > slightly: instead of invoking a GET on all owners of K, we
> > >> >> > only invoke it on the primary owner, then the next-in-line,
> > >> >> > etc.
> > >> >> >
> > >> >> > The reason for this is that the backup owner(s) may not yet
> > >> >> > have received the modification of K.
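> > >> >> >
> > >> >> > Sketch of such a staggered GET (helpers invented):
> > >> >> >
> > >> >> >    // try the primary first, then the next-in-line, instead
> > >> >> >    // of multicasting the GET to all owners
> > >> >> >    for (Address owner : ownersInOrder(K)) {
> > >> >> >       Response rsp = invokeGet(owner, K, timeout); // null on timeout
> > >> >> >       if (rsp != null)
> > >> >> >          return rsp;
> > >> >> >    }
> > >> >> >    throw new TimeoutException("no owner of " + K + " replied");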
> > >> >> >
> > >> >> > This is a better impl anyway (we discussed this before)
> > >> >> > because it generates less traffic; in the normal case, all
> > >> >> > but 1 of the GET requests are unnecessary.
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > Improvement
> > >> >> > -----------
> > >> >> > The above solution can be simplified and even made more
> > >> >> > efficient. Re-using concepts from IRAC [2], we can simply
> > >> >> > store the modified *keys* in the modification queue. The
> > >> >> > modification replication thread removes the key, gets the
> > >> >> > current value and invokes a PUT/REMOVE on the backup
> > >> >> > owner(s).
> > >> >> >
> > >> >> > Even better: a key is only ever added *once*, so if we have
> > >> >> > [5,2,17,3], adding key 2 is a no-op because the processing of
> > >> >> > key 2 (in second position in the queue) will fetch the
> > >> >> > up-to-date value anyway!
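> > >> >> >
> > >> >> > E.g. with a linked set instead of a plain queue (sketch):
> > >> >> >
> > >> >> >    // keys only; insertion order kept, duplicates are no-ops
> > >> >> >    Set<Object> modifiedKeys =
> > >> >> >       Collections.synchronizedSet(new LinkedHashSet<>());
> > >> >> >
> > >> >> >    modifiedKeys.add(2); // [5,2,17,3] -> unchanged, a no-op
> > >> >> >
> > >> >> >    // the replication thread reads the *current* value when
> > >> >> >    // it takes a key out, so it always ships the latest update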
> > >> >> >
> > >> >> >
> > >> >> > Misc
> > >> >> > ----
> > >> >> > - Could we possibly use total order to send the updates in
> > >> >> >   TO? TBD (Pedro?)
> > >> >> >
> > >> >> >
> > >> >> > Thoughts?
> > >> >> >
> > >> >> >
> > >> >> > [1] https://github.com/belaban/IspnPerfTest
> > >> >> > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
> > >> >>
> > >> >> --
> > >> >> Bela Ban, JGroups lead (http://www.jgroups.org)