On 07/30/2014 09:02 AM, Dan Berindei wrote:
> if your proposal is only meant to apply to non-tx caches, you are right,
> you don't have to worry about multiple primary owners... most of the
> time. But when the primary owner changes, then you do have 2 primary
> owners (if the new primary owner installs the new topology first), and
> you do need to coordinate between the 2.
I think it is the same for transactional caches, i.e. the commands wait
for the transaction data from the new topology to be installed. In
non-tx caches, the old primary owner will send the next "sequence
number" to the new primary owner, and only after that will the new
primary owner start to give the orders.
Otherwise, I can implement a total order version for non-tx caches; all
the write serialization would be done in JGroups, and Infinispan would
only have to apply the updates as soon as they are delivered.
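To make the handover concrete, here is a minimal sketch of the idea
(all names are hypothetical; none of this is existing Infinispan API):
the old primary stops ordering writes and ships its next sequence
number to the new primary, which refuses to order anything until the
hand-off has arrived.

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch of the sequence hand-off between primary owners.
    final class PrimarySequencer {
       private final AtomicLong nextSeq = new AtomicLong();
       private volatile boolean active; // true only while we are the primary

       // OLD primary: stop issuing numbers, hand the next one over.
       synchronized long handOff() {
          active = false;
          return nextSeq.get();
       }

       // NEW primary: may only start ordering once the hand-off arrived.
       synchronized void takeOver(long seqFromOldPrimary) {
          nextSeq.set(seqFromOldPrimary);
          active = true;
       }

       // Per-write ordering on the current primary.
       synchronized long order() {
          if (!active)
             throw new IllegalStateException("hand-off not finished");
          return nextSeq.getAndIncrement();
       }
    }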
> Slightly related: we also considered generating a version number on the
> client for consistency when the HotRod client retries after a primary
> owner failure [1]. But the clients can't create a monotonic sequence
> number, so we couldn't use that version number for this.
> [1] https://issues.jboss.org/browse/ISPN-2956
> Also I don't see it as an alternative to TOA, I rather expect it to
> work nicely together: when TOA is enabled you could trust the
> originating sequence source rather than generate a per-entry sequence,
> and in neither case do you need to actually use a Lock.
> I haven't thought about how the sequences would need to interact (if
> they need to), but they seem complementary to resolve different
> aspects, and also both benefit from the same cleanup and basic
> structure.
> We don't acquire locks at all on the backup owners - either in tx or
> non-tx caches. If state transfer is in progress, we use
> ConcurrentHashMap.compute() to store tracking information, which uses a
> synchronized block, so I suppose we do acquire locks. I assume your
> proposal would require a DataContainer.compute() or something similar on
> the backups, to ensure that the version check and the replacement are
> atomic.
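As a rough illustration of that last point, a backup could apply
updates atomically with something like the sketch below - only a
sketch against a plain ConcurrentHashMap, since DataContainer.compute()
is hypothetical here, and VersionedValue is an assumed helper type:

    import java.util.concurrent.ConcurrentHashMap;

    // Assumed helper type: a value tagged with the primary's sequence number.
    record VersionedValue(long version, Object value) {}

    final class BackupStore {
       private final ConcurrentHashMap<Object, VersionedValue> data =
             new ConcurrentHashMap<>();

       // compute() runs the remapping function atomically for the key, so
       // the version check and the replacement cannot interleave with
       // another writer of the same key.
       void apply(Object key, VersionedValue incoming) {
          data.compute(key, (k, current) ->
                (current == null || incoming.version() > current.version())
                      ? incoming
                      : current);
       }
    }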
> I still think TOA does what you want for tx caches. Your proposal
> would only work for non-tx caches, so you couldn't use them together.
> >> Another aspect is that the "user thread" on the primary owner needs
> >> to wait (at least until we improve further) and only proceed after
> >> ACK from backup nodes, but this is better modelled through a state
> >> machine. (Also discussed in Farnborough).
> > To be clear, I don't think keeping the user thread on the originator
> > blocked until we have the write confirmations from all the backups is
> > a problem - a sync operation has to block, and it also serves to
> > rate-limit user operations.
> There are better ways to rate-limit than to make all operations slow;
> we don't need to block a thread, we need to react to the reply from
> the backup owners.
> You still have an inherent rate-limit in the outgoing packet queues:
> if these fill up, then and only then it's nice to introduce some back
> pressure.
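A minimal sketch of that reactive style, using plain CompletableFuture
(sendToBackups and applyLocally are assumed names, not Infinispan API):
the primary registers a callback on the backup ACKs instead of parking
a thread, and back-pressure comes from the transport's outgoing queues.

    import java.util.concurrent.CompletableFuture;

    final class NonBlockingPrimary {
       // Assumed async RPC: completes when all backups have ACKed.
       CompletableFuture<Void> sendToBackups(Object key, Object value) {
          return CompletableFuture.completedFuture(null); // stand-in
       }

       CompletableFuture<Object> put(Object key, Object value) {
          Object previous = applyLocally(key, value); // lock, modify, unlock
          // No thread blocks here: the user's future completes when the
          // backup ACKs arrive.
          return sendToBackups(key, value).thenApply(ignored -> previous);
       }

       private Object applyLocally(Object key, Object value) { return null; }
    }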
> Sorry, you got me confused when you called the thread on the primary
> owner a "user thread". I agree that internal stuff can and should be
> asynchronous, callback based, but the user still has to see a
> synchronous blocking operation.
> > The problem appears when the originator is not the primary owner, and
> > the thread blocking for backup ACKs is from the remote-executor pool
> > (or OOB, when the remote-executor pool is exhausted).
> Not following. I guess this is out of scope now that I clarified the
> proposed solution is only to be applied between primary and backups?
> Yeah, I was just trying to clarify that there is no danger of
> exhausting the remote executor/OOB thread pools when the originator of
> the write command is the primary owner (as it happens in the HotRod
> server).
> >> It's also conceptually linked to:
> >> - https://issues.jboss.org/browse/ISPN-1599
> >> As you need to separate the locks of entries from the effective user
> >> facing lock, at least to implement transactions on top of this model.
> > I think we fixed ISPN-1599 when we changed passivation to use
> > DataContainer.compute(). WDYT Pedro, is there anything else you'd
> > like to do in the scope of ISPN-1599?
> >> I expect this to improve performance in a very significant way, but
> >> it's getting embarrassing that it's still not done; at the next face
> >> to face meeting we should also reserve some time for retrospective
> >> sessions.
> > Implementing the state machine-based interceptor stack may give us a
> > performance boost, but I'm much more certain that it's a very
> > complex, high risk task... and we don't have a stable test suite yet :)
> Cleaning up and removing some complexity such as
> TooManyExecutorsException might help to get it stable, and keep it
> there :)
> BTW it was quite stable for me until you changed the JGroups UDP
> default configuration.
> Do you really use UDP to run the tests? The default is TCP, but maybe
> some tests don't use TestCacheManagerFactory...
> I was just aligning our configs with Bela's recommendations: MERGE3
> instead of MERGE2 and the removal of UFC in TCP stacks. If they cause
> problems on your machine, you should make more noise :)
> Dan
> Sanne
> >> Sanne
> >> On 29 July 2014 15:50, Bela Ban <bban@redhat.com> wrote:
> >>
> >>
> >> > On 29/07/14 16:42, Dan Berindei wrote:
> >> >> Have you tried regular optimistic/pessimistic transactions as well?
> >>
> >> > Yes, in my first impl. but since I'm making only 1 change per
> >> > request, I thought a TX is overkill.
> >>
> >> >> They *should* have fewer issues with the OOB thread pool than
> >> >> non-tx mode, and I'm quite curious how they stack up against TO
> >> >> in such a large cluster.
> >>
> >> > Why would they have fewer issues with the thread pools? AIUI, a TX
> >> > involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not
> >> > using TXs. And we're sync anyway...
> >>
> >> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban@redhat.com> wrote:
> >> >> Following up on my own email, I changed the config to use Pedro's
> >> >> excellent total order implementation:
> >> >>
> >> >> <transaction transactionMode="TRANSACTIONAL"
> >> >>     transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
> >> >>     useEagerLocking="true" eagerLockSingleNode="true">
> >> >>    <recovery enabled="false"/>
> >> >> </transaction>
> >> >>
> >> >> With 100 nodes and 25 requester threads/node, I did NOT run into
> >> >> any locking issues!
> >> >> I could even go up to 200 requester threads/node and the perf was
> >> >> ~7'000-8'000 requests/sec/node. Not too bad!
> >> >> This really validates the concept of lockless total-order
> >> >> dissemination of TXs; for the first time, this has been tested on
> >> >> a large(r) scale (previously only on 25 nodes) and IT WORKS! :-)
> >> >> I still believe we should implement my suggested solution for
> >> >> non-TO configs, but short of configuring thread pools of 1000
> >> >> threads or higher, I hope TO will allow me to finally test a 500
> >> >> node Infinispan cluster!
> >> >> On 29/07/14 15:56, Bela Ban wrote:
> >> >> > Hi guys,
> >> >> >
> >> >> > sorry for the long post, but I do think I ran into an important
> >> >> > problem and we need to fix it ... :-)
> >> >> >
> >> >> > I've spent the last couple of days running the IspnPerfTest [1]
> >> >> > perftest on Google Compute Engine (GCE), and I've run into a
> >> >> > problem with Infinispan. It is a design problem and can be
> >> >> > mitigated by sizing thread pools correctly, but cannot be
> >> >> > eliminated entirely.
> >> >> >
> >> >> > Symptom:
> >> >> > --------
> >> >> > IspnPerfTest has every node in a cluster perform 20'000 requests
> >> >> > on keys in range [1..20000].
> >> >> >
> >> >> > 80% of the requests are reads and 20% writes.
> >> >> >
> >> >> > By default, we have 25 requester threads per node and 100 nodes
> >> >> > in a cluster, so a total of 2500 requester threads.
> >> >> >
> >> >> > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> >> >> >
> >> >> > <namedCache name="clusteredCache">
> >> >> >    <clustering mode="distribution">
> >> >> >       <stateTransfer awaitInitialTransfer="true"/>
> >> >> >       <hash numOwners="2"/>
> >> >> >       <sync replTimeout="20000"/>
> >> >> >    </clustering>
> >> >> >    <transaction transactionMode="NON_TRANSACTIONAL"
> >> >> >                 useEagerLocking="true"
> >> >> >                 eagerLockSingleNode="true" />
> >> >> >    <locking lockAcquisitionTimeout="5000"
> >> >> >             concurrencyLevel="1000"
> >> >> >             isolationLevel="READ_COMMITTED"
> >> >> >             useLockStriping="false" />
> >> >> > </namedCache>
> >> >> >
> >> >> > It has 2 owners, a lock acquisition timeout of 5s and a repl
> >> >> > timeout of 20s. Lock striping is off, so we have 1 lock per key.
> >> >> >
> >> >> > When I run the test, I always get errors like those below:
> >> >> >
> >> >> > org.infinispan.util.concurrent.TimeoutException: Unable to
> >> >> > acquire lock after [10 seconds] on key [19386] for requestor
> >> >> > [Thread[invoker-3,5,main]]! Lock held by
> >> >> > [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> >> >> >
> >> >> > and
> >> >> >
> >> >> > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed
> >> >> > out
> >> >> >
> >> >> > Investigation:
> >> >> > --------------
> >> >> > When I looked at UNICAST3, I saw a lot of missing messages on
> >> >> > the receive side and unacked messages on the send side. This
> >> >> > caused me to look into the (mainly OOB) thread pools and -
> >> >> > voila - maxed out!
> >> >> >
> >> >> > I learned from Pedro that the Infinispan internal thread pool
> >> >> > (with a default of 32 threads) can be configured, so I increased
> >> >> > it to 300 and increased the OOB pools as well.
> >> >> >
> >> >> > This mitigated the problem somewhat, but when I increased the
> >> >> > requester threads to 100, I had the same problem again.
> >> >> > Apparently, the Infinispan internal thread pool uses a rejection
> >> >> > policy of "run" and thus uses the JGroups (OOB) thread when
> >> >> > exhausted.
> >> >> >
> >> >> > I learned (from Pedro and Mircea) that GETs and PUTs work as
> >> >> > follows in dist-sync / 2 owners:
> >> >> > - GETs are sent to the primary and backup owners and the first
> >> >> >   response received is returned to the caller. No locks are
> >> >> >   acquired, so GETs shouldn't cause problems.
> >> >> > - A PUT(K) is sent to the primary owner of K
> >> >> > - The primary owner
> >> >> >   (1) locks K
> >> >> >   (2) updates the backup owner synchronously *while holding the
> >> >> >       lock*
> >> >> >   (3) releases the lock
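For illustration, the write path described above boils down to
something like this sketch - hypothetical names, not the actual
Infinispan code, with a single lock standing in for the per-key lock:

    import java.util.concurrent.locks.ReentrantLock;

    final class PrimaryOwnerWritePath {
       private final ReentrantLock lockForK = new ReentrantLock();

       void put(Object k, Object v) {
          lockForK.lock();              // (1) lock K
          try {
             writeLocally(k, v);        // modify K
             replicateToBackups(k, v);  // (2) sync RPC *while holding the lock*
          } finally {
             lockForK.unlock();         // (3) release the lock
          }
       }

       private void writeLocally(Object k, Object v) {}
       private void replicateToBackups(Object k, Object v) {} // blocking RPC
    }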
> >> >> > Hypothesis
> >> >> > ----------
> >> >> > (2) above is done while holding the lock. The sync update of the
> >> >> > backup owner is done with the lock held to guarantee that the
> >> >> > primary and backup owner of K have the same values for K.
> >> >> >
> >> >> > However, the sync update *inside the lock scope* slows things
> >> >> > down (can it also lead to deadlocks?); there's the risk that the
> >> >> > request is dropped due to a full incoming thread pool, or that
> >> >> > the response is not received because of the same, or that the
> >> >> > locking at the backup owner blocks for some time.
> >> >> >
> >> >> > If we have many threads modifying the same key, then we have a
> >> >> > backlog of locking work against that key. Say we have 100
> >> >> > requester threads and a 100 node cluster. This means that we
> >> >> > have 10'000 threads accessing keys; with 2'000 writers there's a
> >> >> > big chance that some writers pick the same key at the same time.
> >> >> >
> >> >> > For example, if we have 100 threads accessing key K and it
> >> >> > takes 3ms to replicate K to the backup owner, then the last of
> >> >> > the 100 threads waits ~300ms before it gets a chance to lock K
> >> >> > on the primary owner and replicate it as well.
> >> >> >
> >> >> > Just a small hiccup in sending the PUT to the primary owner,
> >> >> > sending the modification to the backup owner, waiting for the
> >> >> > response, or GC, and the delay will quickly become bigger.
> >> >> >
> >> >> > Verification
> >> >> > ------------
> >> >> > To verify the above, I set numOwners to 1. This means that the
> >> >> > primary owner of K does *not* send the modification to the
> >> >> > backup owner, it only locks K, modifies K and unlocks K again.
> >> >> >
> >> >> > I ran the IspnPerfTest again on 100 nodes, with 25 requesters,
> >> >> > and NO PROBLEM!
> >> >> >
> >> >> > I then increased the requesters to 100, 150 and 200 and the
> >> >> > test completed flawlessly! Performance was around *40'000
> >> >> > requests per node per sec* on 4-core boxes!
> >> >> >
> >> >> > Root cause
> >> >> > ----------
> >> >> > *******************
> >> >> > The root cause is the sync RPC of K to the backup owner(s) of K
> >> >> > while the primary owner holds the lock for K.
> >> >> > *******************
> >> >> >
> >> >> > This causes a backlog of threads waiting for the lock and that
> >> >> > backlog can grow to exhaust the thread pools. First the
> >> >> > Infinispan internal thread pool, then the JGroups OOB thread
> >> >> > pool. The latter causes retransmissions to get dropped, which
> >> >> > compounds the problem...
> >> >> >
> >> >> > Goal
> >> >> > ----
> >> >> > The goal is to make sure that primary and backup owner(s) of K
> >> >> > have the same value for K.
> >> >> >
> >> >> > Simply sending the modification to the backup owner(s)
> >> >> > asynchronously won't guarantee this, as modification messages
> >> >> > might get processed out of order as they're OOB!
> >> >> >
> >> >> > Suggested solution
> >> >> > ------------------
> >> >> > The modification RPC needs to be invoked *outside of the lock
> >> >> > scope*:
> >> >> > - lock K
> >> >> > - modify K
> >> >> > - unlock K
> >> >> > - send modification to backup owner(s) // outside the lock scope
> >> >> >
> >> >> > The primary owner puts the modification of K into a queue from
> >> >> > where a separate thread/task removes it. The thread then invokes
> >> >> > the PUT(K) on the backup owner(s).
> >> >> >
> >> >> > The queue has the modified keys in FIFO order, so the
> >> >> > modifications arrive at the backup owner(s) in the right order.
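A minimal sketch of that queue-plus-replication-thread design
(hypothetical names, not actual Infinispan code):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    final class BackupReplicator implements Runnable {
       record Modification(Object key, Object value) {}

       private final BlockingQueue<Modification> queue =
             new LinkedBlockingQueue<>();

       // Called by the write path *after* the lock on K was released.
       void enqueue(Object key, Object value) {
          queue.add(new Modification(key, value));
       }

       @Override
       public void run() {
          try {
             while (!Thread.currentThread().isInterrupted()) {
                Modification m = queue.take();    // FIFO order is preserved
                putOnBackups(m.key(), m.value()); // assumed sync RPC to backups
             }
          } catch (InterruptedException e) {
             Thread.currentThread().interrupt();
          }
       }

       private void putOnBackups(Object key, Object value) {}
    }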
> >> >> > This requires that the way GET is implemented changes slightly:
> >> >> > instead of invoking a GET on all owners of K, we only invoke it
> >> >> > on the primary owner, then the next-in-line etc.
> >> >> >
> >> >> > The reason for this is that the backup owner(s) may not yet
> >> >> > have received the modification of K.
> >> >> >
> >> >> > This is a better impl anyway (we discussed this before) because
> >> >> > it generates less traffic; in the normal case, all but 1 of the
> >> >> > GET requests are unnecessary.
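A sketch of such a staggered GET (remoteGet is an assumed stand-in for
the real async RPC): ask the primary owner first and fall back to the
next owner only on timeout or failure.

    import java.util.List;
    import java.util.concurrent.*;

    final class StaggeredGet {
       Object get(Object key, List<String> ownersInOrder, long timeoutMs)
             throws InterruptedException {
          for (String owner : ownersInOrder) { // primary first, then next-in-line
             try {
                return remoteGet(owner, key).get(timeoutMs, TimeUnit.MILLISECONDS);
             } catch (TimeoutException | ExecutionException e) {
                // try the next owner
             }
          }
          return null; // all owners failed
       }

       private CompletableFuture<Object> remoteGet(String owner, Object key) {
          return CompletableFuture.completedFuture(null); // stand-in for the RPC
       }
    }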
> >> >> > Improvement
> >> >> > -----------
> >> >> > The above solution can be simplified and even made more
> >> >> > efficient. Re-using concepts from IRAC [2], we can simply store
> >> >> > the modified *keys* in the modification queue. The modification
> >> >> > replication thread removes the key, gets the current value and
> >> >> > invokes a PUT/REMOVE on the backup owner(s).
> >> >> >
> >> >> > Even better: a key is only ever added *once*, so if we have
> >> >> > [5,2,17,3], adding key 2 is a no-op because the processing of
> >> >> > key 2 (in second position in the queue) will fetch the
> >> >> > up-to-date value anyway!
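A sketch of a queue with exactly that dedup property (hypothetical,
not IRAC's actual implementation): a LinkedHashSet keeps FIFO order
and makes re-adding a pending key a no-op.

    import java.util.LinkedHashSet;

    final class DedupKeyQueue {
       private final LinkedHashSet<Object> keys = new LinkedHashSet<>();

       synchronized void add(Object key) {
          keys.add(key); // no-op if the key is already pending
          notifyAll();
       }

       // The replication thread takes the oldest pending key and then
       // reads the *current* value before replicating it.
       synchronized Object take() throws InterruptedException {
          while (keys.isEmpty())
             wait();
          Object key = keys.iterator().next();
          keys.remove(key);
          return key;
       }
    }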
> >> >> > Misc
> >> >> > ----
> >> >> > - Could we possibly use total order to send the updates in TO?
> >> >> >   TBD (Pedro?)
> >> >> >
> >> >> > Thoughts?
> >> >> >
> >> >> > [1] https://github.com/belaban/IspnPerfTest
> >> >> > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-...
> >> >>
> >> >> --
> >> >> Bela Ban, JGroups lead (http://www.jgroups.org)
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev