On Wed, Jul 30, 2014 at 12:35 AM, Sanne Grinovero <sanne@infinispan.org> wrote:
On 29 July 2014 22:14, Dan Berindei <dan.berindei@gmail.com> wrote:
>
> On Tue, Jul 29, 2014 at 9:06 PM, Sanne Grinovero <sanne@infinispan.org>
> wrote:
>>
>> This is a nasty problem and I also feel passionately we need to get
>> rid of it ASAP.
>> I did have the same problems many times, and we discussed this also in
>> Farnborough; AFAIR Dan and Pedro had some excellent ideas to fix this.
>>
>> You don't need TO, and you don't need to lock at all as long as you
>> guarantee that the backup owners receive the update with a monotonic
>> sequence number attached to it;
>> all that backup owners need to do is ignore incoming commands which
>> are outdated.
>
>
> This is more or less what TOA does - assign a monotonic sequence number to
> txs, and only apply them after all the previous txs in the sequence have
> been applied. The problem is that obtaining such a monotonic sequence when
> there are multiple originators and multiple primary owners also requires
> some extra RPCs.

Let's not mix this up with requirements for other areas.

The strategy I've proposed is only meant to be applied to the communication
from the primary owner to its backups:
the value to be written is well known, as it's the primary owner which
defines it unilaterally (for example when an atomic replacement has to be
computed),
and there is no need for extra RPCs, as the sequence is not tied to a
group of changes but only to the specific entry.

There is no need for consensus across owners, nor for a central source
of sequences.

Just to make sure I understand correctly: your proposal is that in non-tx caches, the primary owner should generate some sort of version number while holding the entry lock, and replicate the write to the backup owners synchronously, but without holding the lock? Then the backup owners would check the version of the entry and only apply a newer write?
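
To put it in pseudo-Java, here's a minimal sketch of how I picture the primary-owner side (all names are made up, this is not actual Infinispan API): the version is generated and the local write applied while holding the entry lock, and the replication to the backups happens only after the lock has been released.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not actual Infinispan API.
class PrimaryOwnerWrite {

    record Versioned(Object value, long version) {}

    // Stands in for the async RPC to the backup owners.
    interface BackupReplicator {
        CompletableFuture<Void> replicate(Object key, Object value, long version);
    }

    private final Map<Object, AtomicLong> versionCounters = new ConcurrentHashMap<>();
    private final Map<Object, Versioned> store = new ConcurrentHashMap<>();
    private final BackupReplicator backups;

    PrimaryOwnerWrite(BackupReplicator backups) {
        this.backups = backups;
    }

    CompletableFuture<Void> put(Object key, Object value) {
        AtomicLong counter = versionCounters.computeIfAbsent(key, k -> new AtomicLong());
        long version;
        synchronized (counter) {                 // stands in for the per-key entry lock
            version = counter.incrementAndGet(); // monotonic, per entry, chosen by the primary
            store.put(key, new Versioned(value, version));
        }                                        // lock released *before* talking to the backups
        // Replication happens outside the lock scope; backups discard anything
        // older than what they have already applied.
        return backups.replicate(key, value, version);
    }
}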

If your proposal is only meant to apply to non-tx caches, you are right that you don't have to worry about multiple primary owners... most of the time. But when the primary owner changes, you do have 2 primary owners for a while (if the new primary owner installs the new topology first), and you do need to coordinate between the two.

Slightly related: we also considered generating a version number on the client for consistency when the HotRod client retries after a primary owner failure [1]. But the clients can't create a monotonic sequence number, so we couldn't use that version number for this.

[1] https://issues.jboss.org/browse/ISPN-2956


Also, I don't see it as an alternative to TOA; I rather expect the two to
work nicely together: when TOA is enabled you could trust the
originating sequence source rather than generate a per-entry sequence,
and in neither case do you need to actually use a Lock.
I haven't thought about how the sequences would need to interact (if at
all), but they seem complementary, resolving different aspects, and
both benefit from the same cleanup and basic structure.

We don't acquire locks at all on the backup owners - either in tx or non-tx caches. If state transfer is in progress, we use ConcurrentHashMap.compute() to store tracking information, which uses a synchronized block, so I suppose we do acquire locks. I assume your proposal would require a DataContainer.compute() or something similar on the backups, to ensure that the version check and the replacement are atomic.
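
On the backup side I imagine something along these lines (again just a sketch with made-up names, using ConcurrentHashMap.compute() as a stand-in for the hypothetical DataContainer.compute()):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the backup-owner side, not actual Infinispan API.
class BackupOwnerApply {

    record Versioned(Object value, long version) {}

    private final Map<Object, Versioned> store = new ConcurrentHashMap<>();

    // Returns true if the update was applied, false if it was stale and ignored.
    boolean applyFromPrimary(Object key, Object value, long version) {
        boolean[] applied = new boolean[1];
        // compute() makes the version check and the replacement atomic for this key.
        store.compute(key, (k, current) -> {
            if (current != null && current.version() >= version) {
                return current;   // outdated command: keep what we already have
            }
            applied[0] = true;
            return new Versioned(value, version);
        });
        return applied[0];
    }
}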

I still think TOA does what you want for tx caches. Your proposal would only work for non-tx caches, so you couldn't use them together.


>> Another aspect is that the "user thread" on the primary owner needs to
>> wait (at least until we improve further) and only proceed after ACK
>> from backup nodes, but this is better modelled through a state
>> machine. (Also discussed in Farnborough).
>
>
> To be clear, I don't think keeping the user thread on the originator blocked
> until we have the write confirmations from all the backups is a problem - a
> sync operation has to block, and it also serves to rate-limit user
> operations.


There are better ways to rate-limit than making all operations slow;
we don't need to block a thread, we need to react to the replies from
the backup owners.
You still have an inherent rate limit in the outgoing packet queues:
if these fill up, then and only then is it appropriate to introduce some
back pressure.

Sorry, you got me confused when you called the thread on the primary owner a "user thread". I agree that the internal stuff can and should be asynchronous and callback-based, but the user still has to see a synchronous blocking operation.
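
Roughly this split, as a sketch with hypothetical names: the internal pipeline completes a future from the transport callback, and the only blocking point is the user-facing method, bounded by the replication timeout.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a blocking user-facing put() on top of a fully
// asynchronous internal pipeline. Not actual Infinispan API.
class SyncFacade {

    interface AsyncCache {
        // Completed from the transport callback when the backup ACKs arrive.
        CompletableFuture<Void> putAsync(Object key, Object value);
    }

    private final AsyncCache asyncCache;
    private final long replTimeoutMillis;

    SyncFacade(AsyncCache asyncCache, long replTimeoutMillis) {
        this.asyncCache = asyncCache;
        this.replTimeoutMillis = replTimeoutMillis;
    }

    void put(Object key, Object value) {
        CompletableFuture<Void> ack = asyncCache.putAsync(key, value);
        try {
            // Only the user thread blocks here; no OOB / remote-executor thread
            // is parked waiting for the backup ACKs.
            ack.get(replTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new RuntimeException("Replication to the backups timed out", e);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}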


> The problem appears when the originator is not the primary owner, and the
> thread blocking for backup ACKs is from the remote-executor pool (or OOB,
> when the remote-executor pool is exhausted).

Not following. I guess this is out of scope now that I clarified the
proposed solution is only to be applied between primary and backups?


Yeah, I was just trying to clarify that there is no danger of exhausting the remote executor/OOB thread pools when the originator of the write command is the primary owner (as it happens in the HotRod server).
 

>>
>> It's also conceptually linked to:
>>  - https://issues.jboss.org/browse/ISPN-1599
>> As you need to separate the locks of entries from the effective user
>> facing lock, at least to implement transactions on top of this model.
>
>
> I think we fixed ISPN-1599 when we changed passivation to use
> DataContainer.compute(). WDYT Pedro, is there anything else you'd like to do
> in the scope of ISPN-1599?
>
>>
>> I expect this to improve performance in a very significant way, but
>> it's getting embarrassing that it's still not done; at the next face
>> to face meeting we should also reserve some time for retrospective
>> sessions.
>
>
> Implementing the state machine-based interceptor stack may give us a
> performance boost, but I'm much more certain that it's a very complex, high
> risk task... and we don't have a stable test suite yet :)

Cleaning up and removing some complexity such as
TooManyExecutorsException might help to get it stable, and keep it
there :)
BTW it was quite stable for me until you changed the JGroups UDP
default configuration.


Do you really use UDP to run the tests? The default is TCP, but maybe some tests don't use TestCacheManagerFactory...

I was just aligning our configs with Bela's recommendations: MERGE3 instead of MERGE2, and the removal of UFC in TCP stacks. If they cause problems on your machine, you should make more noise :)

Dan

 
Sanne

>
>
>>
>>
>> Sanne
>>
>> On 29 July 2014 15:50, Bela Ban <bban@redhat.com> wrote:
>> >
>> >
>> > On 29/07/14 16:42, Dan Berindei wrote:
>> >> Have you tried regular optimistic/pessimistic transactions as well?
>> >
>> > Yes, in my first impl., but since I'm making only 1 change per request, I
>> > thought a TX was overkill.
>> >
>> >> They *should* have less issues with the OOB thread pool than non-tx
>> >> mode, and
>> >> I'm quite curious how they stack against TO in such a large cluster.
>> >
>> > Why would they have fewer issues with the thread pools ? AIUI, a TX
>> > involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not using
>> > TXs. And we're sync anyway...
>> >
>> >
>> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <bban@redhat.com> wrote:
>> >>
>> >>     Following up on my own email, I changed the config to use Pedro's
>> >>     excellent total order implementation:
>> >>
>> >>     <transaction transactionMode="TRANSACTIONAL"
>> >>     transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
>> >>     useEagerLocking="true" eagerLockSingleNode="true">
>> >>                   <recovery enabled="false"/>
>> >>
>> >>     With 100 nodes and 25 requester threads/node, I did NOT run into
>> >> any
>> >>     locking issues !
>> >>
>> >>     I could even go up to 200 requester threads/node and the perf was ~
>> >>     7'000-8'000 requests/sec/node. Not too bad !
>> >>
>> >>     This really validates the concept of lockless total-order
>> >> dissemination
>> >>     of TXs; for the first time, this has been tested on a large(r)
>> >> scale
>> >>     (previously only on 25 nodes) and IT WORKS ! :-)
>> >>
>> >>     I still believe we should implement my suggested solution for
>> >> non-TO
>> >>     configs, but short of configuring thread pools of 1000 threads or
>> >>     higher, I hope TO will allow me to finally test a 500 node
>> >> Infinispan
>> >>     cluster !
>> >>
>> >>
>> >>     On 29/07/14 15:56, Bela Ban wrote:
>> >>      > Hi guys,
>> >>      >
>> >>      > sorry for the long post, but I do think I ran into an important
>> >>     problem
>> >>      > and we need to fix it ... :-)
>> >>      >
>> >>      > I've spent the last couple of days running the IspnPerfTest [1]
>> >>     perftest
>> >>      > on Google Compute Engine (GCE), and I've run into a problem with
>> >>      > Infinispan. It is a design problem and can be mitigated by
>> >> sizing
>> >>     thread
>> >>      > pools correctly, but cannot be eliminated entirely.
>> >>      >
>> >>      >
>> >>      > Symptom:
>> >>      > --------
>> >>      > IspnPerfTest has every node in a cluster perform 20'000 requests
>> >>     on keys
>> >>      > in range [1..20000].
>> >>      >
>> >>      > 80% of the requests are reads and 20% writes.
>> >>      >
>> >>      > By default, we have 25 requester threads per node and 100 nodes
>> >> in a
>> >>      > cluster, so a total of 2500 requester threads.
>> >>      >
>> >>      > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
>> >>      >
>> >>      > <namedCache name="clusteredCache">
>> >>      >        <clustering mode="distribution">
>> >>      >            <stateTransfer awaitInitialTransfer="true"/>
>> >>      >            <hash numOwners="2"/>
>> >>      >            <sync replTimeout="20000"/>
>> >>      >        </clustering>
>> >>      >
>> >>      >        <transaction transactionMode="NON_TRANSACTIONAL"
>> >>      > useEagerLocking="true"
>> >>      >             eagerLockSingleNode="true"  />
>> >>      >        <locking lockAcquisitionTimeout="5000"
>> >> concurrencyLevel="1000"
>> >>      >                 isolationLevel="READ_COMMITTED"
>> >>     useLockStriping="false" />
>> >>      > </namedCache>
>> >>      >
>> >>      > It has 2 owners, a lock acquisition timeout of 5s and a repl
>> >>     timeout of
>> >>      > 20s. Lock striping is off, so we have 1 lock per key.
>> >>      >
>> >>      > When I run the test, I always get errors like those below:
>> >>      >
>> >>      > org.infinispan.util.concurrent.TimeoutException: Unable to
>> >>     acquire lock
>> >>      > after [10 seconds] on key [19386] for requestor
>> >>     [Thread[invoker-3,5,main]]!
>> >>      > Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
>> >>      >
>> >>      > and
>> >>      >
>> >>      > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed
>> >> out
>> >>      >
>> >>      >
>> >>      > Investigation:
>> >>      > ------------
>> >>      > When I looked at UNICAST3, I saw a lot of missing messages on
>> >> the
>> >>      > receive side and unacked messages on the send side. This caused
>> >> me to
>> >>      > look into the (mainly OOB) thread pools and - voila - maxed out
>> >> !
>> >>      >
>> >>      > I learned from Pedro that the Infinispan internal thread pool
>> >> (with a
>> >>      > default of 32 threads) can be configured, so I increased it to
>> >>     300 and
>> >>      > increased the OOB pools as well.
>> >>      >
>> >>      > This mitigated the problem somewhat, but when I increased the
>> >>     requester
>> >>      > threads to 100, I had the same problem again. Apparently, the
>> >>     Infinispan
>> >>      > internal thread pool uses a rejection policy of "run" and thus
>> >>     uses the
>> >>      > JGroups (OOB) thread when exhausted.
>> >>      >
>> >>      > I learned (from Pedro and Mircea) that GETs and PUTs work as
>> >>     follows in
>> >>      > dist-sync / 2 owners:
>> >>      > - GETs are sent to the primary and backup owners and the first
>> >>     response
>> >>      > received is returned to the caller. No locks are acquired, so
>> >> GETs
>> >>      > shouldn't cause problems.
>> >>      >
>> >>      > - A PUT(K) is sent to the primary owner of K
>> >>      > - The primary owner
>> >>      >        (1) locks K
>> >>      >        (2) updates the backup owner synchronously *while holding
>> >>     the lock*
>> >>      >        (3) releases the lock
>> >>      >
>> >>      >
>> >>      > Hypothesis
>> >>      > ----------
>> >>      > (2) above is done while holding the lock. The sync update of the
>> >>     backup
>> >>      > owner is done with the lock held to guarantee that the primary
>> >> and
>> >>      > backup owner of K have the same values for K.
>> >>      >
>> >>      > However, the sync update *inside the lock scope* slows things
>> >>     down (can
>> >>      > it also lead to deadlocks?); there's the risk that the request
>> >> is
>> >>      > dropped due to a full incoming thread pool, or that the response
>> >>     is not
>> >>      > received because of the same, or that the locking at the backup
>> >> owner
>> >>      > blocks for some time.
>> >>      >
>> >>      > If we have many threads modifying the same key, then we have a
>> >>     backlog
>> >>      > of locking work against that key. Say we have 100 requester
>> >>     threads and
>> >>      > a 100 node cluster. This means that we have 10'000 threads
>> >> accessing
>> >>      > keys; with 2'000 writers there's a big chance that some writers
>> >>     pick the
>> >>      > same key at the same time.
>> >>      >
>> >>      > For example, if we have 100 threads accessing key K and it takes
>> >>     3ms to
>> >>      > replicate K to the backup owner, then the last of the 100
>> >> threads
>> >>     waits
>> >>      > ~300ms before it gets a chance to lock K on the primary owner
>> >> and
>> >>      > replicate it as well.
>> >>      >
>> >>      > Just a small hiccup in sending the PUT to the primary owner,
>> >>     sending the
>> >>     modification to the backup owner, waiting for the response, or
>> >>     GC, and
>> >>      > the delay will quickly become bigger.
>> >>      >
>> >>      >
>> >>      > Verification
>> >>      > ----------
>> >>      > To verify the above, I set numOwners to 1. This means that the
>> >>     primary
>> >>      > owner of K does *not* send the modification to the backup owner,
>> >>     it only
>> >>      > locks K, modifies K and unlocks K again.
>> >>      >
>> >>      > I ran the IspnPerfTest again on 100 nodes, with 25 requesters,
>> >> and NO
>> >>      > PROBLEM !
>> >>      >
>> >>      > I then increased the requesters to 100, 150 and 200 and the test
>> >>      > completed flawlessly ! Performance was around *40'000 requests
>> >>     per node
>> >>      > per sec* on 4-core boxes !
>> >>      >
>> >>      >
>> >>      > Root cause
>> >>      > ---------
>> >>      > *******************
>> >>      > The root cause is the sync RPC of K to the backup owner(s) of K
>> >> while
>> >>      > the primary owner holds the lock for K.
>> >>      > *******************
>> >>      >
>> >>      > This causes a backlog of threads waiting for the lock and that
>> >>     backlog
>> >>      > can grow to exhaust the thread pools. First the Infinispan
>> >> internal
>> >>      > thread pool, then the JGroups OOB thread pool. The latter causes
>> >>      > retransmissions to get dropped, which compounds the problem...
>> >>      >
>> >>      >
>> >>      > Goal
>> >>      > ----
>> >>      > The goal is to make sure that primary and backup owner(s) of K
>> >>     have the
>> >>      > same value for K.
>> >>      >
>> >>      > Simply sending the modification to the backup owner(s)
>> >> asynchronously
>> >>      > won't guarantee this, as modification messages might get
>> >>     processed out
>> >>      > of order as they're OOB !
>> >>      >
>> >>      >
>> >>      > Suggested solution
>> >>      > ----------------
>> >>      > The modification RPC needs to be invoked *outside of the lock
>> >> scope*:
>> >>      > - lock K
>> >>      > - modify K
>> >>      > - unlock K
>> >>      > - send modification to backup owner(s) // outside the lock scope
>> >>      >
>> >>      > The primary owner puts the modification of K into a queue from
>> >>     where a
>> >>      > separate thread/task removes it. The thread then invokes the
>> >>     PUT(K) on
>> >>      > the backup owner(s).
>> >>      >
>> >>      > The queue has the modified keys in FIFO order, so the
>> >> modifications
>> >>      > arrive at the backup owner(s) in the right order.
>> >>      >
>> >>      > This requires that the way GET is implemented changes slightly:
>> >>     instead
>> >>      > of invoking a GET on all owners of K, we only invoke it on the
>> >>     primary
>> >>      > owner, then the next-in-line etc.
>> >>      >
>> >>      > The reason for this is that the backup owner(s) may not yet have
>> >>      > received the modification of K.
>> >>      >
>> >>      > This is a better impl anyway (we discussed this before) because
>> >> it
>> >>      > generates less traffic; in the normal case, all but one of the GET
>> >>      > requests are unnecessary.
>> >>      >
>> >>      >
>> >>      >
>> >>      > Improvement
>> >>      > -----------
>> >>      > The above solution can be simplified and even made more
>> >> efficient.
>> >>      > Re-using concepts from IRAC [2], we can simply store the
>> >> modified
>> >>     *keys*
>> >>      > in the modification queue. The modification replication thread
>> >>     removes
>> >>      > the key, gets the current value and invokes a PUT/REMOVE on the
>> >>     backup
>> >>      > owner(s).
>> >>      >
>> >>      > Even better: a key is only ever added *once*, so if we have
>> >>     [5,2,17,3],
>> >>      > adding key 2 is a no-op because the processing of key 2 (in
>> >> second
>> >>      > position in the queue) will fetch the up-to-date value anyway !
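
A rough sketch of that keys-only modification queue (made-up names, not the actual IRAC code): a key is queued at most once, and the replication task reads the current value when it dequeues the key, so later writes to the same key are coalesced automatically.

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the keys-only modification queue, not the actual IRAC code.
class BackupReplicationQueue implements Runnable {

    interface BackupSender {
        void putOnBackups(Object key, Object currentValue);
        void removeOnBackups(Object key);
    }

    private final Set<Object> dirtyKeys = new LinkedHashSet<>(); // FIFO, each key at most once
    private final Map<Object, Object> localStore;                // the primary owner's own copy
    private final BackupSender sender;

    BackupReplicationQueue(Map<Object, Object> localStore, BackupSender sender) {
        this.localStore = localStore;
        this.sender = sender;
    }

    // Called by the primary owner after it has updated its own copy, outside the lock scope.
    synchronized void keyModified(Object key) {
        dirtyKeys.add(key);   // adding a key that is already queued is a no-op
        notifyAll();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Object key;
            synchronized (this) {
                while (dirtyKeys.isEmpty()) {
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                key = dirtyKeys.iterator().next();
                dirtyKeys.remove(key);
            }
            // Read the *current* value, so later writes to the same key are coalesced.
            Object value = localStore.get(key);
            if (value == null) {
                sender.removeOnBackups(key);
            } else {
                sender.putOnBackups(key, value);
            }
        }
    }
}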
>> >>      >
>> >>      >
>> >>      > Misc
>> >>      > ----
>> >>      > - Could we possibly use total order to send the updates in TO ?
>> >>     TBD (Pedro?)
>> >>      >
>> >>      >
>> >>      > Thoughts ?
>> >>      >
>> >>      >
>> >>      > [1] https://github.com/belaban/IspnPerfTest
>> >>      > [2]
>> >>      >
>> >>
>> >> https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
>> >>      >
>> >>
>> >>     --
>> >>     Bela Ban, JGroups lead (http://www.jgroups.org)
>> >
>> > --
>> > Bela Ban, JGroups lead (http://www.jgroups.org)
>
>
>
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev