[infinispan-dev] Threadpools in a large cluster

Dan Berindei dan.berindei at gmail.com
Mon Feb 4 02:46:54 EST 2013


On Sun, Feb 3, 2013 at 1:23 PM, Bela Ban <bban at redhat.com> wrote:

> A new thread pool owned by Infinispan is certainly something desirable,
> as discussed in Palma, but I think it wouldn't solve the issue Radim ran
> into, namely threads being used despite the fact that they only wait for
> another blocking RPC to finish.
>
>
IMO the fact that threads are blocked waiting for an RPC to return is not a
big deal in itself. The real problem is when all the OOB threads are used up,
causing a deadlock: the existing OOB threads are blocked waiting for RPC
responses, and those RPC responses can't be delivered until an OOB thread is
freed.
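
To make the shape of that deadlock concrete, here's a minimal, self-contained
toy (not Infinispan or JGroups code, just a tiny fixed-size pool standing in
for the OOB pool): every worker blocks on a result that can only be delivered
by another task on the same pool, so the pool wedges until the waits time out.

import java.util.concurrent.*;

public class OOBPoolDeadlockSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the OOB pool, deliberately tiny.
        ExecutorService oobPool = Executors.newFixedThreadPool(2);
        CompletableFuture<String> rpcResponse = new CompletableFuture<>();

        // Both "OOB threads" handle a request that blocks on a remote response.
        for (int i = 0; i < 2; i++) {
            oobPool.submit(() -> rpcResponse.get(10, TimeUnit.SECONDS));
        }

        // The response arrives, but delivering it needs a free pool thread,
        // so it just sits in the queue behind the blocked tasks.
        Callable<Boolean> deliverResponse = () -> rpcResponse.complete("OK");
        oobPool.submit(deliverResponse);

        oobPool.shutdown();
    }
}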



> If we made the JGroups thread return immediately by transferring control
> to an Infinispan thread, then we'd simply move the issue from the former
> to the latter pool. Eventually, the Infinispan pool would run out of
> threads.
>
>
Yeah, but JGroups would still be able to process RPC responses, and by
doing that it would free up some of the OOB threads.

For transactional caches there's an additional benefit: if commit/rollback
commands were handled directly on the OOB pool then neither thread pool
would have dependencies between tasks, so we could enable queueing for the
OOB pool and the Infinispan pool without causing deadlocks.
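
Roughly the hand-off I have in mind (hypothetical names, not the actual
Infinispan handler code): the OOB thread only enqueues the command and
returns, so it stays free to deliver RPC responses, while a dedicated
Infinispan pool does the potentially blocking work and sends the reply itself.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CommandHandOffSketch {
    // Separate pool owned by Infinispan, sized independently of the OOB pool.
    private final ExecutorService infinispanPool = Executors.newFixedThreadPool(100);

    // Called on a JGroups OOB thread.
    void handleCommand(Object command, Object origin) {
        if (isCommitOrRollback(command)) {
            // As suggested above, commit/rollback could stay on the OOB
            // thread, since they don't wait on the Infinispan pool.
            sendResponse(origin, invokeInterceptorChain(command));
            return;
        }
        infinispanPool.execute(() -> {
            Object result = invokeInterceptorChain(command); // may block on nested RPCs
            sendResponse(origin, result);                    // reply sent from the Infinispan thread
        });
        // The OOB thread returns here immediately and can deliver responses.
    }

    // Placeholders for illustration only.
    boolean isCommitOrRollback(Object command) { return false; }
    Object invokeInterceptorChain(Object command) { return null; }
    void sendResponse(Object origin, Object result) {}
}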



> Coming back to the specific problem Radim ran into: the forwarding of a
> PUT doesn't hold any locks, so your argument below wouldn't hold.
> However, of course this is only one specific scenario, and you're
> probably right that we'd have to consider the more general case of a
> thread holding locks...
>
>
Actually, NonTransactionalLockingInterceptor acquires a lock on the key
before the RPC is executed (from NonTxConcurrentDistributionInterceptor),
and keeps that lock for the entire duration of the RPC.

We make other RPCs while holding the key lock as well, particularly to
invalidate the L1 entries.
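
In other words, the flow on the primary owner looks roughly like this (a
simplified model, not the real interceptor code; names are made up):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class PrimaryOwnerWriteSketch {
    private final ReentrantLock keyLock = new ReentrantLock();

    Object put(Object key, Object value) throws InterruptedException {
        // Lock acquired before the forwarding RPC...
        if (!keyLock.tryLock(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("Could not acquire lock for " + key);
        }
        try {
            Object previous = applyLocally(key, value);
            // ...and held for the whole synchronous round-trip to the backups
            // (and for any L1 invalidation RPCs as well).
            forwardToBackupOwners(key, value);
            return previous;
        } finally {
            keyLock.unlock(); // released only after the RPCs return
        }
    }

    // Placeholders for illustration only.
    Object applyLocally(Object key, Object value) { return null; }
    void forwardToBackupOwners(Object key, Object value) {}
}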



> All said, I believe it would still be worthwhile looking into a more
> non-blocking way of invoking RPCs, that doesn't occupy threads which
> essentially only wait on IO (network traffic)... A simple state machine
> approach could be the solution to this...
>
>
Switching to a state machine approach would require rethinking and
rewriting all our interceptors, and I'm pretty sure the code would get more
complex and harder to debug (to say nothing of interpreting the logs).
Are you sure it would bring enough benefits to make it worthwhile?
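
For comparison, this is the kind of change every interceptor would need
(made-up names, just to show the shape, not our actual API): today an
interceptor keeps its local state on the Java stack across the nested
invocation; in a state-machine/callback style the "after" half has to be
split out and registered as a continuation.

import java.util.concurrent.CompletableFuture;

class InterceptorStyleSketch {
    // Today: local state lives on the thread's stack while the rest of the
    // chain (including remote calls) runs, so the thread is parked.
    Object blockingStyle(Object command) {
        Object local = prepare(command);
        Object rv = invokeNextBlocking(command); // thread blocked here during RPCs
        return finish(local, rv);
    }

    // State-machine style: the same logic split into "before" and "after"
    // stages, with the "after" stage attached as a callback that runs on
    // whichever thread eventually delivers the response.
    CompletableFuture<Object> callbackStyle(Object command) {
        Object local = prepare(command);
        return invokeNextAsync(command).thenApply(rv -> finish(local, rv));
    }

    // Placeholders for illustration only.
    Object prepare(Object command) { return null; }
    Object finish(Object local, Object rv) { return rv; }
    Object invokeNextBlocking(Object command) { return null; }
    CompletableFuture<Object> invokeNextAsync(Object command) {
        return CompletableFuture.completedFuture(null);
    }
}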



> On 2/1/13 10:54 AM, Dan Berindei wrote:
> > Yeah, I wouldn't call this a "simple" solution...
> >
> > The distribution/replication interceptors are quite high in the
> > interceptor stack, so we'd have to save the state of the interceptor
> > stack (basically the thread's stack) somehow and resume processing it
> > on the thread receiving the responses. In a language that supports
> > continuations that would be a piece of cake, but since we're in Java
> > we'd have to completely change the way the interceptor stack works.
> >
> > Actually we do hold the lock on modified keys while the command is
> > replicated to the other owners. But I think locking wouldn't be a
> > problem: we already allow locks to be owned by transactions instead of
> > threads, so it would just be a matter of creating a "lite transaction"
> > for non-transactional caches. Obviously the
> > TransactionSynchronizerInterceptor would have to go, but I see that as
> > a positive thing ;)
> >
> > So yeah, it could work, but it would take a huge amount of effort and
> > it's going to obfuscate the code. Plus, I'm not at all convinced that
> > it's going to improve performance that much compared to a new thread
> > pool.
> >
> > Cheers
> > Dan
> >
> >
> > On Fri, Feb 1, 2013 at 10:59 AM, Radim Vansa <rvansa at redhat.com> wrote:
> >
> >     Yeah, that would work if it is possible to break the execution path
> >     into the FutureListener from the middle of the interceptor stack - I
> >     am really not sure about that, but since in the current design no
> >     locks should be held when an RPC is called, it may be possible.
> >
> >     Let's see what someone more informed (Dan?) would think about that.
> >
> >     Thanks, Bela
> >
> >     Radim
> >
> >     ----- Original Message -----
> >     | From: "Bela Ban" <bban at redhat.com>
> >     | To: infinispan-dev at lists.jboss.org
> >     | Sent: Friday, February 1, 2013 9:39:43 AM
> >     | Subject: Re: [infinispan-dev] Threadpools in a large cluster
> >     |
> >     | It looks like the core problem is an incoming RPC-1 which triggers
> >     | another blocking RPC-2: the thread delivering RPC-1 is blocked
> >     | waiting for the response from RPC-2, and can therefore not be used
> >     | to serve other requests for the duration of RPC-2. If RPC-2 takes a
> >     | while, e.g. waiting to acquire a lock on the remote node, then it is
> >     | clear that the thread pool will quickly be exhausted.
> >     |
> >     | A simple solution would be to prevent invoking blocking RPCs *from
> >     | within* a received RPC. Let's take a look at an example:
> >     | - A invokes a blocking PUT-1 on B
> >     | - B forwards the request as blocking PUT-2 to C and D
> >     | - When PUT-2 returns and B gets the responses from C and D (or the
> >     | first one to respond, I don't know exactly how this is implemented),
> >     | it sends the response back to A (PUT-1 terminates now at A)
> >     |
> >     | We could change this to the following:
> >     | - A invokes a blocking PUT-1 on B
> >     | - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D,
> >     | it does the following:
> >     |       - B invokes PUT-2 and gets a future
> >     |       - B adds itself as a FutureListener, and it also stores the
> >     |         address of the original sender (A)
> >     |       - When the FutureListener is invoked, B sends back the result
> >     |         as a response to A
> >     | - Whenever a member leaves the cluster, the corresponding futures
> >     | are cancelled and removed from the hashmaps
> >     |
> >     | This could probably be done differently (e.g. by sending asynchronous
> >     | messages and implementing a finite state machine), but the core of
> >     | the solution is the same; namely to avoid having an incoming thread
> >     | block on a sync RPC.
> >     |
> >     | Thoughts ?
> >     |
> >     |
> >     |
> >     |
> >     | On 2/1/13 9:04 AM, Radim Vansa wrote:
> >     | > Hi guys,
> >     | >
> >     | > after dealing with the large cluster for a while, I find the way
> >     | > we use OOB threads in a synchronous configuration non-robust.
> >     | > Imagine a situation where a node which is not an owner of the key
> >     | > calls PUT. Then an RPC is made to the primary owner of that key,
> >     | > which reroutes the request to all other owners and, after these
> >     | > reply, replies back.
> >     | > There are two problems:
> >     | > 1) If we issue X simultaneous requests from non-owners to the
> >     | > primary owner, where X is the OOB thread pool size, all the OOB
> >     | > threads end up waiting for the responses and there is no thread
> >     | > left to process the OOB responses and release them.
> >     | > 2) Node A is the primary owner of keyA and a non-primary owner of
> >     | > keyB, and B is the primary owner of keyB and a non-primary owner
> >     | > of keyA. We get many requests for both keyA and keyB from other
> >     | > nodes; therefore all OOB threads on both nodes make RPCs to the
> >     | > non-primary owner, but there's no one left who could process the
> >     | > requests.
> >     | >
> >     | > While we wait for the requests to time out, the nodes with
> >     | > depleted OOB thread pools start suspecting all the other nodes
> >     | > because they can't receive heartbeats etc...
> >     | >
> >     | > You can say "increase your OOB thread pool size", but that's not
> >     | > always an option: I have currently set it to 1000 threads and it's
> >     | > not enough. In the end, I will always be limited by RAM, and
> >     | > something tells me that even nodes with a few gigs of RAM should
> >     | > be able to form a huge cluster. We use 160 HotRod worker threads
> >     | > in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my
> >     | > cluster) parallel requests can be executed, and if 10% of them
> >     | > target the same node with 1000 OOB threads, it gets stuck. It's
> >     | > about scaling and robustness.
> >     | >
> >     | > Not that I have any good solution, but I'd really like to start a
> >     | > discussion.
> >     | > Thinking about it a bit, the problem is that a blocking call
> >     | > (calling an RPC on the primary owner from the message handler) can
> >     | > block non-blocking calls (such as an RPC response or a command
> >     | > that never sends any more messages). Therefore, having a flag on
> >     | > the message saying "this won't send another message" could let the
> >     | > message be executed in a different thread pool, which would never
> >     | > deadlock. In fact, the pools could share the threads, but the
> >     | > non-blocking one would always have a few threads spare.
> >     | > It's a bad solution, as maintaining knowledge of which messages
> >     | > could block on the other node is really, really hard (we can be
> >     | > sure only in the case of RPC responses), especially when locks
> >     | > come into play. I will welcome anything better.
> >     |
> >     | --
> >     | Bela Ban, JGroups lead (http://www.jgroups.org)
> >     |
> >
> >
> >
> >
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
>
>

