[infinispan-dev] Threadpools in a large cluster

Radim Vansa rvansa at redhat.com
Fri Feb 1 03:59:34 EST 2013


Yeah, that would work if it is possible to break the execution path out into the FutureListener from the middle of the interceptor stack - I am really not sure about that, but since in the current design no locks should be held when an RPC is called, it may be possible.
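
To make sure we're talking about the same thing, below is a rough sketch of
how B could handle this (plain Java with java.util.concurrent only -
onPut1/forwardPut2/sendResponse and the string-keyed bookkeeping are made-up
names for illustration, not Infinispan or JGroups API):

import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class NonBlockingForwardSketch {

    // pending forwarded requests, keyed by "originalSender:key", so they
    // can be cancelled when a member leaves
    private final Map<String, Future<String>> pending =
            new ConcurrentHashMap<String, Future<String>>();

    // stands in for the async transport; the point is that the OOB thread
    // which delivered PUT-1 never waits on it
    private final ExecutorService transport = Executors.newCachedThreadPool();

    // called on the OOB thread that delivered PUT-1 from A
    void onPut1(final String originalSender, final String key, final String value) {
        FutureTask<String> put2 = new FutureTask<String>(new Callable<String>() {
            public String call() throws Exception {
                return forwardPut2(key, value);          // PUT-2 to C and D
            }
        }) {
            @Override
            protected void done() {                      // plays the role of the FutureListener
                pending.remove(originalSender + ":" + key);
                if (isCancelled()) {
                    return;                              // member left, nobody to reply to
                }
                try {
                    sendResponse(originalSender, get()); // PUT-1 terminates at A only now
                } catch (Exception e) {
                    sendResponse(originalSender, "FAILED: " + e);
                }
            }
        };
        pending.put(originalSender + ":" + key, put2);
        transport.execute(put2);
        // the OOB thread returns here and is free to serve other requests
    }

    // called from a view change handler: cancel the futures that belong to
    // requests originated by the dead member
    void onMemberLeft(String member) {
        for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
            if (e.getKey().startsWith(member + ":")) {
                e.getValue().cancel(true);
            }
        }
    }

    private String forwardPut2(String key, String value) {
        return "OK";                                     // placeholder for the real remote call
    }

    private void sendResponse(String to, String result) {
        System.out.println("reply to " + to + ": " + result);
    }
}

The hard part for us is that sendResponse() would have to be triggered from
the middle of the interceptor stack rather than from a standalone handler
like this.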

Let's see what someone more informed (Dan?) would think about that.

Thanks, Bela

Radim

----- Original Message -----
| From: "Bela Ban" <bban at redhat.com>
| To: infinispan-dev at lists.jboss.org
| Sent: Friday, February 1, 2013 9:39:43 AM
| Subject: Re: [infinispan-dev] Threadpools in a large cluster
| 
| It looks like the core problem is an incoming RPC-1 which triggers
| another blocking RPC-2: the thread delivering RPC-1 is blocked waiting
| for the response from RPC-2, and can therefore not be used to serve
| other requests for the duration of RPC-2. If RPC-2 takes a while, e.g.
| waiting to acquire a lock in the remote node, then it is clear that
| the thread pool will quickly exceed its max size.
| 
| A simple solution would be to prevent invoking blocking RPCs *from
| within* a received RPC. Let's take a look at an example:
| - A invokes a blocking PUT-1 on B
| - B forwards the request as blocking PUT-2 to C and D
| - When PUT-2 returns and B gets the responses from C and D (or the
| first one to respond, don't know exactly how this is implemented), it
| sends the response back to A (PUT-1 terminates now at A)
| 
| We could change this to the following:
| - A invokes a blocking PUT-1 on B
| - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D,
|   it does the following:
|       - B invokes PUT-2 and gets a future
|       - B adds itself as a FutureListener, and it also stores the
|         address of the original sender (A)
|       - When the FutureListener is invoked, B sends back the result
|         as a response to A
| - Whenever a member leaves the cluster, the corresponding futures are
|   cancelled and removed from the hashmaps
| 
| This could probably be done differently (e.g. by sending asynchronous
| messages and implementing a finite state machine), but the core of the
| solution is the same; namely to avoid having an incoming thread block
| on a sync RPC.
| 
| Thoughts ?
| 
| 
| 
| 
| On 2/1/13 9:04 AM, Radim Vansa wrote:
| > Hi guys,
| >
| > after dealing with the large cluster for a while, I find the way we
| > use OOB threads in the synchronous configuration non-robust.
| > Imagine a situation where a node which is not an owner of the key
| > calls PUT. An RPC is then made to the primary owner of that key,
| > which reroutes the request to all other owners and, after these
| > reply, replies back to the originator.
| > There are two problems:
| > 1) If we issue X requests simultaneously from non-owners to the
| > primary owner, where X is the OOB thread pool size, all the OOB
| > threads end up waiting for the responses and there is no thread
| > left to process the OOB responses and release them.
| > 2) Node A is the primary owner of keyA and a non-primary owner of
| > keyB, while B is the primary owner of keyB and a non-primary owner
| > of keyA. We get many requests for both keyA and keyB from other
| > nodes; therefore, all OOB threads on both nodes call RPCs to the
| > non-primary owner, but there is no one left who could process
| > those requests.
| >
| > While we wait for the requests to time out, the nodes with depleted
| > OOB thread pools start suspecting all other nodes because they
| > can't receive heartbeats etc...
| >
| > You can say "increase your OOB thread pool size", but that's not
| > always an option; I have currently set it to 1000 threads and it's
| > not enough. In the end, I will always be limited by RAM, and
| > something tells me that even nodes with a few gigs of RAM should be
| > able to form a huge cluster. We use 160 HotRod worker threads in
| > JDG, which means that 160 * 64 nodes = 10240 parallel requests can
| > be executed in my cluster, and if 10% of them (about 1024) target
| > the same node, which has only 1000 OOB threads, it gets stuck. It's
| > about scaling and robustness.
| >
| > Not that I have any good solution, but I'd really like to start a
| > discussion.
| > Thinking about it a bit, the problem is that a blocking call
| > (calling an RPC on the primary owner from the message handler) can
| > block non-blocking calls (such as an RPC response, or a command
| > that never sends any more messages). Therefore, having a flag on
| > the message saying "this won't send another message" could let such
| > messages be executed in a different thread pool, which can never
| > deadlock. In fact, the pools could share threads, but the
| > non-blocking one would always have a few threads spare.
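| >
| > Roughly, I mean something like this (just a sketch in plain Java;
| > the Message class, the flag name and the pool sizes are made up):
| >
| > import java.util.concurrent.ExecutorService;
| > import java.util.concurrent.Executors;
| >
| > public class SplitPoolSketch {
| >
| >     // messages that may issue further blocking RPCs go here and can
| >     // all get stuck waiting...
| >     private final ExecutorService blockingPool =
| >             Executors.newFixedThreadPool(1000);
| >
| >     // ...while responses and "leaf" commands always find a free
| >     // thread here
| >     private final ExecutorService nonBlockingPool =
| >             Executors.newFixedThreadPool(50);
| >
| >     static class Message {
| >         // hypothetical flag: "this won't send another message"
| >         final boolean sendsNoMoreMessages;
| >         final Runnable work;
| >         Message(boolean sendsNoMoreMessages, Runnable work) {
| >             this.sendsNoMoreMessages = sendsNoMoreMessages;
| >             this.work = work;
| >         }
| >     }
| >
| >     // route by the flag instead of putting everything on one OOB pool
| >     void deliver(Message msg) {
| >         (msg.sendsNoMoreMessages ? nonBlockingPool : blockingPool)
| >                 .execute(msg.work);
| >     }
| > }
| >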
| > It's a bad solution, as determining which messages could block on
| > the other node is really, really hard (we can be sure only in the
| > case of RPC responses), especially once locks come into play. I
| > will welcome anything better.
| 
| --
| Bela Ban, JGroups lead (http://www.jgroups.org)
| 
| _______________________________________________
| infinispan-dev mailing list
| infinispan-dev at lists.jboss.org
| https://lists.jboss.org/mailman/listinfo/infinispan-dev
| 

