[infinispan-dev] Threadpools in a large cluster

Bela Ban bban at redhat.com
Fri Feb 1 03:39:43 EST 2013


It looks like the core problem is an incoming RPC-1 which triggers 
another blocking RPC-2: the thread delivering RPC-1 is blocked waiting 
for the response to RPC-2, and therefore cannot be used to serve 
other requests for the duration of RPC-2. If RPC-2 takes a while, e.g. 
while waiting to acquire a lock on the remote node, then it is clear 
that the thread pool will quickly be exhausted.

A simple solution would be to prevent invoking blocking RPCs *from 
within* a received RPC. Let's take a look at an example:
- A invokes a blocking PUT-1 on B
- B forwards the request as blocking PUT-2 to C and D
- When PUT-2 returns and B gets the responses from C and D (or from the 
first one to respond; I don't know exactly how this is implemented), it 
sends the response back to A (PUT-1 now terminates at A)
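
To make the problem concrete, here is a minimal sketch of that blocking 
flow on B; invokeRemotelySync() and sendResponse() are hypothetical 
placeholders for the sketch, not the actual Infinispan/JGroups API:

import java.util.List;
import org.jgroups.Address;

// Hedged sketch of the current blocking flow on B; invokeRemotelySync()
// and sendResponse() are hypothetical placeholders, not the real API.
abstract class BlockingForwarder {
    abstract List<Object> invokeRemotelySync(List<Address> targets, Object command);
    abstract void sendResponse(Address to, Object result);

    void onPut1(Address senderA, Object put, List<Address> backupOwners) {
        // the OOB thread that delivered PUT-1 is held here for the whole duration of PUT-2
        List<Object> acks = invokeRemotelySync(backupOwners, put);
        sendResponse(senderA, acks); // only now is the thread free to serve other requests
    }
}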

We could change this to the following:
- A invokes a blocking PUT-1 on B
- B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, it 
does the following:
      - B invokes PUT-2 and gets a future
      - B adds itself as a FutureListener, and it also stores the 
address of the original sender (A)
      - When the FutureListener is invoked, B sends back the result as a 
response to A
- Whenever a member leaves the cluster, the corresponding futures are 
cancelled and removed from the map that tracks them

This could probably be done differently (e.g. by sending asynchronous 
messages and implementing a finite state machine), but the core of the 
solution is the same: avoid having an incoming thread block on a sync RPC.
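
A minimal sketch of the future-based variant, using a plain 
CompletableFuture to stand in for JGroups' own future/listener types; 
invokeRemotelyAsync() and sendResponse() are again hypothetical 
placeholders, not the real API:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import org.jgroups.Address;

// Hedged sketch of the non-blocking variant: the thread delivering PUT-1
// registers a callback and returns instead of blocking on PUT-2.
abstract class NonBlockingForwarder {
    abstract CompletableFuture<List<Object>> invokeRemotelyAsync(List<Address> targets, Object command);
    abstract void sendResponse(Address to, Object result);

    // futures tracked per original sender (simplified; a real impl would key by request id)
    private final Map<Address, CompletableFuture<List<Object>>> pending = new ConcurrentHashMap<>();

    void onPut1(Address senderA, Object put, List<Address> backupOwners) {
        CompletableFuture<List<Object>> f = invokeRemotelyAsync(backupOwners, put);
        pending.put(senderA, f);
        // the incoming OOB thread returns right away; this callback fires when C/D reply
        f.whenComplete((acks, err) -> {
            pending.remove(senderA);
            sendResponse(senderA, err == null ? acks : err);
        });
    }

    // called from a view-change handler when a member leaves the cluster
    void onMemberLeft(Address leaver) {
        CompletableFuture<List<Object>> f = pending.remove(leaver);
        if (f != null)
            f.cancel(false);
    }
}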

Thoughts?




On 2/1/13 9:04 AM, Radim Vansa wrote:
> Hi guys,
>
> after dealing with the large cluster for a while, I find the way we use OOB threads in a synchronous configuration non-robust.
> Imagine a situation where a node which is not an owner of the key calls PUT. An RPC is then sent to the primary owner of that key, which reroutes the request to all other owners and, after these reply, replies back.
> There are two problems:
> 1) If we simultaneously issue X requests from non-owners to the primary owner, where X is the OOB thread pool size, all the OOB threads are waiting for responses and there is no thread left to process an incoming OOB response and release a thread.
> 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and B is the primary owner of keyB and a non-primary owner of keyA. We get many requests for both keyA and keyB from other nodes; therefore, all OOB threads on both nodes call RPCs to the non-primary owner, but there is no one left to process those requests.
>
> While we wait for the requests to time out, the nodes with depleted OOB thread pools start suspecting all other nodes because they can't process heartbeats etc...
>
> You can say "increase your OOB thread pool size", but that's not always an option; I currently have it set to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% of them target the same node with 1000 OOB threads, it gets stuck. It's about scaling and robustness.
>
> Not that I have any good solution, but I'd really like to start a discussion.
> Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response or a command that never sends any more messages). Therefore, having a flag on the message saying "this won't send another message" could let the message be executed in a different thread pool, which can never deadlock. In fact, the pools could share threads, but the non-blocking one would always have a few threads spare.
> It's a bad solution, as tracking which messages could block on the other node is really, really hard (we can be sure only in the case of RPC responses), especially once locks come into play. I will welcome anything better.
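
For reference, the two-pool idea described above might look roughly like 
this; the NO_FURTHER_RPC flag, the pool sizes and the Message type are 
assumptions for the sketch, not an existing Infinispan/JGroups API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough sketch of flag-based dispatch into two pools; all names here are
// assumptions, not an existing Infinispan/JGroups API.
class FlaggedDispatcher {
    interface Message { boolean hasFlag(Flag f); void deliver(); }
    enum Flag { NO_FURTHER_RPC }

    // pool for messages that may issue further (blocking) RPCs
    private final ExecutorService blockingPool = Executors.newFixedThreadPool(100);
    // pool reserved for messages guaranteed not to send more messages (e.g. RPC
    // responses), so it is never exhausted by threads waiting on remote replies
    private final ExecutorService nonBlockingPool = Executors.newFixedThreadPool(8);

    void dispatch(Message msg) {
        ExecutorService pool = msg.hasFlag(Flag.NO_FURTHER_RPC) ? nonBlockingPool : blockingPool;
        pool.execute(msg::deliver);
    }
}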

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)


