[infinispan-dev] Threadpools in a large cluster

Fri Feb 1 05:40:11 EST 2013

| 
| Radim, do these problems happen with the HotRod server, or only with
| memcached?

I didn't test memcached, only HotRod. The thing I was seing were many OOB threads stuck when sending messages from handleRemoteWrite.

| 
| HotRod requests handled by non-owners should be very rare, instead
| the vast majority should be handled by the primary owner directly.
| So if this happens with HotRod, we should focus on fixing the HotRod
| routing instead of focusing on how to handle a large number of
| requests from non-owners.
| 
| 
| 
| That being said, even if a HotRod put request is handled by the
| primary owner, it "generates" (numOwners - 1) extra OOB requests. So
| if you have 160 HotRod worker threads per node, you can expect 4 *
| 160 OOB messages per node. Multiply that by 2, because responses are
| OOB as well, and you can get 1280 OOB messages before you even start
| reusing any HotRod worker thread. Have you tried decreasing the
| number of HotRod workers?

Decreasing the number of workers would be obvious solution how to scale it down. To be honest, I haven't tried that because it would certainly lower the overall throughput, and it is not a systematic solution IMO.

| 
| The thing is, our OOB thread pool can't use queueing because we'd get
| a queue full of commit commands while all the OOB threads are
| waiting on keys that those commit commands would unlock. As the OOB
| thread pool is full, we discard messages, which I suspect slows
| things down quite a bit (especially if it's a credit
| request/response message). So it may well be that a lower number of
| HotRod working threads would perform better.

We have already had similar talk where you convinced me that having queue for the thread pool wouldn't help much.

| 
| On the other hand, why is increasing the number of OOB threads a
| solution? With -Xss 512k, you can get 2000 threads with only 1 GB of
| virtual memory (the actual used memory is probably even less, unless
| you're using huge pages). AFAIK the Linux kernel doesn't break a
| sweat with 100000 threads running, so having 2000 threads just
| hanging around, waiting for a response, should be such a problem.
| 
I don't say that it won't work (you're right that it's just virtual memory), but I have thought that Infinispan should scale and be robust even for the edge cases.

| 
| I did chat with Bela (or was it a break-out session?) about moving
| Infinispan's request processing to another thread pool during the
| team meeting in Palma. That would leave the OOB thread pool free to
| receive response messages, FD heartbeats, credit requests/responses
| etc. The downside, I guess, is that each request would have to be
| passed to another thread, and the context switch may slow things
| down a bit. But since the new thread pool would be in Infinispan, we
| could even do tricks like executing a commit/rollback directly on
| the OOB thread.

Hmm, for some messages (nonblocking) the context switch could be spared. It depends how complicated is to determine whether the message will block before entering the interceptor chain.

| 
| 
| In the end, I just didn't feel that working on this was justified,
| considering the number of critical bugs we had. But maybe now's the
| time to start experimenting...
| 

I agree and I'm happy that ISPN is mostly working now.

I have tried to rerun the scenario with upper OOB limit 2000 and it did not help (originaly I was using 200 and increased to 1000), node stops responding in one moment... So maybe OOB is not the only villain. I'll keep investigating.

Radim

| 
| 
| 
| 
| 
| On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa < rvansa at redhat.com >
| wrote:
| 
| 
| Hi guys,
| 
| after dealing with the large cluster for a while I find the way how
| we use OOB threads in synchronous configuration non-robust.
| Imagine a situation where node which is not an owner of the key calls
| PUT. Then the a RPC is called to the primary owner of that key,
| which reroutes the request to all other owners and after these
| reply, it replies back.
| There are two problems:
| 1) If we do simultanously X requests from non-owners to the primary
| owner where X is OOB TP size, all the OOB threads are waiting for
| the responses and there is no thread to process the OOB response and
| release the thread.
| 2) Node A is primary owner of keyA, non-primary owner of keyB and B
| is primary of keyB and non-primary of keyA. We got many requests for
| both keyA and keyB from other nodes, therefore, all OOB threads from
| both nodes call RPC to the non-primary owner but there's noone who
| could process the request.
| 
| While we wait for the requests to timeout, the nodes with depleted
| OOB threadpools start suspecting all other nodes because they can't
| receive heartbeats etc...
| 
| You can say "increase your OOB tp size", but that's not always an
| option, I have currently set it to 1000 threads and it's not enough.
| In the end, I will be always limited by RAM and something tells me
| that even nodes with few gigs of RAM should be able to form a huge
| cluster. We use 160 hotrod worker threads in JDG, that means that
| 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests
| can be executed, and if 10% targets the same node with 1000 OOB
| threads, it stucks. It's about scaling and robustness.
| 
| Not that I'd have any good solution, but I'd really like to start a
| discussion.
| Thinking about it a bit, the problem is that blocking call (calling
| RPC on primary owner from message handler) can block non-blocking
| calls (such as RPC response or command that never sends any more
| messages). Therefore, having a flag on message "this won't send
| another message" could let the message be executed in different
| threadpool, which will be never deadlocked. In fact, the pools could
| share the threads but the non-blocking would have always a few
| threads spare.
| It's a bad solution as maintaining which message could block in the
| other node is really, really hard (we can be sure only in case of
| RPC responses), especially when some locks come. I will welcome
| anything better.
| 
| Radim
| 
| 
| -----------------------------------------------------------
| Radim Vansa
| Quality Assurance Engineer
| JBoss Datagrid
| tel. +420532294559 ext. 62559
| 
| Red Hat Czech, s.r.o.
| Brno, Purkyňova 99/71, PSČ 612 45
| Czech Republic
| 
| 
| _______________________________________________
| infinispan-dev mailing list
| infinispan-dev at lists.jboss.org
| https://lists.jboss.org/mailman/listinfo/infinispan-dev
| 
| _______________________________________________
| infinispan-dev mailing list
| infinispan-dev at lists.jboss.org
| https://lists.jboss.org/mailman/listinfo/infinispan-dev