On 1 Feb 2013, at 09:39, Dan Berindei <dan.berindei(a)gmail.com> wrote:
Radim, do these problems happen with the HotRod server, or only with
memcached?
HotRod requests handled by non-owners should be very rare; instead, the vast majority
should be handled by the primary owner directly. So if this happens with HotRod, we should
focus on fixing the HotRod routing instead of focusing on how to handle a large number of
requests from non-owners.
Well, even Hot Rod only optionally uses smart routing. Some client libraries don't
have this capability.
That being said, even if a HotRod put request is handled by the primary owner, it
"generates" (numOwners - 1) extra OOB requests. So if you have 160 HotRod worker
threads per node, you can expect 4 * 160 OOB messages per node. Multiply that by 2,
because responses are OOB as well, and you can get 1280 OOB messages before you even start
reusing any HotRod worker thread. Have you tried decreasing the number of HotRod workers?
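
For reference, the estimate above can be written out as a small Java sketch. The class and method names are hypothetical, and the factor of 4 extra OOB requests per put is simply the figure used above, not something derived from a particular configuration.

```java
// Back-of-the-envelope estimate only; not Infinispan or JGroups code.
// All names here are hypothetical.
final class OobMessageEstimate {

    /**
     * OOB messages a node can have in flight before any HotRod worker
     * thread is reused.
     *
     * @param hotRodWorkers       HotRod worker threads per node
     * @param extraRequestsPerPut extra OOB requests generated by one put
     */
    static int inFlight(int hotRodWorkers, int extraRequestsPerPut) {
        int requests = hotRodWorkers * extraRequestsPerPut;
        // Responses are OOB as well, so every request counts twice.
        return requests * 2;
    }
}
```

With the numbers from this thread, `OobMessageEstimate.inFlight(160, 4)` gives 1280.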
The thing is, our OOB thread pool can't use queueing because we'd get a queue
full of commit commands while all the OOB threads are waiting on keys that those commit
commands would unlock. As the OOB thread pool is full, we discard messages, which I
suspect slows things down quite a bit (especially if it's a credit request/response
message). So it may well be that a lower number of HotRod worker threads would perform
better.
On the other hand, why is increasing the number of OOB threads a solution? With -Xss
512k, you can get 2000 threads with only 1 GB of virtual memory (the actual used memory is
probably even less, unless you're using huge pages). AFAIK the Linux kernel
doesn't break a sweat with 100000 threads running, so having 2000 threads just hanging
around, waiting for a response, shouldn't be such a problem.
I did chat with Bela (or was it a break-out session?) about moving Infinispan's
request processing to another thread pool during the team meeting in Palma. That would
leave the OOB thread pool free to receive response messages, FD heartbeats, credit
requests/responses etc. The downside, I guess, is that each request would have to be
passed to another thread, and the context switch may slow things down a bit. But since the
new thread pool would be in Infinispan, we could even do tricks like executing a
commit/rollback directly on the OOB thread.
Right. I always got the impression we were abusing the OOB pool. But in the end, I think
it makes sense (in JGroups) to separate a service thread pool (for heartbeats, credits,
etc) and an application thread pool (what we'd use instead of OOB). This way you
could even tune your service thread pool to just have, say, 2 threads, and the application
thread pool to 1000 or whatever.
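
A minimal sketch of that split, assuming plain java.util.concurrent executors rather than any real JGroups API. The class, the message kinds, and the choice to keep responses on the service pool are all assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; this is not how JGroups exposes its thread pools.
final class SplitPools {

    enum Kind { HEARTBEAT, CREDIT, RESPONSE, REQUEST }

    private final ExecutorService servicePool;     // small, never blocks on RPCs
    private final ExecutorService applicationPool; // large, may block on RPCs

    SplitPools(int serviceThreads, int applicationThreads) {
        this.servicePool = Executors.newFixedThreadPool(serviceThreads);
        this.applicationPool = Executors.newFixedThreadPool(applicationThreads);
    }

    /** Hands the handler to a pool and returns that pool's name. */
    String dispatch(Kind kind, Runnable handler) {
        if (kind == Kind.REQUEST) {
            // Requests may wait for other nodes, so keep them off the service pool.
            applicationPool.execute(handler);
            return "application";
        }
        // Assumption: heartbeats, credits and responses never send further messages.
        servicePool.execute(handler);
        return "service";
    }

    void shutdown() {
        servicePool.shutdown();
        applicationPool.shutdown();
    }
}
```

Something like `new SplitPools(2, 1000)` would match the sizes mentioned above. Whether the extra hand-off between threads is affordable is exactly the open question.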
In the end, I just didn't feel that working on this was
justified, considering the number of critical bugs we had. But maybe now's the time to
start experimenting…
On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa <rvansa(a)redhat.com> wrote:
Hi guys,
after dealing with the large cluster for a while, I find the way we use OOB threads in
synchronous configuration non-robust.
Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC
is sent to the primary owner of that key, which reroutes the request to all other owners
and, after these reply, replies back.
There are two problems:
1) If we simultaneously do X requests from non-owners to the primary owner, where X is the
OOB TP size, all the OOB threads are waiting for the responses and there is no thread to
process the OOB response and release the thread.
2) Node A is primary owner of keyA and non-primary owner of keyB, and B is primary owner of
keyB and non-primary owner of keyA. We get many requests for both keyA and keyB from other
nodes; therefore, all OOB threads on both nodes call RPC to the non-primary owner, but
there's no one who could process the request.
While we wait for the requests to time out, the nodes with depleted OOB thread pools start
suspecting all other nodes because they can't receive heartbeats etc.
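
The first problem can be reproduced in miniature with a plain Java executor. This is only a sketch under stated assumptions: a fixed thread pool stands in for the OOB pool, an empty task stands in for the response, and the class and method names are made up. Unlike the real OOB pool, this executor queues work instead of discarding it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo of pool starvation; not Infinispan or JGroups code.
final class OobStarvationDemo {

    /** Returns how many requests received their response before the timeout. */
    static int completedInTime(int poolSize, int requests, long timeoutMillis) {
        if (requests > poolSize) {
            throw new IllegalArgumentException("requests must not exceed poolSize");
        }
        ExecutorService oob = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allStarted = new CountDownLatch(requests);
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            for (int i = 0; i < requests; i++) {
                results.add(oob.submit(() -> {
                    // Wait until every request occupies a thread.
                    allStarted.countDown();
                    allStarted.await();
                    // The "response" needs a thread from the same pool.
                    Future<?> response = oob.submit(() -> { });
                    try {
                        response.get(timeoutMillis, TimeUnit.MILLISECONDS);
                        return true;
                    } catch (TimeoutException e) {
                        return false;
                    }
                }));
            }
            int ok = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    ok++;
                }
            }
            return ok;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            oob.shutdownNow();
        }
    }
}
```

When the number of requests equals the pool size, for example `completedInTime(4, 4, 200)`, at least one request has to time out before any thread is free to run a response. With one spare thread, as in `completedInTime(5, 4, 5000)`, the responses are processed right away.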
You can say "increase your OOB tp size", but that's not always an option. I
have currently set it to 1000 threads and it's not enough. In the end, I will
always be limited by RAM, and something tells me that even nodes with a few gigs of RAM
should be able to form a huge cluster. We use 160 hotrod worker threads in JDG, which means
that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed,
and if 10% of them target the same node with 1000 OOB threads, it gets stuck. It's about
scaling and robustness.
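
Spelled out as a quick Java sketch (hypothetical names, integer arithmetic only), the numbers above look like this:

```java
// Rough capacity check using the figures from this thread; names are hypothetical.
final class ClusterLoadEstimate {

    /** Requests that can be in progress across the whole cluster at once. */
    static int parallelRequests(int workersPerNode, int clusterSize) {
        return workersPerNode * clusterSize;
    }

    /** True if the share of requests hitting one node fills its OOB pool. */
    static boolean exhaustsPool(int parallelRequests, int percentToOneNode, int oobThreads) {
        return parallelRequests * percentToOneNode / 100 >= oobThreads;
    }
}
```

`parallelRequests(160, 64)` is 10240, and 10% of that is 1024 requests, which is more than the 1000 OOB threads on the target node.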
Not that I'd have any good solution, but I'd really like to start a discussion.
Thinking about it a bit, the problem is that a blocking call (calling RPC on the primary
owner from a message handler) can block non-blocking calls (such as an RPC response or a
command that never sends any more messages). Therefore, having a flag on the message
saying "this won't send another message" could let the message be executed in a different
thread pool, which would never be deadlocked. In fact, the pools could share the threads,
but the non-blocking one would always have a few threads spare.
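
One way the shared-threads variant might look, as a rough sketch only. It assumes a single fixed pool plus a semaphore that caps how many threads potentially blocking handlers may hold. The class name, the boolean flag, and the reject-on-full behaviour are all assumptions, not anything that exists in JGroups.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: one pool, with some threads kept free for handlers
// that are flagged as never sending another message.
final class ReservedThreadsPool {

    private final ExecutorService pool;
    private final Semaphore blockingPermits;

    /** Assumes 0 <= reservedForNonBlocking < totalThreads. */
    ReservedThreadsPool(int totalThreads, int reservedForNonBlocking) {
        this.pool = Executors.newFixedThreadPool(totalThreads);
        this.blockingPermits = new Semaphore(totalThreads - reservedForNonBlocking);
    }

    /** Returns false if a potentially blocking handler was rejected. */
    boolean submit(boolean wontSendAnotherMessage, Runnable handler) {
        if (wontSendAnotherMessage) {
            // Blocking handlers never hold every thread, so this will get one.
            pool.execute(handler);
            return true;
        }
        if (!blockingPermits.tryAcquire()) {
            return false; // caller decides whether to drop or retry
        }
        pool.execute(() -> {
            try {
                handler.run();
            } finally {
                blockingPermits.release();
            }
        });
        return true;
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The hard part remains deciding which messages may safely carry the flag; the sketch just takes that as given.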
It's a bad solution, as tracking which messages could block on the other node is
really, really hard (we can be sure only in the case of RPC responses), especially when
locks are involved. I will welcome anything better.
Radim
-----------------------------------------------------------
Radim Vansa
Quality Assurance Engineer
JBoss Datagrid
tel. +420532294559 ext. 62559
Red Hat Czech, s.r.o.
Brno, Purkyňova 99/71, PSČ 612 45
Czech Republic
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
--
Manik Surtani
manik(a)jboss.org
twitter.com/maniksurtani
Platform Architect, JBoss Data Grid
http://red.ht/data-grid