On Fri, Feb 1, 2013 at 12:13 PM, Manik Surtani <msurtani(a)redhat.com> wrote:
On 1 Feb 2013, at 09:39, Dan Berindei <dan.berindei(a)gmail.com> wrote:
> Radim, do these problems happen with the HotRod server, or only with
memcached?
>
> HotRod requests handled by non-owners should be very rare; instead, the
vast majority should be handled by the primary owner directly. So if this
happens with HotRod, we should focus on fixing the HotRod routing instead
of focusing on how to handle a large number of requests from non-owners.
Well, even Hot Rod only optionally uses smart routing. Some client
libraries don't have this capability.
True, and I meant to say that with memcached it should be much worse, but
at least in Radim's tests I hope smart routing is enabled.
>
> That being said, even if a HotRod put request is handled by the primary
owner, it "generates" (numOwners - 1) extra OOB requests. So if you have
160 HotRod worker threads per node, you can expect 4 * 160 OOB messages per
node. Multiply that by 2, because responses are OOB as well, and you can
get 1280 OOB messages before you even start reusing any HotRod worker
thread. Have you tried decreasing the number of HotRod workers?
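
The estimate above can be written out as a small sketch. The class and method names are hypothetical, and the fan-out of 4 is taken directly from the message rather than derived from any particular numOwners setting.

```java
public class OobMessageEstimate {
    // Rough upper bound on OOB messages per node before any HotRod worker
    // thread is reused. fanOut is the number of OOB requests generated per
    // put (the message above uses 4).
    static int oobMessagesPerNode(int hotRodWorkers, int fanOut) {
        int requests = hotRodWorkers * fanOut; // forwarded requests
        int responses = requests;              // every response is OOB as well
        return requests + responses;
    }

    public static void main(String[] args) {
        System.out.println(oobMessagesPerNode(160, 4)); // 160 * 4 * 2
    }
}
```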
>
> The thing is, our OOB thread pool can't use queueing because we'd get a
queue full of commit commands while all the OOB threads are waiting on keys
that those commit commands would unlock. When the OOB thread pool is full, we
discard messages, which I suspect slows things down quite a bit (especially
if it's a credit request/response message). So it may well be that a lower
number of HotRod worker threads would perform better.
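
To illustrate the "no queue, discard when full" behaviour described above, here is a minimal sketch using a plain JDK ThreadPoolExecutor. It is not the JGroups OOB pool itself; the class name, the counting rejection handler and the latch that keeps the workers busy are all assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class NoQueuePoolSketch {
    // Submits `tasks` blocking tasks to a pool of at most `maxThreads` threads
    // that has no queue, and returns how many submissions were discarded.
    static int countDiscarded(int maxThreads, int tasks) {
        AtomicInteger discarded = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, maxThreads, 30, TimeUnit.SECONDS,
                new SynchronousQueue<>(),                          // hand-off only, nothing is queued
                (task, executor) -> discarded.incrementAndGet());  // count instead of dropping silently
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // stands in for "waiting for a response"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return discarded.get();
    }
}
```

With 4 threads and 10 blocking tasks, the first 4 occupy the pool and the remaining 6 are discarded, which is the situation where a dropped credit message would hurt.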
>
> On the other hand, why isn't increasing the number of OOB threads a
solution? With -Xss512k, you can get 2000 threads with only 1 GB of
virtual memory (the actual used memory is probably even less, unless you're
using huge pages). AFAIK the Linux kernel doesn't break a sweat with 100000
threads running, so having 2000 threads just hanging around, waiting for a
response, shouldn't be such a problem.
>
> I did chat with Bela (or was it a break-out session?) about moving
Infinispan's request processing to another thread pool during the team
meeting in Palma. That would leave the OOB thread pool free to receive
response messages, FD heartbeats, credit requests/responses etc. The
downside, I guess, is that each request would have to be passed to another
thread, and the context switch may slow things down a bit. But since the
new thread pool would be in Infinispan, we could even do tricks like
executing a commit/rollback directly on the OOB thread.
Right. I always got the impression we were abusing the OOB pool. But in
the end, I think it makes sense (in JGroups) to separate a service thread
pool (for heartbeats, credits, etc) and an application thread pool (what
we'd use instead of OOB). This way you could even tune your service thread
pool to just have, say, 2 threads, and the application thread pool to 1000
or whatever.
A separate service pool would be good, but I think we could go further and
treat ClusteredGet/Commit/Rollback commands the same way, because they
can't block waiting for other commands to be processed.
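
A minimal sketch of that idea, assuming a hypothetical Kind enum rather than Infinispan's real command classes: commands that can never wait for another command run directly on the receiving (OOB) thread, and everything else is handed to a separate application pool.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.Executor;

public class CommandDispatchSketch {
    // Hypothetical command kinds, for illustration only.
    public enum Kind { PUT, PREPARE, CLUSTERED_GET, COMMIT, ROLLBACK }

    // Kinds assumed never to block waiting for other commands.
    static final Set<Kind> NON_BLOCKING =
            EnumSet.of(Kind.CLUSTERED_GET, Kind.COMMIT, Kind.ROLLBACK);

    private final Executor applicationPool;

    public CommandDispatchSketch(Executor applicationPool) {
        this.applicationPool = applicationPool;
    }

    // Called on the transport (OOB) thread.
    // Returns true if the command was executed inline on the calling thread.
    public boolean dispatch(Kind kind, Runnable command) {
        if (NON_BLOCKING.contains(kind)) {
            command.run();                 // no hand-off, no context switch
            return true;
        }
        applicationPool.execute(command);  // may block, keep it off the OOB thread
        return false;
    }
}
```

The trade-off is the one mentioned above: only the potentially blocking commands pay for the extra hand-off.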
> In the end, I just didn't feel that working on this was justified,
considering the number of critical bugs we had. But maybe now's the time to
start experimenting…
>
>
>
> On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa <rvansa(a)redhat.com> wrote:
> Hi guys,
>
> after dealing with the large cluster for a while, I find the way we
use OOB threads in a synchronous configuration non-robust.
> Imagine a situation where a node that is not an owner of the key calls
PUT. An RPC is then sent to the primary owner of that key, which
reroutes the request to all other owners and, after they reply, replies
back.
> There are two problems:
> 1) If we simultaneously make X requests from non-owners to the primary
owner, where X is the OOB TP size, all the OOB threads are waiting for
responses and there is no thread left to process an OOB response and release
a waiting thread.
> 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and
B is the primary owner of keyB and a non-primary owner of keyA. We get many
requests for both keyA and keyB from other nodes, so all OOB threads on both
nodes call an RPC to the non-primary owner, but there's no one who could
process the request.
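
Problem 1 can be reproduced in miniature with a plain JDK thread pool. This is only a sketch: a fixed pool with a queue stands in for the OOB pool (the real one discards instead of queueing), an empty task stands in for the response, and the latches exist solely to make the outcome independent of timing. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class OobStarvationSketch {
    // Starts `requests` handlers at once on a pool of `poolSize` threads. Each
    // handler needs a "response" task to run on the same pool before it can
    // finish. Returns how many handlers timed out waiting for their response.
    static int timedOutHandlers(int poolSize, int requests, long timeoutMillis) {
        if (requests > poolSize) {
            throw new IllegalArgumentException("requests must not exceed poolSize");
        }
        ExecutorService oob = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allStarted = new CountDownLatch(requests);
        CountDownLatch allWaited = new CountDownLatch(requests);
        Callable<Boolean> handler = () -> {
            allStarted.countDown();
            allStarted.await();                          // all requests are in flight together
            Future<?> response = oob.submit(() -> { });  // the response needs a free thread
            boolean timedOut = false;
            try {
                response.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true;
            }
            allWaited.countDown();
            allWaited.await();  // keep the thread until every handler has finished waiting
            return timedOut;
        };
        try {
            List<Future<Boolean>> handlers = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                handlers.add(oob.submit(handler));
            }
            int count = 0;
            for (Future<Boolean> h : handlers) {
                if (h.get()) count++;
            }
            return count;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            oob.shutdownNow();
        }
    }
}
```

With requests equal to poolSize every handler times out; with one request fewer, the single spare thread processes all the responses and nothing times out.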
>
> While we wait for the requests to time out, the nodes with depleted OOB
thread pools start suspecting all other nodes because they can't receive
heartbeats etc...
>
> You can say "increase your OOB tp size", but that's not always an
option. I have currently set it to 1000 threads and it's not enough. In the
end, I will always be limited by RAM, and something tells me that even nodes
with a few gigs of RAM should be able to form a huge cluster. We use 160
HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64
nodes in my cluster) parallel requests can be executed, and if 10% of them
target the same node with 1000 OOB threads, it gets stuck. It's about scaling
and robustness.
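
The numbers above, spelled out as a tiny sketch (hypothetical class and method names, integer arithmetic only):

```java
public class ParallelRequestEstimate {
    // Maximum number of client requests in flight across the whole cluster.
    static int maxParallelRequests(int workersPerNode, int clusterSize) {
        return workersPerNode * clusterSize;
    }

    // True if the given share of those requests, all hitting one node,
    // exceeds that node's OOB pool size.
    static boolean exhaustsPool(int parallelRequests, int percentToOneNode, int oobPoolSize) {
        return parallelRequests * percentToOneNode / 100 > oobPoolSize;
    }

    public static void main(String[] args) {
        int parallel = maxParallelRequests(160, 64);
        System.out.println(parallel + " parallel requests, pool exhausted: "
                + exhaustsPool(parallel, 10, 1000));
    }
}
```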
>
> Not that I'd have any good solution, but I'd really like to start a
discussion.
> Thinking about it a bit, the problem is that a blocking call (calling an RPC
on the primary owner from a message handler) can block non-blocking calls (such
as an RPC response or a command that never sends any more messages). Therefore,
a flag on the message saying "this won't send another message" could let the
message be executed in a different thread pool, which would never be
deadlocked. In fact, the pools could share threads, as long as the
non-blocking one always had a few threads to spare.
> It's a bad solution, as tracking which messages could block on the
other node is really, really hard (we can be sure only in the case of RPC
responses), especially when locks are involved. I will welcome anything better.
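
One way the "shared threads with a few spare" idea could look, as a rough sketch only: a single JDK pool in which potentially blocking messages must take a permit, so that a fixed number of threads can never be occupied by them. The class name, the mayBlock flag and the choice to reject (rather than queue) excess blocking messages are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class ReservedThreadsSketch {
    private final ExecutorService pool;
    private final Semaphore blockingPermits;

    public ReservedThreadsSketch(int poolSize, int reservedForNonBlocking) {
        this.pool = Executors.newFixedThreadPool(poolSize);
        // Blocking messages may occupy at most poolSize - reserved threads.
        this.blockingPermits = new Semaphore(poolSize - reservedForNonBlocking);
    }

    // mayBlock is the inverse of the "this won't send another message" flag.
    // Returns false if a blocking message is rejected because only the
    // reserved threads are left.
    public boolean submit(Runnable message, boolean mayBlock) {
        if (!mayBlock) {
            pool.execute(message);
            return true;
        }
        if (!blockingPermits.tryAcquire()) {
            return false;
        }
        pool.execute(() -> {
            try {
                message.run();
            } finally {
                blockingPermits.release();
            }
        });
        return true;
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

The hard part named above remains: something still has to decide correctly which messages may block.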
>
> Radim
>
>
> -----------------------------------------------------------
> Radim Vansa
> Quality Assurance Engineer
> JBoss Datagrid
> tel. +420532294559 ext. 62559
>
> Red Hat Czech, s.r.o.
> Brno, Purkyňova 99/71, PSČ 612 45
> Czech Republic
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
--
Manik Surtani
manik(a)jboss.org
twitter.com/maniksurtani
Platform Architect, JBoss Data Grid
http://red.ht/data-grid