[infinispan-dev] Threadpools in a large cluster

Dan Berindei dan.berindei at gmail.com
Fri Feb 1 08:19:11 EST 2013


On Fri, Feb 1, 2013 at 12:13 PM, Manik Surtani <msurtani at redhat.com> wrote:

>
> On 1 Feb 2013, at 09:39, Dan Berindei <dan.berindei at gmail.com> wrote:
>
> > Radim, do these problems happen with the HotRod server, or only with
> memcached?
> >
> > HotRod requests handled by non-owners should be very rare; instead, the
> vast majority should be handled by the primary owner directly. So if this
> happens with HotRod, we should focus on fixing the HotRod routing instead
> of focusing on how to handle a large number of requests from non-owners.
>
> Well, even Hot Rod only optionally uses smart routing.  Some client
> libraries don't have this capability.
>
>
True, and I meant to say that with memcached it should be much worse, but
at least in Radim's tests I hope smart routing is enabled.



> >
> > That being said, even if a HotRod put request is handled by the primary
> owner, it "generates" (numOwners - 1) extra OOB requests. So if you have
> 160 HotRod worker threads per node, you can expect 4 * 160 OOB messages per
> node. Multiply that by 2, because responses are OOB as well, and you can
> get 1280 OOB messages before you even start reusing any HotRod worker
> thread. Have you tried decreasing the number of HotRod workers?
> >
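To spell out the arithmetic (I'm assuming numOwners = 5 here, just to match
the 4x factor above; with numOwners = 2 the numbers shrink accordingly):

    160 workers * (5 - 1) backup owners   =  640 OOB requests per node
    640 requests * 2 (responses are OOB)  = 1280 OOB messages per node
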
> > The thing is, our OOB thread pool can't use queueing because we'd get a
> queue full of commit commands while all the OOB threads are waiting on keys
> that those commit commands would unlock. As the OOB thread pool is full, we
> discard messages, which I suspect slows things down quite a bit (especially
> if it's a credit request/response message). So it may well be that a lower
> number of HotRod worker threads would perform better.
> >
> > On the other hand, why wouldn't increasing the number of OOB threads be a
> solution? With -Xss 512k, you can get 2000 threads with only 1 GB of
> virtual memory (the actual memory used is probably even less, unless you're
> using huge pages). AFAIK the Linux kernel doesn't break a sweat with 100000
> threads running, so having 2000 threads just hanging around, waiting for a
> response, shouldn't be such a problem.
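
For the record, the memory math behind that claim:

    2000 threads * 512 KB (-Xss512k)  =  ~1 GB of reserved stack address space

and that is reserved virtual address space, not resident memory, since stack
pages are only committed as they are actually touched.
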
> >
> > I did chat with Bela (or was it a break-out session?) about moving
> Infinispan's request processing to another thread pool during the team
> meeting in Palma. That would leave the OOB thread pool free to receive
> response messages, FD heartbeats, credit requests/responses etc. The
> downside, I guess, is that each request would have to be passed to another
> thread, and the context switch may slow things down a bit. But since the
> new thread pool would be in Infinispan, we could even do tricks like
> executing a commit/rollback directly on the OOB thread.
>
> Right.  I always got the impression we were abusing the OOB pool.  But in
> the end, I think it makes sense (in JGroups) to separate a service thread
> pool (for heartbeats, credits, etc) and an application thread pool (what
> we'd use instead of OOB).  This way you could even tune your service thread
> pool to just have, say, 2 threads, and the application thread pool to 1000
> or whatever.
>
>
A separate service pool would be good, but I think we could go further and
treat ClusteredGet/Commit/Rollback commands the same way, because they
can't block waiting for other commands to be processed.
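
Something along these lines is what I have in mind - just a sketch with
made-up names (nothing here is existing Infinispan or JGroups code), and the
pool sizes are only the examples from above:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class CommandDispatcher {
       // reserved for FD heartbeats, credit requests/responses etc.
       final ExecutorService servicePool = Executors.newFixedThreadPool(2);
       // takes the potentially blocking application commands
       final ExecutorService appPool = Executors.newFixedThreadPool(1000);

       // "isNonBlocking" stands in for a check along the lines of
       // "is this a ClusteredGet/Commit/Rollback command?"
       void dispatch(Runnable command, boolean isNonBlocking) {
          if (isNonBlocking) {
             command.run();             // safe to run on the receiving (OOB) thread
          } else {
             appPool.execute(command);  // may block, so keep it off the OOB thread
          }
       }
    }

The interesting bit is the first branch: commands that by construction never
wait for another command never leave the receiving thread, so they can't be
starved by a full application pool.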



> > In the end, I just didn't feel that working on this was justified,
> considering the number of critical bugs we had. But maybe now's the time to
> start experimenting…
> >
> >
> >
> > On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa <rvansa at redhat.com> wrote:
> > Hi guys,
> >
> > after dealing with the large cluster for a while, I find the way we use
> OOB threads in a synchronous configuration non-robust.
> > Imagine a situation where a node which is not an owner of the key calls
> PUT. An RPC is then made to the primary owner of that key, which reroutes
> the request to all other owners and, after these reply, replies back itself.
> > There are two problems:
> > 1) If we issue X simultaneous requests from non-owners to the primary
> owner, where X is the OOB thread pool size, all the OOB threads are waiting
> for responses and there is no thread left to process the OOB response and
> release a thread.
> > 2) Node A is the primary owner of keyA and a non-primary owner of keyB,
> and B is the primary owner of keyB and a non-primary owner of keyA. We get
> many requests for both keyA and keyB from other nodes; therefore, all OOB
> threads on both nodes call an RPC to the non-primary owner, but there is no
> one left who could process those requests.
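
By the way, problem 2 is easy to reproduce outside Infinispan - two plain
executors standing in for the two nodes' OOB pools are enough. A toy sketch
(all names made up):

    import java.util.concurrent.*;

    public class OobExhaustionDemo {
       static final int POOL_SIZE = 4;   // stands in for the OOB thread pool size

       public static void main(String[] args) throws Exception {
          ExecutorService nodeA = Executors.newFixedThreadPool(POOL_SIZE);
          ExecutorService nodeB = Executors.newFixedThreadPool(POOL_SIZE);
          // make sure every "OOB thread" on both nodes is busy before any of
          // them sends its blocking "RPC" to the other node
          CountDownLatch allBusy = new CountDownLatch(2 * POOL_SIZE);

          for (int i = 0; i < POOL_SIZE; i++) {
             nodeA.submit(() -> { allBusy.countDown(); allBusy.await();
                                  return nodeB.submit(() -> "ack").get(); });
             nodeB.submit(() -> { allBusy.countDown(); allBusy.await();
                                  return nodeA.submit(() -> "ack").get(); });
          }

          // no free thread is left on either node to process the "ack" replies
          Future<String> probe = nodeA.submit(() -> "anyone home?");
          try {
             System.out.println(probe.get(2, TimeUnit.SECONDS));
          } catch (TimeoutException e) {
             System.out.println("both pools exhausted - nothing gets processed");
          }
          nodeA.shutdownNow();
          nodeB.shutdownNow();
       }
    }
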
> >
> > While we wait for the requests to time out, the nodes with depleted OOB
> thread pools start suspecting all the other nodes because they can't receive
> heartbeats etc.
> >
> > You can say "increase your OOB thread pool size", but that's not always an
> option; I currently have it set to 1000 threads and it's not enough. In the
> end, I will always be limited by RAM, and something tells me that even nodes
> with a few gigs of RAM should be able to form a huge cluster. We use 160
> HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64
> nodes in my cluster) parallel requests can be executed, and if 10% of them
> target the same node with 1000 OOB threads, it gets stuck. It's a question of
> scaling and robustness.
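
(Spelling those numbers out: 160 workers * 64 nodes = 10240 potential parallel
requests cluster-wide; if ~10% of them hit one node, that's ~1024 requests
queued up against a 1000-thread OOB pool.)
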
> >
> > Not that I have any good solution, but I'd really like to start a
> discussion.
> > Thinking about it a bit, the problem is that a blocking call (calling an
> RPC on the primary owner from the message handler) can block non-blocking
> calls (such as an RPC response or a command that never sends any further
> messages). Therefore, a flag on the message saying "this won't send another
> message" could let it be executed in a different thread pool, which would
> never deadlock. In fact, the pools could share threads, as long as the
> non-blocking one always had a few threads spare.
> > It's a bad solution, as tracking which messages could block on the other
> node is really, really hard (we can only be sure in the case of RPC
> responses), especially once locks come into play. I'll welcome anything
> better.
> >
> > Radim
> >
> >
> > -----------------------------------------------------------
> > Radim Vansa
> > Quality Assurance Engineer
> > JBoss Datagrid
> > tel. +420532294559 ext. 62559
> >
> > Red Hat Czech, s.r.o.
> > Brno, Purkyňova 99/71, PSČ 612 45
> > Czech Republic
> >
> >
>
> --
> Manik Surtani
> manik at jboss.org
> twitter.com/maniksurtani
>
> Platform Architect, JBoss Data Grid
> http://red.ht/data-grid
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>