[infinispan-dev] Threadpools in a large cluster

Dan Berindei dan.berindei at gmail.com
Thu Feb 7 05:09:55 EST 2013


On Thu, Feb 7, 2013 at 6:53 AM, Bela Ban <bban at redhat.com> wrote:

> Hi Pedro,
>
> this is almost exactly what I wanted to implement !
>
> Question:
> - In RequestCorrelator.handleRequest():
>
> protected void handleRequest(Message req, Header hdr) {
> Object retval;
> boolean threwException = false;
> MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
> try {
> retval=request_handler.handle(messageRequest);
> } catch(Throwable t) {
> retval=t;
> threwException = true;
> }
> messageRequest.sendReply(retval, threwException);// <-- should be moved
> up, or called only if threwException == true
> }
>
>
> , you create a MessageRequestImpl and pass it to the RequestHandler. The
> request handler then dispatches the request (possibly) to a thread pool
> and calls MessageRequestImpl.sendReply() when done.
>
> However, you also call MessageRequest.sendReply() before returning from
> handleRequest(). I think this is an error, and
> MessageRequest.sendReply() should be moved up inside the catch clause,
> or be called only if threwException is true, so that we send a reply on
> behalf of the RequestHandler if and only if it threw an exception (e.g.
> before it dispatches the request to a thread pool). Otherwise, we'd send
> a reply *twice* !
>
> A few changes I have in mind (need to think about it more):
>
> - I want to leave the existing RequestHandler interface in place, so
> current implementation continue to work
> - There will be a new AsyncRequestHandler interface (possibly extending
> RequestHandler, so an implementation can decide to implement both). The
> RequestCorrelator needs to have either request_handler or
> async_request_handler set. If the former is set, the logic is unchanged.
> If the latter is set I'll invoke the async dispatching code
>
> - AsyncRequestHandler will look similar to the following:
> void handle(Message request, Handback hb, boolean requires_response)
> throws Throwable;
> - Handback is an interface, and its impl contains header information
> (e.g. request ID)
> - Handback has a sendReply(Object reply, boolean is_exception) method
> which sends a response (or exception) back to the caller
>

+1 for a new interface. TBH I hadn't read the RequestCorrelator code, so I
had assumed it was already asynchronous, and only RpcDispatcher was
synchronous.

I'm not so sure about the Handback name, how about calling it Response
instead?



> - When requires_response is false, the AsyncRequestHandler doesn't need
> to invoke sendReply()
>
>
I think this should be the other way around: when requires_response is
true, the AsyncRequestHandler *can* invoke sendReply(), but is not required
to (the call will just time out on the caller node); when requires_response
is false, invoking sendReply() should throw an exception.


- Message batching
> - The above interfaces need to take message batching into account, e.g.
> the ability to handle multiple requests concurrently (if they don't need
> to be executed sequentially)
>
>
You mean handle() is still going to be called once for each request, but
second handle() call won't necessarily wait for the first message's
sendReply() call?

Is this going to apply only to OOB messages, or to regular messages as
well? I think I'd prefer it if it only applied to OOB messages, otherwise
we'd have to implement our own ordering for regular/async commands.



>
> Thoughts ?
>
>
>

> On 2/6/13 8:29 PM, Pedro Ruivo wrote:
> > Hi all,
> >
> > Recently I came up with a solution that can help with the thread pool
> > problem motivated by the following:
> >
> > In one of the first implementation of Total Order based commit
> > protocol (TO), I had the requirement to move the PrepareCommand to
> > another thread pool. In resume, the TO protocol delivers the
> > PrepareCommand in a deterministic order in all the nodes, by a single
> > deliver thread. To ensure consistency, if it delivers two conflicting
> > transactions, the second transaction must wait until the first
> > transaction finishes. However, blocking single deliver thread is not a
> > good solution, because no more transaction can be validated, even if
> > they don't conflict, while the thread is blocked.
> >
> > So, after creating a dependency graph (i.e. the second transaction
> > knows that it must wait for the first transaction to finish) I move
> > the PrepareCommand to another thread pool. Initially, I implemented a
> > new command, called PrepareResponseCommand, that sends back the reply
> > of the PrepareCommand. This solution has one disadvantage: I had to
> > implement an ack collector in ISPN, while JGroups already offers me
> > that with a synchronous communication.
> >
> > Recently (2 or 3 months ago) I implemented a simple modification in
> > JGroups. In a more generic approach, it allows other threads to reply
> > to a RPC request (such as the PrepareCommand). In the previous
> > scenario, I replaced the PrepareResponseCommand and the ack collector
> > implementation with a synchronous RPC invocation. I've used this
> > solution in other issues in the Cloud-TM's ISPN fork.
> >
> > This solution is quite simple to implement and may help you to move
> > the commands to ISPN internal thread pools. The modifications I've
> > made are the following:
> >
> > 1) I added a new interface (see [1]) that is sent to the application
> > instead of the Message object (see [4]). This interface contains the
> > Message and it has a method to allow the application send the reply to
> > that particular request.
> > 2) I added a new object in [4] with the meaning: this return value is
> > not the reply to the RPC request. This is the returned value that I
> > return when I want to release the thread, because ISPN should return
> > some object in the handle() method. Of course, I know that ISPN will
> > invoke the sendReply() in some other place, otherwise, I will get a
> > TimeoutException in the sender side.
> > 3) Also I've changed the RequestCorrelator implementation to support
> > the previous modifications (see [2] and [3])
> >
> > In the Cloud-TM's ISPN fork I added a reference in the
> > BaseCacheRpcCommand to [1] and added the method sendReply() [5]. In
> > addition, I have the following uses cases working perfectly with this:
> >
> > 1) Total Order
> >
> > The scenario described in the beginning. The ParallelTotalOrderManager
> > returns the DO_NOT_REPLY object when it receives a remote
> > PrepareCommand (see [6] line 77). When the PrepareCommand is finally
> > processed by the rest of the interceptor chain, it invokes the
> > PreapreCommand.sendReply() (see [6] line 230).
> >
> > 2) GMU remote get
> >
> > GMU ensures SERIALIZABLE Isolation Level and the remote gets must
> > ensure that the node that is processing the request has a minimum
> > version available to ensure data consistency. The problem in ours
> > initial implementation in large cluster, is the number of remote gets
> > are very high and all the OOB are being blocked because of this
> condition.
> >
> > Same thing I've done with the ClusteredRemoteGet as you can in see
> > [7], line 93 and 105.
> >
> > 3) GMU CommitCommand
> >
> > In GMU, the CommitCommand cannot be processed by any order. If T1 is
> > serialized before T2, the commit command of T1 must be processed
> > before the commit command of T2, even if the transactions do not have
> > conflicts. This generates the same problem above and the same solution
> > was adopted.
> >
> > I know that you have discussed some solutions and I would like to know
> > what it is your opinion about what I've described.
> >
> > If you have questions, please let me know.
> >
> > Cheers,
> > Pedro
> >
> > [1]
> >
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/MessageRequest.java
> > <
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/MessageRequest.java
> >
> > [2]
> >
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L463
> > <
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L463
> >
> > [3]
> >
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L495
> > <
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L495
> >
> > [4]
> >
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestHandler.java
> > <
> https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestHandler.java
> >
> > [5]
> >
> https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/BaseRpcCommand.java#L75
> > <
> https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/BaseRpcCommand.java#L75
> >
> > [6]
> >
> https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/transaction/totalorder/ParallelTotalOrderManager.java
> > [7]
> >
> https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/GMUClusteredGetCommand.java
> >
> >
> > On 2/3/13 11:35 AM, Bela Ban wrote:
> >> If you send me the details, I'll take a look. I'm pretty busy with
> >> message batching, so I can't promise next week, but soon...
> >>
> >> On 2/1/13 11:08 AM, Pedro Ruivo wrote:
> >>> Hi,
> >>>
> >>> I had a similar problem when I tried GMU[1] in "large" cluster (40
> vms),
> >>> because the remote gets and the commit messages (I'm talking about ISPN
> >>> commands) must wait for some conditions before being processed.
> >>>
> >>> I solved this problem by adding a feature in JGroups[2] that allows the
> >>> request to be moved to another thread, releasing the OOB thread. The
> >>> other thread will send the reply of the JGroups Request. Of course, I'm
> >>> only moving commands that I know they can block.
> >>>
> >>> I can enter in some detail if you want =)
> >>>
> >>> Cheers,
> >>> Pedro
> >>>
> >>> [1]http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf
> >>> [2] I would like to talk with Bela about this, because it makes my life
> >>> easier to support total order in ISPN. I'll try to send an email this
> >>> weekend =)
> >>>
> >>> On 01-02-2013 08:04, Radim Vansa wrote:
> >>>> Hi guys,
> >>>>
> >>>> after dealing with the large cluster for a while I find the way how
> we use OOB threads in synchronous configuration non-robust.
> >>>> Imagine a situation where node which is not an owner of the key calls
> PUT. Then the a RPC is called to the primary owner of that key, which
> reroutes the request to all other owners and after these reply, it replies
> back.
> >>>> There are two problems:
> >>>> 1) If we do simultanously X requests from non-owners to the primary
> owner where X is OOB TP size, all the OOB threads are waiting for the
> responses and there is no thread to process the OOB response and release
> the thread.
> >>>> 2) Node A is primary owner of keyA, non-primary owner of keyB and B
> is primary of keyB and non-primary of keyA. We got many requests for both
> keyA and keyB from other nodes, therefore, all OOB threads from both nodes
> call RPC to the non-primary owner but there's noone who could process the
> request.
> >>>>
> >>>> While we wait for the requests to timeout, the nodes with depleted
> OOB threadpools start suspecting all other nodes because they can't receive
> heartbeats etc...
> >>>>
> >>>> You can say "increase your OOB tp size", but that's not always an
> option, I have currently set it to 1000 threads and it's not enough. In the
> end, I will be always limited by RAM and something tells me that even nodes
> with few gigs of RAM should be able to form a huge cluster. We use 160
> hotrod worker threads in JDG, that means that 160 * clusterSize = 10240 (64
> nodes in my cluster) parallel requests can be executed, and if 10% targets
> the same node with 1000 OOB threads, it stucks. It's about scaling and
> robustness.
> >>>>
> >>>> Not that I'd have any good solution, but I'd really like to start a
> discussion.
> >>>> Thinking about it a bit, the problem is that blocking call (calling
> RPC on primary owner from message handler) can block non-blocking calls
> (such as RPC response or command that never sends any more messages).
> Therefore, having a flag on message "this won't send another message" could
> let the message be executed in different threadpool, which will be never
> deadlocked. In fact, the pools could share the threads but the non-blocking
> would have always a few threads spare.
> >>>> It's a bad solution as maintaining which message could block in the
> other node is really, really hard (we can be sure only in case of RPC
> responses), especially when some locks come. I will welcome anything better.
> >>>>
> >>>> Radim
> >>>>
> >>>>
> >>>> -----------------------------------------------------------
> >>>> Radim Vansa
> >>>> Quality Assurance Engineer
> >>>> JBoss Datagrid
> >>>> tel. +420532294559 ext. 62559
> >>>>
> >>>> Red Hat Czech, s.r.o.
> >>>> Brno, Purkyňova 99/71, PSČ 612 45
> >>>> Czech Republic
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> infinispan-dev mailing list
> >>>> infinispan-dev at lists.jboss.org
> >>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> >>> _______________________________________________
> >>> infinispan-dev mailing list
> >>> infinispan-dev at lists.jboss.org
> >>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> >
> >
> > _______________________________________________
> > infinispan-dev mailing list
> > infinispan-dev at lists.jboss.org
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20130207/b7fc9d1e/attachment-0001.html 


More information about the infinispan-dev mailing list