Yeah, I wouldn't call this a "simple" solution...

The distribution/replication interceptors are quite high in the interceptor stack, so we'd have to save the state of the interceptor stack (basically the thread's stack) somehow and resume processing it on the thread receiving the responses. In a language that supports continuations that would be a piece of cake, but since we're in Java we'd have to completely change the way the interceptor stack works.

Actually we do hold the lock on modified keys while the command is replicated to the other owners. But think locking wouldn't be a problem: we already allow locks to be owned by transactions instead of threads, so it would just be a matter of creating a "lite transaction" for non-transactional caches. Obviously the TransactionSynchronizerInterceptor would have to go, but I see that as a positive thing ;)

So yeah, it could work, but it would take a huge amount of effort and it's going to obfuscate the code. Plus, I'm not at all convinced that it's going to improve performance that much compared to a new thread pool.

Cheers

Dan

On Fri, Feb 1, 2013 at 10:59 AM, Radim Vansa <rvansa@redhat.com> wrote:

Yeah, that would work if it is possible to break execution path into the FutureListener from the middle of interceptor stack - I am really not sure about that but as in current design no locks should be held when a RPC is called, it may be possible.

Let's see what someone more informed (Dan?) would think about that.

Thanks, Bela

Radim

----- Original Message -----
| From: "Bela Ban" <bban@redhat.com>
| To: infinispan-dev@lists.jboss.org
| Sent: Friday, February 1, 2013 9:39:43 AM
| Subject: Re: [infinispan-dev] Threadpools in a large cluster
|
| It looks like the core problem is an incoming RPC-1 which triggers
| another blocking RPC-2: the thread delivering RPC-1 is blocked
| waiting
| for the response from RPC-2, and can therefore not be used to serve
| other requests for the duration of RPC-2. If RPC-2 takes a while,
| e.g.
| waiting to acquire a lock in the remote node, then it is clear that
| the
| thread pool will quickly exceed its max size.
|
| A simple solution would be to prevent invoking blocking RPCs *from
| within* a received RPC. Let's take a look at an example:
| - A invokes a blocking PUT-1 on B
| - B forwards the request as blocking PUT-2 to C and D
| - When PUT-2 returns and B gets the responses from C and D (or the
| first
| one to respond, don't know exactly how this is implemented), it sends
| the response back to A (PUT-1 terminates now at A)
|
| We could change this to the following:
| - A invokes a blocking PUT-1 on B
| - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D,
| it
| does the following:
| - B invokes PUT-2 and gets a future
| - B adds itself as a FutureListener, and it also stores the
| address of the original sender (A)
| - When the FutureListener is invoked, B sends back the result
| as a
| response to A
| - Whenever a member leaves the cluster, the corresponding futures are
| cancelled and removed from the hashmaps
|
| This could probably be done differently (e.g. by sending asynchronous
| messages and implementing a finite state machine), but the core of
| the
| solution is the same; namely to avoid having an incoming thread block
| on
| a sync RPC.
|
| Thoughts ?
|
|
|
|
| On 2/1/13 9:04 AM, Radim Vansa wrote:
| > Hi guys,
| >
| > after dealing with the large cluster for a while I find the way how
| > we use OOB threads in synchronous configuration non-robust.
| > Imagine a situation where node which is not an owner of the key
| > calls PUT. Then the a RPC is called to the primary owner of that
| > key, which reroutes the request to all other owners and after
| > these reply, it replies back.
| > There are two problems:
| > 1) If we do simultanously X requests from non-owners to the primary
| > owner where X is OOB TP size, all the OOB threads are waiting for
| > the responses and there is no thread to process the OOB response
| > and release the thread.
| > 2) Node A is primary owner of keyA, non-primary owner of keyB and B
| > is primary of keyB and non-primary of keyA. We got many requests
| > for both keyA and keyB from other nodes, therefore, all OOB
| > threads from both nodes call RPC to the non-primary owner but
| > there's noone who could process the request.
| >
| > While we wait for the requests to timeout, the nodes with depleted
| > OOB threadpools start suspecting all other nodes because they
| > can't receive heartbeats etc...
| >
| > You can say "increase your OOB tp size", but that's not always an
| > option, I have currently set it to 1000 threads and it's not
| > enough. In the end, I will be always limited by RAM and something
| > tells me that even nodes with few gigs of RAM should be able to
| > form a huge cluster. We use 160 hotrod worker threads in JDG, that
| > means that 160 * clusterSize = 10240 (64 nodes in my cluster)
| > parallel requests can be executed, and if 10% targets the same
| > node with 1000 OOB threads, it stucks. It's about scaling and
| > robustness.
| >
| > Not that I'd have any good solution, but I'd really like to start a
| > discussion.
| > Thinking about it a bit, the problem is that blocking call (calling
| > RPC on primary owner from message handler) can block non-blocking
| > calls (such as RPC response or command that never sends any more
| > messages). Therefore, having a flag on message "this won't send
| > another message" could let the message be executed in different
| > threadpool, which will be never deadlocked. In fact, the pools
| > could share the threads but the non-blocking would have always a
| > few threads spare.
| > It's a bad solution as maintaining which message could block in the
| > other node is really, really hard (we can be sure only in case of
| > RPC responses), especially when some locks come. I will welcome
| > anything better.
|
| --
| Bela Ban, JGroups lead (http://www.jgroups.org)
|
| _______________________________________________
| infinispan-dev mailing list
| infinispan-dev@lists.jboss.org
| https://lists.jboss.org/mailman/listinfo/infinispan-dev
|
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev