ClusteredCacheLoader deadlocks and JBCCACHE-1103

Manik Surtani

Wednesday, 13 June 2007

10:52 a.m.

Looking through this in detail, I see that the main problem is my assumption that the timeout parameter in the JGroups RpcDispatcher's callRemoteMethods() only starts ticking once the transport layer puts the call on the wire. IMO this is misleading and wrong; the timeout should start ticking the moment I make an API call into the RpcDispatcher. The very fact that I provide a timeout means that I shouldn't ever expect the call to block forever, no matter what the circumstance (a bit like calling Object.wait() with no params vs calling it with a timeout).

Just had a chat with Jason to bounce a few ideas off him, and here is what I can do in JBoss Cache to work around this (IMO an ugly workaround):

*** JBC workaround (does not work; just looked through FC code in HEAD and it swallows InterruptedException - move on!)

- All calls to the RpcDispatcher register with a Map, containing a reference to the Thread and the timeout, before making the RpcDispatcher call
- JBC runs a separate "RpcTimeoutMonitor" thread which periodically checks threads making RpcDispatcher calls against their timeouts, interrupting those that have taken too long.
- The code calling the RpcDispatcher is wrapped in a try block that catches InterruptedException and throws a timeout exception to signify RPC timeout.

The problem with this approach is the extra overhead of an RpcTimeoutMonitor thread. The fact that the timeout will not be 100% accurate is not a problem - a "best effort" is good enough, so a call that only times out after 2200 ms even though it was made with a timeout param of 2000 should not be of consequence. At least calls don't get stuck, regardless of why or where in the RPC process they are held up.

*** More inefficient JBC workaround

- All calls to the RpcDispatcher happen in a separate thread, using a ThreadPoolExecutor
- The app thread then waits for timeout ms, and if the RPC call hasn't completed, throws a timeout exception - so even if the thread is stuck in, say, FC, at least JBC can roll back the tx and release locks, etc.

Inefficient because each and every RPC call will happen in a separate thread + potential ugliness around orphaned threads stuck in a JGroups protocol.

*** JGroups fix - I think this can be done easily, where any blocking operations in JGroups protocols make use of the timeout parameter. Again, this will not provide 100% timeout accuracy, only a "best effort", but like I said IMO this is ok. (At the moment FC loops until it has enough creds. I think this loop should time out using the same timeout param.) Passing this param will involve a transient field in Event which RequestCorrelator could use to set the timeout. Protocols like FC can then use this timeout to determine how long they should loop for when waiting for creds.

Thoughts? My preferred option is the last one, since it gives the timeout param in the RpcDispatcher more meaning.

Cheers,

-- Manik Surtani Lead, JBoss Cache JBoss, a division of Red Hat
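The "more inefficient JBC workaround" described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain java.util.concurrent with modern Java lambdas, the class and method names (TimedRpcSketch, callWithTimeout) are hypothetical, and the Callable stands in for whatever blocking RpcDispatcher call the cache would make. It is not JBoss Cache or JGroups code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedRpcSketch {
    /** Runs a blocking call on a pool thread and gives up waiting after timeoutMs. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> rpc, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(rpc);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, which may or may not react.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            String fast = callWithTimeout(pool, () -> "value", 2000);
            System.out.println("fast call returned " + fast);
            try {
                // Hypothetical stand-in for a call stuck somewhere below the caller.
                callWithTimeout(pool, () -> { Thread.sleep(60_000); return "late"; }, 200);
            } catch (TimeoutException e) {
                System.out.println("slow call timed out; the caller can roll back and release locks");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller gets control back after the timeout even if the worker stays blocked, which is the point of the workaround; the orphaned-thread concern raised above still applies, because cancel(true) cannot force a worker out of code that ignores interrupts.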

Bela Ban

Thursday, 14 June

4:23 a.m.

...

Where does FC swallow an InterruptedException? The only place where I catch an exception is in handleDownMessage():

    catch(InterruptedException e) {
        // set the interrupted flag again, so the caller's thread can handle the interrupt as well
        Thread.currentThread().interrupt();
    }

This does not swallow the exception; it rather passes it on to the calling thread, as suggested by JCIP.
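A small, self-contained sketch of the idiom in that catch block may help. The method blockingDown() below is a hypothetical stand-in for any blocking protocol method; it is not the FC source. The sketch shows that restoring the flag lets code further up the stack observe the interrupt, although the method itself still returns normally, so the caller has to check the flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptRestoreSketch {
    /** Hypothetical blocking method that restores the interrupt flag instead of dropping it. */
    static void blockingDown(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Thread.sleep() cleared the flag when it threw; set it again for the caller.
            Thread.currentThread().interrupt();
        }
    }

    /** Interrupts a worker that calls blockingDown() and reports whether it still saw the flag. */
    static boolean callerSeesInterrupt() throws InterruptedException {
        AtomicBoolean seen = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            blockingDown(10_000);
            seen.set(Thread.currentThread().isInterrupted());
        });
        worker.start();
        worker.interrupt();
        worker.join();
        return seen.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("caller saw interrupt: " + callerSeesInterrupt());
    }
}
```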

...

My preferred option is the last one, since it gives the timeout param in the RpcDispatcher more meaning.

No, that's a bad solution because the design of a flow control protocol should not be influenced by an application level workaround. In addition, if you have 500 threads, all timing out at the same time, you will have a steady flow of messages, defeating the purpose of flow control in the first place.

On top of that, we *cannot* do that because if we run on top of TCP, a write() might block anyway if the TCP receiver set the sliding window to 0! So the sending of data on top of TCP will block (similar to FC) when TCP throttles the sender.

By the way, some decades ago the same issue of 'timed method calls' occurred in CORBA, e.g. invoke foo() but it should take 450ms tops.

What *could* be done here is to add an option to RpcDispatcher to use separate threads from a thread pool to dispatch requests and correlate responses. So, the caller would create a task (which sends the request and waits for all responses, possibly listening for cancellation), submit it to the pool and get a future. Then wait on the future for N milliseconds and return with the current results after that time, or throw a timeout exception, whatever.

This issue cannot and should not be solved at the FC level, by 'bypassing' flow control! Note that, under normal circumstances, and with 2.5, FC should never block for an extended time.

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat
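The "wait N milliseconds and return with the current results" variant suggested here could look something like the sketch below. It is an assumption-laden illustration, not the RpcDispatcher API: each Callable is a hypothetical per-member request, and the names PartialResultsSketch and gather are invented for the example. It relies on ExecutorService.invokeAll with a timeout, which cancels any task that has not finished when the time is up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PartialResultsSketch {
    /** Collects the responses that arrived within timeoutMs; late responders are cancelled. */
    static <T> List<T> gather(ExecutorService pool, List<Callable<T>> requests, long timeoutMs)
            throws InterruptedException {
        List<T> results = new ArrayList<>();
        for (Future<T> f : pool.invokeAll(requests, timeoutMs, TimeUnit.MILLISECONDS)) {
            if (f.isCancelled()) {
                continue; // this member did not answer in time
            }
            try {
                results.add(f.get()); // already complete, so this does not block
            } catch (ExecutionException e) {
                // A failed responder contributes no result in this sketch.
            }
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            List<Callable<String>> requests = new ArrayList<>();
            requests.add(() -> "node1");
            requests.add(() -> { Thread.sleep(30_000); return "node2"; }); // slow member
            requests.add(() -> "node3");
            System.out.println(gather(pool, requests, 500));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Throwing a timeout exception instead of returning partial results would be a one-line change in gather(); which of the two is appropriate depends on the caller, as the message above leaves open.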

Brian Stansberry

7:48 a.m.

...

Bela Ban wrote:
> Is this with JGroups 2.5 or 2.4? In versions prior to 2.5, we recommended removing FC when making synchronous method calls.
>
> From 1103:
> 1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so the CCL tries to do a clustered get. The CacheLoaderInterceptor lock for fqnA is held at this time.
> 2) Thread1 blocks in FC waiting for credits.
> 3) Replication message for fqnA arrives from node2.
> 4) IncomingPacketHandler thread will block waiting for the CacheLoaderInterceptor lock for fqnA.
> 5) FC credits cannot arrive, so we deadlock.
>
> Why can credit responses not arrive? This is 2.5, not 2.4, right? In 2.5, credit responses arrive as OOB messages, so they will always be delivered, unless you disabled the OOB thread pool. Or is this 2.4? Then remove FC from the stack. Oh shit, just checked the case, this *is* 1.4.1 SP3! Yes, then remove FC from the stack; because you make synchronous method calls, this is not an issue as we don't need flow control in that case.
>
> Comments inline.

-- Brian Stansberry Lead, AS Clustering JBoss, a division of Red Hat brian.stansberry(a)redhat.com

Manik Surtani

11:39 a.m.

Ok, so I think we have a solution here.

Bela's created http://jira.jboss.com/jira/browse/JGRP-533 (see task for details on this).

Once this is done (JGroups 2.4.1.SP4) I can fix JBCACHE-1103 by patching the CCL to make use of this JGroups fix. This will mean a new release for JBC (1.4.1.SP4).

Cheers,
Manik

On 14 Jun 2007, at 13:48, Brian Stansberry wrote:

...

> This is http session repl; the overall cache is REPL_ASYNC, so FC is needed. It is only the "clustered get" call made by the ClusteredCacheLoader that makes a sync call. The CCL directly uses the RpcDispatcher and uses a sync call since that is the semantic needed for that particular call.
>
> Background -- use of CCL is an experiment to try to get around problems with initial state transfer with large states.
>
> Good point about OOB in 2.5; that should prevent this situation. :)

-- Manik Surtani Lead, JBoss Cache JBoss, a division of Red Hat

Jimmy Wilson

2:10 p.m.

[I subscribed as jimmy.wilson(a)redhat.com not jawilson(a)redhat.com, so it took me a while to reply. That's been fixed...]

...

> Ok, so I think we have a solution here.

Great!

...

> Bela's created http://jira.jboss.com/jira/browse/JGRP-533 (see task for details on this)

I can build, run the testsuite, and provide a test release to the customer once this is complete just as we've already done for JGRP-455.

...

> Once this is done (JGroups 2.4.1.SP4) I can fix JBCACHE-1103 by patching the CCL to make use of this JGroups fix. This will mean a new release for JBC (1.4.1.SP4)

I can do the same here by running through the AS SR tests using the modified CCL, but those tests didn't catch this issue either. I can easily cut a test JAR of JBC for the customer to try whenever we're ready to do so, if you like.

Jimmy

-- Jimmy Wilson jimmy.wilson(a)redhat.com

Jimmy Wilson

5:12 p.m.

I know it is not a popular question, but the customer is asking for a timeline for these fixes. I've told him that the scope of changes is well defined, so we know it will not be too long; however, he is getting anxious because I think they wish to roll out to QA before long.

Is there anything else I can do to help speed up the process beyond what I've already volunteered myself for? Even if we don't have a good answer at the moment, that's okay. I just wanted to see if we could postulate anything...

Jimmy

-- Jimmy Wilson jimmy.wilson(a)redhat.com

Bela Ban

Friday, 15 June

2:10 a.m.

Let's properly discuss the solution first; as I explained in my previous email, the JIRA issue in JGroups won't help. The underlying issue is a synchronous call while holding a lock.

Jimmy Wilson wrote:

...

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat

Bela Ban

2:09 a.m.

I looked at 1103 in a bit more detail and concluded that the change in JGroups (http://jira.jboss.com/jira/browse/JGRP-533) would not help. The underlying issue is that (a) 2.4.1 has a single incoming request queue and (b) the ClusteredCacheLoader holds a lock while making a cluster-wide call. Let's look at an example:

1. The CCL acquires a lock on Fqn-A and makes a cluster-wide call.
2. We get a replication message for A (RM-A), so someone made an update to A and is now trying to commit the change. RM-A tries to acquire the lock on A, but is blocked because the CCL holds it.
3. Now a result for the CCL call arrives. It is not processed (single queue) until RM-A gets processed. However, that's not the case until the CCL call completes. In this case, the only way for the CCL call to complete is via a timeout, as it will never get its results.

So even if I implemented 533, it wouldn't help, as the interleaving between CCL calls and RM messages for the same FQNs would lead to timeouts.

Now, a possible solution to 1103 is that we make the CCL call *without* holding a lock. When we get the result(s), only *then* do we acquire a lock and update the FQN. We also need to check whether FQN-A was updated in the meantime and then decide which value to return (the value set by the RM or the one gotten from the CCL call).

WDYT?

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat
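The "call without the lock, then lock and reconcile" idea from this message might be sketched as below. Everything here is hypothetical: the class LoadThenLockSketch, the String-keyed maps, and the remoteFetch function stand in for the cache, its nodes, and the clustered get. The sketch also assumes one particular answer to the open question of which value wins, namely that a value written by a replication message takes precedence over the fetched one.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

class LoadThenLockSketch {
    private final ConcurrentHashMap<String, String> nodes = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String fqn) {
        return locks.computeIfAbsent(fqn, k -> new ReentrantLock());
    }

    /** Applies a replicated update under the per-node lock. */
    void applyReplication(String fqn, String value) {
        ReentrantLock lock = lockFor(fqn);
        lock.lock();
        try {
            nodes.put(fqn, value);
        } finally {
            lock.unlock();
        }
    }

    /** Loads a missing node: remote call first with no lock held, then lock and reconcile. */
    String get(String fqn, Function<String, String> remoteFetch) {
        String local = nodes.get(fqn);
        if (local != null) {
            return local;
        }
        String fetched = remoteFetch.apply(fqn); // may block; no lock is held here
        ReentrantLock lock = lockFor(fqn);
        lock.lock();
        try {
            String current = nodes.get(fqn);
            if (current != null) {
                return current; // assumption: a replicated update wins over the fetched value
            }
            if (fetched != null) {
                nodes.put(fqn, fetched);
            }
            return fetched;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        LoadThenLockSketch cache = new LoadThenLockSketch();
        // Simulate a replication message arriving while the remote fetch is in flight.
        String value = cache.get("/a", fqn -> {
            cache.applyReplication(fqn, "replicated");
            return "fetched";
        });
        System.out.println(value);
    }
}
```

Because the replication update can take the lock while the fetch is outstanding, neither side has to wait for a timeout in this arrangement.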

Manik Surtani

4:16 a.m.

...

-- Manik Surtani Lead, JBoss Cache JBoss, a division of Red Hat

Bela Ban

6:59 a.m.

I don't think timeouts are very useful in this scenario. You are liable to get updates from other nodes for A. You are liable to read A locally, causing the CCL to make a cluster-wide call. Both CCL calls and updates from other nodes are likely to be interspersed. Each time this occurs you have to wait until the timeout elapses, say that's 10 seconds. This might even cause the remote update(s) to fail. This is compounded when you have more than 1 thread making updates and/or causing CCL calls to happen, which is likely with HTTP session replication (not sure though if they hit the same region, maybe only happens with multi-frame or AJAX-like apps). Yes, deadlocks are avoided, but the system is practically unusable in such a case.

Fact is that it is bad practice to hold a lock and then make a cluster-wide call, as this is liable to deadlock (or timeout-lock). I wonder if the lock mechanism in the CacheLoader can be overridden in CCL, so that you don't acquire a lock until after the remote call has been made?

In general, one should never hold a lock while doing something that can potentially block, e.g. in JGroups I do the following (edited):

    while(length > lowest_credit && running) {
        boolean rc=credits_available.await(max_block_time, TimeUnit.MILLISECONDS);
        if(rc || length <= lowest_credit || !running)
            break;
        long wait_time=System.currentTimeMillis() - last_credit_request;
        if(wait_time >= max_block_time) {
            last_credit_request=System.currentTimeMillis();
            // we need to send the credit requests down *without* holding the sent_lock, otherwise we might
            // run into the deadlock described in http://jira.jboss.com/jira/browse/JGRP-292
            Map<Address,Long> sent_copy=new HashMap<Address,Long>(sent);
            sent_copy.keySet().retainAll(creditors);
            sent_lock.unlock();
            try {
                for(Map.Entry<Address,Long> entry: sent_copy.entrySet()) {
                    sendCreditRequest(entry.getKey(), entry.getValue());
                }
            }
            finally {
                sent_lock.lock();
            }
        }
    }

Manik Surtani wrote:

...

> Yes, it would timeout, but that is better than the deadlock that currently occurs. IMO the timeout is a valid response to such a call. If the CCL cannot complete a remote call because of a remote lock, it should timeout. And when it does, it releases the lock on Fqn-A and the update originating remotely can proceed.
>
> The problem with making the CCL call without a lock is that this exposes concurrent loading and overwriting for all cache loaders (the CCL is treated as a simple cache loader impl by the CacheLoaderInterceptor). This also has implications with race conditions on eviction. Without locks in the cache loader interceptor, the following could happen: Thread-1 does a get(), goes through the cache loader interceptor, sees the node requested in memory and does not load. Eviction-thread gets a WL on the same node and evicts it. Thread-1 now gets to the PessimisticLockInterceptor, cannot find the node, and since this is a get() call, doesn't create the node but returns a null.

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat

Jason T. Greene

7 a.m.

On Fri, 2007-06-15 at 09:09 +0200, Bela Ban wrote:

...

That should be ok though because the CCL will still timeout on that lock. The original problem was that the same thread with the CCL lock was blocking on an FC lock so that the CCL lock would never be released (since the FC lock was higher in the stack).

-Jason

-- Jason T. Greene Lead, POJO Cache JBoss, a division of Red Hat

Bela Ban

7:09 a.m.

Jason T. Greene wrote:

...

See my previous reply. Yes, it blocked on FC.down() because it didn't receive credits in up(). But up() wasn't called because there was a replication message ahead of it in the queue that blocked on the FQN held by the CCL.

So to tackle this, my suggestions were, in this order:

#1 Don't hold a lock while making a synchronous cluster method call. That's a big no no, especially in pre-2.5 releases. We had lots of bugs in the clustering code due to such code. Then Brian cleaned up all of it... :-)
#2 The timeout mechanism in JGroups which uses threads. Ugly, and a hack, and only needed for 2.4. As I argued, this will avoid the deadlock, but it will constantly time out (assuming some traffic).

The root cause of this is #1.

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat
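The interleaving described in this message (a response queued behind a replication message that needs the caller's lock) can be reproduced in miniature. The sketch below is a toy model under stated assumptions, not JGroups code: a single-threaded executor plays the role of the single incoming queue, a ReentrantLock plays the Fqn lock, and a CompletableFuture plays the RPC response. All names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

class SingleQueueDeadlockSketch {
    /** Returns true if the caller timed out waiting for its response while holding the lock. */
    static boolean callWhileHoldingLock(long timeoutMs) throws Exception {
        ReentrantLock fqnLock = new ReentrantLock();
        ExecutorService incoming = Executors.newSingleThreadExecutor(); // one in-order delivery thread
        CompletableFuture<String> response = new CompletableFuture<>();
        fqnLock.lock(); // the caller locks the node before making its "remote call"
        try {
            // A replication message for the same node is delivered first and needs the lock.
            incoming.execute(() -> {
                fqnLock.lock();
                fqnLock.unlock();
            });
            // The response to the caller is queued behind the replication message.
            incoming.execute(() -> response.complete("value"));
            try {
                response.get(timeoutMs, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true; // the only way out while the lock is still held
            }
        } finally {
            fqnLock.unlock(); // releasing the lock lets the queue drain
            incoming.shutdown();
            incoming.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out while holding lock: " + callWhileHoldingLock(300));
    }
}
```

In this model a timeout does break the cycle, but every such interleaving costs the full timeout, which matches the "constantly time out" concern in #2.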

Manik Surtani

Monday, 18 June

5:48 a.m.

On 15 Jun 2007, at 13:09, Bela Ban wrote:

...

Let me look into why we had #1 anyway. Originally the 1.2.x codebase used a synchronized block on the CacheLoaderInterceptor for this which meant that only one thread could pass through this interceptor at any given time. I changed this to lock on the Fqn in question so at least if the Fqns didn't overlap multiple threads could go thru this interceptor. The reason behind it seems to be so that the CacheLoader impl does not have to deal with concurrent calls on the same node, but thinking about it, I feel this is something that should be handled in each CacheLoader impl, which should be thread safe.
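For readers unfamiliar with the difference between one synchronized block for the whole interceptor and a lock per Fqn, here is a minimal sketch. It is not the CacheLoaderInterceptor code; the class PerFqnLockSketch, its String keys and its helper methods are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

class PerFqnLockSketch {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** One lock per Fqn, created on first use (never evicted in this sketch). */
    ReentrantLock lockFor(String fqn) {
        return locks.computeIfAbsent(fqn, k -> new ReentrantLock());
    }

    /** Tries the lock for fqn once from the calling thread, releasing it again on success. */
    boolean isFree(String fqn) {
        ReentrantLock lock = lockFor(fqn);
        if (!lock.tryLock()) {
            return false;
        }
        lock.unlock();
        return true;
    }

    public static void main(String[] args) throws Exception {
        PerFqnLockSketch sketch = new PerFqnLockSketch();
        ExecutorService other = Executors.newSingleThreadExecutor();
        sketch.lockFor("/a").lock(); // this thread is "loading" /a
        try {
            Callable<Boolean> probeA = () -> sketch.isFree("/a");
            Callable<Boolean> probeB = () -> sketch.isFree("/b");
            System.out.println("/a free for another thread: " + other.submit(probeA).get());
            System.out.println("/b free for another thread: " + other.submit(probeB).get());
        } finally {
            sketch.lockFor("/a").unlock();
            other.shutdown();
        }
    }
}
```

With a single synchronized block both probes would have to wait; with per-Fqn locks only the thread touching the same node is held up.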

Manik Surtani

6:08 p.m.

On 18 Jun 2007, at 11:48, Manik Surtani wrote:

...

Ok, I spent a bit of time investigating this. The original sync blocks in the CacheLoaderInterceptor, superseded by the synchronization on an Fqn in subsequent versions, are all hacks to get around the fact that some cache loader implementations themselves aren't thread safe.

I wrote a test to check the thread safety of cache loader impls, and most of what we have fails. I patched the DummyInMemoryCacheLoader to be thread safe (a lot of synchronization within the impl) and removed the unnecessary locks in the interceptors, and things work fine. I tried this with some of the few thread-safe cache loader impls we have (BDBJE, for example) and this worked well too.

So what I propose is this:

1) Patch CLI to NOT use the Fqn locks (in BaseCacheLoaderInterceptor)
2) Make sure we acquire locks (lock() call up the interceptor stack) *before* attempting to load node, but after creating temp node
3) Make sure cache loader impls are thread safe
4) Add thread safety test to CacheLoaderTestBase
5) Patch loaders that aren't threadsafe and fail test in (4) - FileCacheLoader, JDBCCacheLoader, etc.

I'm going to do this on my local checkout of HEAD and run the regression tests overnight. If this works, I suggest backporting this to 1.4.x and regressing there.

WDYT?

Cheers,
-- Manik Surtani Lead, JBoss Cache JBoss, a division of Red Hat
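For illustration, a thread safety test of the kind proposed in step (4) might look roughly like the sketch below. Everything in it is an assumption made for the example: `SimpleLoader` is a hypothetical, heavily simplified stand-in for a cache loader (it is not the JBoss Cache CacheLoader interface), Fqns are plain strings, and the in-memory implementation simply synchronizes every method, in the spirit of the DummyInMemoryCacheLoader patch described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class LoaderThreadSafetySketch {

    /** Hypothetical, simplified stand-in for a cache loader (assumption, not the real API). */
    interface SimpleLoader {
        void put(String fqn, String key, String value);
        Map<String, String> get(String fqn);
        void remove(String fqn);
    }

    /** In-memory loader made thread safe by synchronizing every method on the instance. */
    static class SynchronizedInMemoryLoader implements SimpleLoader {
        private final Map<String, Map<String, String>> nodes = new HashMap<>();

        public synchronized void put(String fqn, String key, String value) {
            nodes.computeIfAbsent(fqn, f -> new HashMap<>()).put(key, value);
        }

        public synchronized Map<String, String> get(String fqn) {
            Map<String, String> data = nodes.get(fqn);
            return data == null ? null : new HashMap<>(data); // defensive copy
        }

        public synchronized void remove(String fqn) {
            nodes.remove(fqn);
        }
    }

    /** Hammers a single Fqn from several threads; returns how many operations threw. */
    static int stress(SimpleLoader loader, int threads, int opsPerThread)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                try {
                    start.await(); // release all workers together to maximise contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    try {
                        // Mix of concurrent writes, reads and removes on the same node.
                        switch ((id + i) % 3) {
                            case 0: loader.put("/a/b", "k" + id, "v" + i); break;
                            case 1: loader.get("/a/b"); break;
                            default: loader.remove("/a/b");
                        }
                    } catch (RuntimeException e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        start.countDown();
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            pool.shutdownNow();
            throw new IllegalStateException("workers did not finish; possible deadlock");
        }
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int failures = stress(new SynchronizedInMemoryLoader(), 8, 10_000);
        System.out.println("failed operations: " + failures);
    }
}
```

A test like this only catches loaders that throw or hang under contention. Detecting silent data corruption would need additional invariants checked on the loaded data.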

Jason T. Greene

6:14 p.m.

Simply awesome!

On Tue, 2007-06-19 at 00:08 +0100, Manik Surtani wrote:

...


-- Jason T. Greene Lead, POJO Cache JBoss, a division of Red Hat

Bruno Georges

Tuesday, 19 June

12:18 a.m.

+1

On 19 Jun 2007, at 01:14, Jason T. Greene wrote:

...


Bela Ban

3:04 a.m.

Is this going to solve JBCACHE-1103? I mean, +1 for the lock refactoring, sounds like it needs to be done anyway. But, because you have to be thread safe at one point, aren't you deferring the issue to a later stage? At one point you have to acquire a lock to prevent concurrent loading, and I'm afraid we'll have the same issue.

But, wait, if you defer lock acquisition till after the synchronous cluster method call, you should be fine. Is this what you have in mind?

On the same topic: I looked into interrupting clustered method calls in JGroups (http://jira.jboss.com/jira/browse/JGRP-533), and came to the conclusion that it isn't feasible. When a thread is blocked on FC.down(), waiting for credits, then I can interrupt it, but it will go right back into its loop waiting for credits, therefore blocking again. Unless I change the loop condition (running & waiting-for-credits), the thread will always block if there aren't enough credits available.

Thoughts?

Manik Surtani wrote:

...

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat
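To make the loop-condition point in the message above concrete, here is a minimal sketch of a credit wait that gives up at a deadline and lets an interrupt propagate instead of re-entering the loop. It is not the JGroups FC code: the `CreditGateSketch` class, its method names, and the idea of passing the timeout straight into the wait are assumptions for illustration, loosely following the "best effort" timeout suggested earlier in the thread.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical credit gate (not JGroups FC): waits for credits, but only until a deadline. */
final class CreditGateSketch {
    private long credits;

    CreditGateSketch(long initialCredits) {
        this.credits = initialCredits;
    }

    /**
     * Blocks until 'needed' credits are available or the timeout elapses.
     * Returns false on timeout. An interrupt is propagated to the caller
     * rather than swallowed, so the thread does not go back into the loop.
     */
    synchronized boolean acquire(long needed, long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (credits < needed) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false; // best-effort timeout: give up instead of blocking forever
            }
            wait(remainingMs); // wakes on replenish(), timeout, or a spurious wakeup
        }
        credits -= needed;
        return true;
    }

    /** Called when the receiver grants more credits. */
    synchronized void replenish(long amount) {
        credits += amount;
        notifyAll();
    }

    public static void main(String[] args) throws InterruptedException {
        CreditGateSketch gate = new CreditGateSketch(0);
        System.out.println("acquired without credits: " + gate.acquire(10, 100));
        gate.replenish(10);
        System.out.println("acquired after replenish: " + gate.acquire(10, 100));
    }
}
```

The caller then has to decide what a `false` return means, for example dropping the message and reporting an RPC timeout. As discussed elsewhere in the thread, this does not help when the block happens lower down, such as a TCP write() stalled by a zero receive window.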

Manik Surtani

6:27 a.m.

On 19 Jun 2007, at 09:04, Bela Ban wrote:

...

There are 2 locks involved here. One lock is on the Node, via the lock interceptor, which happens *after* the cluster method call. This is not the problem.

We also had a level of sync in the cache loader interceptor to prevent the same Fqn being loaded twice from the CL (or a simultaneous load and store, etc). This bit was the problem, since this sync had to happen *before* any loading was done by the CL.

Regarding the other CL impls, some of them already were threadsafe (BDBJE, Jdbm) and some weren't (JDBC, File). I created a StripedLock class which uses a set of ReentrantReadWriteLocks applied to the Fqn in question, and the JDBC and File CLs now use this to achieve thread safety (more efficient than synchronized blocks and better mem usage than one lock per Fqn - usual benefits of lock striping).

You're right that the improvements in locking here are not quite central to solving 1103, but somewhat related.

The real solution to 1103 is in the CCL - remotely originating put()s shouldn't need a load (and hence shouldn't need to wait on the Fqn being loaded) *only if* we're talking about the CCL, since CCLs don't perform a corresponding store (they're read-only). Therefore this logic cannot be in the CLI as it doesn't apply to all CL impls.

...


-- Manik Surtani Lead, JBoss Cache JBoss, a division of Red Hat
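As a rough illustration of the lock striping idea Manik describes in this message, the sketch below maps each Fqn onto one of a fixed number of ReentrantReadWriteLocks by hash. It is a guess at the general shape only, not the actual JBoss Cache StripedLock class: the class name, the power-of-two stripe count, the hash spreading, and the `readLock`/`writeLock` accessors are all assumptions.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical lock striping sketch; not the JBoss Cache StripedLock implementation. */
final class StripedLockSketch {
    private final ReentrantReadWriteLock[] stripes;

    StripedLockSketch(int stripeCount) {
        if (stripeCount <= 0 || Integer.bitCount(stripeCount) != 1) {
            throw new IllegalArgumentException("stripeCount must be a positive power of two");
        }
        stripes = new ReentrantReadWriteLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
    }

    /** Picks the stripe for an Fqn: equal Fqns always map to the same lock. */
    private ReentrantReadWriteLock stripeFor(Object fqn) {
        int h = fqn.hashCode();
        h ^= (h >>> 16); // spread high bits so poor hash codes still use all stripes
        return stripes[h & (stripes.length - 1)];
    }

    /** Shared lock for loads of the given Fqn. */
    Lock readLock(Object fqn) {
        return stripeFor(fqn).readLock();
    }

    /** Exclusive lock for stores and removes of the given Fqn. */
    Lock writeLock(Object fqn) {
        return stripeFor(fqn).writeLock();
    }

    public static void main(String[] args) {
        StripedLockSketch locks = new StripedLockSketch(16);
        Lock w = locks.writeLock("/a/b");
        w.lock();
        try {
            // A loader would write the node to its backing store (file, JDBC row, ...) here.
        } finally {
            w.unlock();
        }
        boolean sameStripe = locks.writeLock("/a/b") == locks.writeLock(new String("/a/b"));
        System.out.println("equal Fqns share a stripe: " + sameStripe);
    }
}
```

The trade-off is that unrelated Fqns hashing to the same stripe contend with each other, in exchange for a fixed, small number of lock objects regardless of how many nodes exist.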

Bela Ban

6:32 a.m.

...


-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat

Manik Surtani

9:28 a.m.

Scratch what I said about the BDBJE CL - it gets its knickers in a twist as well with enough CPUs/cores. The 4-CPU lab servers got the BDBJE engine into a deadlock with concurrent reads, writes and removes. Will have to apply the same StripedLock access check to this as well.

On 19 Jun 2007, at 12:32, Bela Ban wrote:

...

> OK, yes, having remote puts and CCL-induced gets not block on the same lock is certainly a much better solution. What will also help is JGroups 2.5, *should* we still encounter a deadlock.
>
> BTW: I assume using the Multiplexer here is out of the question? You could use a separate stack in the Multiplexer (e.g. "udp-repl-sync") which is only used by the CCL, and has no FC in it, and everyone else uses the async stack. But then again, there was a reason we didn't configure the Multiplexer to be the default in 2.4...

-- Manik Surtani Lead, JBoss Cache JBoss, a division of Red Hat

Jason T. Greene

Friday, 15 June

6:47 a.m.

On Thu, 2007-06-14 at 11:23 +0200, Bela Ban wrote: -snip-

...

The suggestion is to abort the message on such a timeout, not allow it to be sent, which would still be good flow control.

...

On top of that, we *cannot* do that because if we run on top of TCP, a write() might block anyway if the TCP receiver set the sliding window to 0 ! So the sending of data on top of TCP will block (similar to FC) when TCP throttles the sender.

Yes, it's a shame that Java doesn't set SNDTIMEO for OSes that support it. Even though this can happen, it shouldn't be common, and it should be only temporary.

-Jason

-- Jason T. Greene Lead, POJO Cache JBoss, a division of Red Hat

Bela Ban

7:04 a.m.

Jason T. Greene wrote:

...

Yes, it's a shame that Java doesn't set SNDTIMEO for OSes that support it. Even though this can happen, it shouldn't be common, and it should be only temporary.

It's not that uncommon, I just ran into it! See http://jira.jboss.com/jira/browse/JGRP-532 for details. When a receiver is blocked and doesn't pull data off of the socket, then the senders will sooner or later block on write().

-- Bela Ban Lead JGroups / JBoss Clustering team JBoss - a division of Red Hat

jbosscache-dev@lists.jboss.org

participants (6)

Bela Ban
Brian Stansberry
Bruno Georges
Jason T. Greene
Jimmy Wilson
Manik Surtani
