[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-6799:
-----------------------------------
On Sanne's proposal... I 100% agree that requests should *not* be modeled with threads, simply because there can be thousands of requests in flight and far fewer threads.
But as for bundling responses, I am not confident it would improve response time or throughput. I wouldn't prioritize such a bundler.
> OOB thread pool fills with threads trying to send remote get responses
> ----------------------------------------------------------------------
>
> Key: ISPN-6799
> URL: https://issues.jboss.org/browse/ISPN-6799
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.0.0.Alpha2, 8.2.2.Final
> Reporter: Dan Berindei
> Fix For: 9.0.0.Alpha3
>
>
> Note: This is a scenario that happens in the stress tests, with 4 nodes in dist mode, and 200+ threads per node doing only reads. I have not been able to reproduce it locally, even with a much lower OOB thread pool size and UFC.max_credits.
> We don't use the {{NO_FC}} flag, so threads sending both requests and responses can block in UFC/MFC. Remote gets are executed directly on the OOB thread, so when we run out of credits for one node, the OOB pool can quickly become full with threads waiting to send a remote get response to that node.
> While we can't send responses to that node, we won't send credits to it, either, as credits are only sent *after* the message has been processed by the application. That means OOB threads on all nodes will start blocking, trying to send remote get responses to us.
> This is made worse by our staggering of remote gets. As remote get responses block, the stagger timeout kicks in and we send even more remote gets, making it even harder for the system to recover.
> UFC/MFC can send a {{CREDIT_REQUEST}} message to ask for more credits. The {{REPLENISH}} messages are handled on JGroups' internal thread pool, so they are not blocked. However, the CREDIT_REQUEST can be sent at most once every {{UFC.max_block_time}} ms, so they can't be relied on to provide enough credits. With the default settings, the throughput would be {{max_credits / max_block_time == 2mb / 0.5s == 4mb/s}}, which is really small compared to regular throughput.
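As a quick sanity check of that figure, here is the same arithmetic in plain Java (a sketch; the constants are just the defaults quoted in the description):
{code}
// Worst-case throughput when relying on CREDIT_REQUEST/REPLENISH alone,
// using the defaults quoted above: UFC.max_credits = 2 MB replenished at
// most once per UFC.max_block_time = 500 ms.
public class CreditThroughput {
    public static void main(String[] args) {
        long maxCreditsBytes = 2L * 1024 * 1024;
        long maxBlockTimeMs = 500;
        double mbPerSec = (maxCreditsBytes / (double) (1024 * 1024))
                * (1000.0 / maxBlockTimeMs);
        System.out.printf("%.1f MB/s%n", mbPerSec);   // prints 4.0 MB/s
    }
}
{code}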
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-6799:
------------------------------------
[~sannegrinovero] that sounds way too similar to the purpose of the JGroups bundler to me. One thing I do agree with is having multiple threads do this: JGroups bundlers usually have a single thread sending all the messages, and I think that can become a bottleneck in some cases. But I still think JGroups is the proper place to experiment with this, not Infinispan.
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-6799:
-----------------------------------
{quote}> No more context switches on the hot path, please...
I'd say once the OOB thread pool is getting full, we're no longer on the hot path{quote}
True; I was talking about executing all reads in the remote thread pool. Checking the status of the thread pool and deciding based on that seems rather complicated to me, and I prefer a simple, clean design to such workarounds.
I'll add our IRC chat to complement:
{quote}
(02:10:13 PM) dberindei: rvansa: I have also asked Bela about non-blocking FC, but he says that would defeat the purpose of FC :)
(02:10:53 PM) rvansa: dberindei: I don't agree - it won't send any further message, so it does block
(02:11:17 PM) rvansa: dberindei: I mean 'block' in the sense of the purpose
(02:11:45 PM) dberindei: rvansa: if you have a thread that sends GET_NONE messages, the only way to really throttle it is to block
(02:11:47 PM) rvansa: dberindei: I agree that app threads should be blocked in FC
(02:13:10 PM) dberindei: rvansa: I think our threads should be prevented from sending too many messages, too
(02:13:31 PM) dberindei: rvansa: I agree that blocking is bad, but we need to come up with an alternative
(02:14:36 PM) rvansa: dberindei: you will throttle responses through the queue
(02:14:52 PM) dberindei: rvansa: wouldn't the queue also block in order to throttle?
(02:15:52 PM) rvansa: dberindei: the thing is that, assuming limited number of app threads, they'll still be throttled on the remote side, because they won't get the RPC response
(02:16:34 PM) dberindei: rvansa: yeah, except in our tests we usually have more client threads than OOB threads :)
(02:17:01 PM) rvansa: and?
(02:17:35 PM) dberindei: rvansa: and you still get way too many messages in that queue, I think
(02:18:05 PM) dberindei: rvansa: it should be based on the capacity of the "server", not on the capacity of the client to send requests
(02:19:21 PM) dberindei: rvansa: anyway, I'm not the one you need to convince :)
(02:19:26 PM) rvansa: dberindei: It's not client-server, it's p2p
(02:19:47 PM) rvansa: dberindei: you'll apply backpressure to the app threads when they won't get the response. True, if the app won't react to that, the other party will blow up, as the queue won't be able to handle that
(02:20:30 PM) dberindei: rvansa: exactly, we need to handle async operations just as well as sync operations
(02:20:45 PM) dberindei: rvansa: it won't be long until the HotRod server is also async
(02:20:55 PM) rvansa: dberindei: But then you need another mechanism to block the demanding app
(02:21:05 PM) rvansa: dberindei: deadlock != throttling
(02:21:43 PM) dberindei: rvansa: I'm not saying we have to deadlock, I'm just saying we need to throttle our threads as well, not just the application threads
(02:23:42 PM) rvansa: And btw., it's not up to Bela to decide what Infinispan needs, you can always write your ASYNC_UFC/MFC. Even without the ugly capital letters :)
(02:23:50 PM) rvansa: dberindei: I don't agree here
(02:24:10 PM) rvansa: dberindei: having zillion of threads blocked won't help anyone
{quote}
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-6799:
------------------------------------
> No more context switches on the hot path, please...
I'd say once the OOB thread pool is getting full, we're no longer on the hot path :)
> What about having async FC for responses only? If it's a response and we run out of credits, the response would be queued, instead of blocking the thread.
How is that better than using NO_FC for response messages?
> I think that the core concept for future versions of Infinispan is that only application threads can block.
I agree. But I also think we need a way to throttle the remote threads without unlimited queues -- those are a sure way to OOME.
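For illustration only (not Infinispan's actual pool configuration; names and sizes here are made up), the difference shows up directly in how a plain {{ThreadPoolExecutor}} is queued:
{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A bounded queue surfaces overload as rejection/backpressure at submit time;
// an unbounded LinkedBlockingQueue would instead grow until the JVM dies with
// OutOfMemoryError.
public final class BoundedRemotePool {
    public static ExecutorService create(int threads, int maxBacklog) {
        return new ThreadPoolExecutor(
                threads, threads, 60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(maxBacklog),     // bounded backlog
                new ThreadPoolExecutor.AbortPolicy());    // fail fast, don't hoard tasks
    }
}
{code}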
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Sanne Grinovero (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Sanne Grinovero commented on ISPN-6799:
---------------------------------------
Rather than modelling each "need to reply" with a thread, couldn't we have a clever data structure which collects all the responses-to-be-sent?
I believe this was shared during past face-to-face meetings, so to recap the high-level concepts:
You'd have a limited number of threads periodically scanning this structure and sending the responses out in batches.
The collection would need to be:
- somewhat fairly ordered, to make sure that no response waits too long
- able to efficiently drop those responses which have already waited too long
- organised by target address, so it can be scanned efficiently for multiple answers which need to be batched to the same address: there's a high likelihood that many responses need to go in the same "direction"
Contention, context switches (and complexity) could be minimised by doing all this in a single thread over a disruptor-style queue, at most sharding by a combination of segment-id and target-address to achieve multiple "shared nothing" structures, maintaining each queue as a fully independent highway lane.
This would also remove some pressure from the batching done at the JGroups level, as many small messages would already be handed to JGroups "packed by target"; we might even consider disabling the JGroups batching features, which would improve latency.
On top of this, I believe such a design would allow us (in future) to enhance JGroups so that it does not always need to hold on to the entire message payload for the retransmission table, but can instead keep a hook into Infinispan's "queue of outbound messages" and re-marshal on demand; this might also need an entry "snapshot" mechanism to re-transmit fragments of the original value if it changed, or a protocol improvement to instead undo the other fragments and re-send with the updated value.
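A minimal sketch of such a structure, with made-up names (nothing here is existing Infinispan or JGroups API):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.function.BiConsumer;

// Responses are grouped per target address, drained in batches by a single
// sender thread (disruptor-style: one consumer, so the drain side is
// contention-free), and dropped once past their deadline instead of sent.
final class ResponseBatcher<A> {
    private record Pending(Object response, long deadlineNanos) {}

    private final ConcurrentMap<A, Queue<Pending>> byTarget = new ConcurrentHashMap<>();
    private final BiConsumer<A, List<Object>> batchSender;  // e.g. wraps a JGroups send

    ResponseBatcher(BiConsumer<A, List<Object>> batchSender) {
        this.batchSender = batchSender;
    }

    /** Called from worker threads: never blocks, just records the response. */
    void enqueue(A target, Object response, long timeoutNanos) {
        byTarget.computeIfAbsent(target, t -> new ConcurrentLinkedQueue<>())
                .add(new Pending(response, System.nanoTime() + timeoutNanos));
    }

    /** Called periodically by the single sender thread. */
    void drain() {
        long now = System.nanoTime();
        byTarget.forEach((target, queue) -> {
            List<Object> batch = new ArrayList<>();
            Pending p;
            while ((p = queue.poll()) != null) {
                if (p.deadlineNanos() - now > 0) {    // drop responses that waited too long
                    batch.add(p.response());
                }
            }
            if (!batch.isEmpty()) {
                batchSender.accept(target, batch);    // one batched message per target
            }
        });
    }
}
{code}
Per-target FIFO queues give the rough fairness mentioned above, and sharding by segment-id/target-address would simply mean several independent instances of this structure.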
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Radim Vansa edited comment on ISPN-6799 at 6/27/16 7:14 AM:
------------------------------------------------------------
IIUC the queues in TPE ({{ThreadPoolExecutor}}), these don't work well for our purposes, as threads are added (above the core level) only when the queue is full. The behavior I'd prefer for a 'queued' TPE is to use the queue only once active == max threads (then it should queue the task in a possibly unlimited, or preferably timing-out, queue). Have you ever experimented with something like this? (There's no OOTB solution for this, though - you'd have to implement it through a {{RejectedExecutionHandler}}.)
was (Author: rvansa):
IIUC the queues in TPE ({{ThreadPoolExecutor}}), these don't work well for our purposes, as threads are added (above the core level) only when the queue is full. The behavior I'd prefer for a 'queued' TPE is to use the queue only once active == max threads (then it should queue the task in a possibly unlimited, or preferably timing-out, queue).
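A minimal sketch of the {{RejectedExecutionHandler}} approach Radim describes, using plain {{java.util.concurrent}} (the class name is illustrative, not an existing Infinispan class): core == max with a direct handoff forces new threads first, and the handler's timed re-offer stands in for the timing-out queue.
{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class ThreadsFirstPool {
    public static ThreadPoolExecutor create(int maxThreads, long queueTimeoutMs) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,        // core == max: always add threads first
                60L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),      // direct handoff, no buffering
                (task, pool) -> {              // invoked only when all threads are busy
                    try {
                        // Wait up to the timeout for a worker to become free.
                        if (!pool.getQueue().offer(task, queueTimeoutMs, TimeUnit.MILLISECONDS)) {
                            throw new RejectedExecutionException("no free worker within timeout");
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new RejectedExecutionException(e);
                    }
                });
    }
}
{code}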
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-6799:
-----------------------------------
No more context switches on the hot path, please... What about having async FC for responses only? If it's a response and we run out of credits, the response would be queued instead of blocking the thread. I think the core concept for future versions of Infinispan should be that only application threads can block.
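A minimal sketch of that idea, assuming a UFC-like credit counter (all names hypothetical; the real JGroups UFC keeps per-destination credits and more bookkeeping):
{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// "Async FC for responses only": when credits run out, park the response in a
// queue and retry on replenish, instead of blocking the sending thread.
final class AsyncResponseCredits {
    private final AtomicLong credits;
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();

    AsyncResponseCredits(long maxCredits) {
        this.credits = new AtomicLong(maxCredits);
    }

    /** Send immediately if credits allow, otherwise queue; never blocks. */
    void sendResponse(int size, Runnable doSend) {
        if (tryAcquire(size)) {
            doSend.run();
        } else {
            pending.add(() -> sendResponse(size, doSend));
        }
    }

    /** Called when the receiver replenishes our credits. */
    void replenish(long amount) {
        credits.addAndGet(amount);
        int backlog = pending.size();         // drain the current backlog once;
        for (int i = 0; i < backlog; i++) {   // still-unsendable sends re-queue themselves
            Runnable r = pending.poll();
            if (r == null) break;
            r.run();
        }
    }

    private boolean tryAcquire(long size) {
        long current;
        do {
            current = credits.get();
            if (current < size) return false;
        } while (!credits.compareAndSet(current, current - size));
        return true;
    }
}
{code}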
[JBoss JIRA] (ISPN-6733) XML Serializer does not serialize attributes for org.infinispan.persistence.cluster.ClusterLoader
by Anna Manukyan (JIRA)
[ https://issues.jboss.org/browse/ISPN-6733?page=com.atlassian.jira.plugin.... ]
Anna Manukyan updated ISPN-6733:
--------------------------------
Status: Pull Request Sent (was: Open)
Git Pull Request: https://github.com/infinispan/infinispan/pull/4423
> XML Serializer does not serialize attributes for org.infinispan.persistence.cluster.ClusterLoader
> -------------------------------------------------------------------------------------------------
>
> Key: ISPN-6733
> URL: https://issues.jboss.org/browse/ISPN-6733
> Project: Infinispan
> Issue Type: Bug
> Components: Configuration
> Reporter: Anna Manukyan
> Assignee: Anna Manukyan
>
> When the cluster loader is configured using the {{store}} tag, the specified attributes are not serialized to the new-version XML.
> The following XML:
> {code}
> .................................................
> <namedCache name="withClusterLoader1">
>    <persistence>
>       <store class="org.infinispan.persistence.cluster.ClusterLoader" preload="true" fetchPersistentState="true" ignoreModifications="true" purgeOnStartup="true" shared="true">
>          <properties>
>             <property name="remoteCallTimeout" value="15000" />
>          </properties>
>       </store>
>    </persistence>
> </namedCache>
> .....................
> {code}
> is serialized to:
> {code}
> ...............................
> <local-cache name="withClusterLoader1" statistics="false">
>    <persistence>
>       <cluster-loader remote-timeout="15000">
>          <property name="remoteCallTimeout">15000</property>
>       </cluster-loader>
>    </persistence>
> </local-cache>
> .............................
> {code}
[JBoss JIRA] (ISPN-6799) OOB thread pool fills with threads trying to send remote get responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-6799?page=com.atlassian.jira.plugin.... ]
Radim Vansa updated ISPN-6799:
------------------------------
Description:
Note: This is a scenario that happens in the stress tests, with 4 nodes in dist mode, and 200+ threads per node doing only reads. I have not been able to reproduce it locally, even with a much lower OOB thread pool size and UFC.max_credits.
We don't use the {{NO_FC}} flag, so threads sending both requests and responses can block in UFC/MFC. Remote gets are executed directly on the OOB thread, so when we run out of credits for one node, the OOB pool can quickly become full with threads waiting to send a remote get response to that node.
While we can't send responses to that node, we won't send credits to it, either, as credits are only sent *after* the message has been processed by the application. That means OOB threads on all nodes will start blocking, trying to send remote get responses to us.
This is made worse by our staggering of remote gets. As remote get responses block, the stagger timeout kicks in and we send even more remote gets, making it even harder for the system to recover.
UFC/MFC can send a {{CREDIT_REQUEST}} message to ask for more credits. The {{REPLENISH}} messages are handled on JGroups' internal thread pool, so they are not blocked. However, the CREDIT_REQUEST can be sent at most once every {{UFC.max_block_time}} ms, so they can't be relied on to provide enough credits. With the default settings, the throughput would be {{max_credits / max_block_time == 2mb / 0.5s == 4mb/s}}, which is really small compared to regular throughput.
was:
Note: This is a scenario that happens in the stress tests, with 4 nodes in dist mode, and 200+ threads per node doing only reads. I have not been able to reproduce it locally, even with a much lower OOB thread pool size and UFC.max_credits.
We don't use the {{NO_FC}} flag, so threads sending both requests and responses can block in UFC/MFC. Remote gets are executed directly on the OOB thread, so when we run out of credits for one node, the OOB pool can quickly become full with threads waiting to send a remote get response to that node.
While we can't send responses to that node, we won't send credits to it, either, as credits are only sent *after* the message has been processed by the application. That means OOB threads on all nodes will start blocking, trying to send remote get responses to us.
This is made worse by our staggering of remote gets. As remote get responses block, the stagger timeout kicks in and we send even more remote gets, making it even harder for the system to recover.
UFC/MFC can send a {{CREDIT_REQUES}}T message to ask for more credits. The {{REPLENISH}} messages are handled on JGroups' internal thread pool, so they are not blocked. However, the CREDIT_REQUEST can be sent at most once every {{UFC.max_block_time}} ms, so they can't be relied on to provide enough credits. With the default settings, the throughput would be {{max_credits / max_block_time == 2mb / 0.5s == 4mb/s}}, which is really small compared to regular throughput.