[
https://issues.jboss.org/browse/JGRP-2143?page=com.atlassian.jira.plugin....
]
Bela Ban updated JGRP-2143:
---------------------------
Description:
This applies only to _regular_ messages; OOB and internal messages are processed by
passing them to the thread pool directly when they've been received.
The processing of a message received from B is as follows:
* A regular message (or message batch) is assigned a thread from the thread pool and
passed up to the reliability protocol, e.g. NAKACK2 or UNICAST3.
* There it is added to the table for B.
* The thread sees if another thread is already delivering messages from B to the
application. If not, it grabs as many consecutive (ordered) messages from the table and
delivers them to the application. Otherwise, it returns and can be assigned other tasks.
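In code, the current pattern looks roughly like this (a minimal, self-contained sketch; the real logic lives in NAKACK2/UNICAST3 and uses org.jgroups.util.Table, so all names below are made up):
{code:java}
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the current per-sender delivery pattern.
class SenderState {
    private final SortedMap<Long,String> table=new TreeMap<>(); // seqno -> message
    private final AtomicBoolean delivering=new AtomicBoolean(false);
    private long next=1; // next seqno to deliver, guarded by the table's lock

    // Called by *every* thread-pool thread that received a message from this sender
    void process(long seqno, String msg) {
        synchronized(table) {                  // contended: all threads lock the table
            table.put(seqno, msg);
        }
        // Only one thread wins the CAS and delivers; the rest return to the pool
        while(delivering.compareAndSet(false, true)) {
            try {
                for(String m; (m=removeNext()) != null;)
                    deliver(m);                // deliver in sender order
            }
            finally {
                delivering.set(false);
            }
            synchronized(table) {              // re-check: a message may have been
                if(!table.containsKey(next))   // added while the flag was still set
                    return;
            }
        }
    }

    private String removeNext() {
        synchronized(table) {
            String m=table.remove(next);
            if(m != null)
                next++;
            return m;
        }
    }

    private void deliver(String m) {
        System.out.println("delivered " + m);
    }
}
{code}
Note that every receiving thread contends on the table lock and on the CAS, even though at most one of them ends up delivering.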
The problem here is that more than one thread may be passing up messages from a given
sender B; only at the NAKACK2 or UNICAST3 level will a single thread be selected to
deliver the messages to the application.
This causes higher thread pool usage than required, with all of its drawbacks, e.g. more
context switching, higher contention on adding messages to the table for B, and possibly
exhaustion of the thread pool.
An example of where service is denied or delayed:
* We have a cluster of \{A,B,C,D\}
* A receives 10 messages from B, 4 from C and 1 from D
* The thread pool's max size is 20
* The 10 messages from B are processed; all 10 threads add their messages to the table,
but only 1 delivers them to the application and the other 9 return to the pool
* 4 messages from C are added to C's table, 1 thread delivers them and 3 return
* The 1 message from D is added to D's table and the same thread is used to deliver
the message up the stack to the application
So while we receive 15 messages, effectively only 3 threads are needed to deliver them to
the application: as these are regular messages, they need to be delivered in _sender
order_.
The 9 threads which process messages from B only add them to B's table and then
return immediately. This causes increased context switching, plus more contention on
B's table (which is synchronized), and possibly exhaustion of the thread pool. For
example, if the pool's max size were only 10, then processing the first 10 messages
from B would exhaust the pool, and the other messages from C and D would be processed in
newly spawned threads.
SOLUTION
* (Only applicable to _regular_ messages)
* When a message (or batch) from sender P is received, we check if another thread is
already passing up messages from P. If not, we pass the message up by grabbing a thread
from the thread pool. This will add the message to P's table and deliver as many
messages (removed from the table) as possible to the application.
* If there's currently a thread delivering P's messages, we simply add the message
(or batch) to a queue for P and return.
* When the delivery thread returns, it checks the queue for P and delivers all queued
messages, or returns if the queue is empty.
* (The queue is actually a MessageBatch, and new messages are simply appended to it. On
delivery, the batch is cleared)
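A minimal sketch of this scheme, assuming one such queue per sender P (names and types are made up for illustration; the real implementation would live in TP and use a MessageBatch instead of a list):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical sketch of the proposed scheme, one instance per sender P.
class SenderQueue {
    private final List<String> batch=new ArrayList<>(); // stands in for a MessageBatch
    private final AtomicBoolean busy=new AtomicBoolean(false);
    private final Executor pool;
    private final Consumer<List<String>> passUp; // adds to P's table, then delivers

    SenderQueue(Executor pool, Consumer<List<String>> passUp) {
        this.pool=pool;
        this.passUp=passUp;
    }

    // Called for every regular message received from P
    void receive(String msg) {
        synchronized(batch) {
            batch.add(msg);                      // cheap append, no table lock
            if(!busy.compareAndSet(false, true))
                return;                          // a thread is already delivering for P
        }
        pool.execute(this::deliverLoop);         // grab exactly one pool thread for P
    }

    private void deliverLoop() {
        for(;;) {
            List<String> drained;
            synchronized(batch) {
                if(batch.isEmpty()) {            // queue empty: release and return
                    busy.set(false);
                    return;
                }
                drained=new ArrayList<>(batch);  // grab the whole batch ...
                batch.clear();                   // ... and clear it
            }
            passUp.accept(drained);              // add to P's table, deliver in order
        }
    }
}
{code}
The synchronized append plus the CAS ensure that no message is lost and that at most one thread per sender is ever inside passUp at a time.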
The effects of this for regular messages are:
* Fewer threads: the thread pool uses at most <cluster-members> threads for
regular messages, where <cluster-members> is the number of members in the cluster
from whom we are concurrently receiving messages. E.g. for a cluster \{A,B,C,D\}, if
we're receiving messages at the same time from all members, then the max is 4.
** Of course, OOB and internal messages, plus timer tasks will add to this number.
* Less contention on the table for a given member: instead of 10 threads all adding their
messages to B's table (contention on the table lock) and then CASing a boolean, only 1
thread ever adds and removes messages to/from the table. This means uncontended (= fast)
lock acquisition for regular messages (of course, if we use OOB messages, then we do have
contention).
* Appending to a batch is much faster than adding to a table
* The downside is that we're actually storing messages twice: once in the batch for P
and once in P's table. But these are arrays of pointers, so not a lot of additional
memory is required.
Example: for the 10 messages from B above, 1 thread is grabbed from the pool to deliver
the first message, while the other 9 messages are appended to a batch in B's queue. When
the delivery thread is done, it grabs the batch of 9, adds it to the table and delivers it.
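Putting the two sketches together, a small driver (purely hypothetical, just to make this example concrete) shows the effect: 10 messages from B arrive on 10 receiver threads, but at most one pool thread at a time delivers them (ordering by seqno is done by B's table in the real stack; here we just print what gets passed up):
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Demo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool=Executors.newFixedThreadPool(20);
        SenderQueue b=new SenderQueue(pool, batch ->
            System.out.println(Thread.currentThread().getName() + " delivers " + batch));
        ExecutorService receivers=Executors.newFixedThreadPool(10);
        for(int i=1; i <= 10; i++) {
            final int seqno=i;
            receivers.execute(() -> b.receive("B#" + seqno));
        }
        receivers.shutdown();
        receivers.awaitTermination(5, TimeUnit.SECONDS);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
{code}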
TP: use only one thread per member to pass up regular messages
--------------------------------------------------------------
Key: JGRP-2143
URL:
https://issues.jboss.org/browse/JGRP-2143
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 4.0
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)