Bela Ban resolved JGRP-2143.
----------------------------
Resolution: Done
TP: use only one thread per member to pass up regular messages
--------------------------------------------------------------
Key: JGRP-2143
URL: https://issues.jboss.org/browse/JGRP-2143
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 4.0
This applies only to _regular_ messages; OOB and internal messages are processed by
passing them to the thread pool directly when they're received.
The processing of a message received from B is as follows:
* A regular message (or message batch) is assigned a thread from the thread pool and
passed up to the reliability protocol, e.g. NAKACK2 or UNICAST3.
* There it is added to the table for B.
* The thread sees if another thread is already delivering messages from B to the
application. If not, it grabs as many consecutive (ordered) messages from the table as it
can and delivers them to the application. Otherwise, it returns and can be assigned other
tasks.
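The steps above can be sketched roughly as follows. This is a simplified illustration, not the NAKACK2 or UNICAST3 code: the class SenderTableSketch, its fields and the receive() method are hypothetical names, messages are plain strings, and sequence numbers are assumed to start at 1.

```java
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical stand-in for the per-sender table of a reliability protocol. */
class SenderTableSketch {
    private final TreeMap<Long, String> table = new TreeMap<>(); // seqno -> message
    private final AtomicBoolean delivering = new AtomicBoolean(false);
    private final Consumer<String> application;
    private long nextSeqno = 1; // guarded by the table lock

    SenderTableSketch(Consumer<String> application) {
        this.application = application;
    }

    /** Runs on a pool thread for each message; returns how many messages this thread delivered. */
    int receive(long seqno, String msg) {
        synchronized (table) {          // every receiving thread contends on this lock
            table.put(seqno, msg);
        }
        int delivered = 0;
        do {
            if (!delivering.compareAndSet(false, true))
                return delivered;       // another thread is delivering: go back to the pool
            try {
                String next;
                while ((next = removeNext()) != null) { // consecutive messages only
                    application.accept(next);
                    delivered++;
                }
            } finally {
                delivering.set(false);
            }
        } while (hasNext());            // re-check for messages added while we were finishing
        return delivered;
    }

    private String removeNext() {
        synchronized (table) {
            String msg = table.remove(nextSeqno);
            if (msg != null)
                nextSeqno++;
            return msg;
        }
    }

    private boolean hasNext() {
        synchronized (table) {
            return table.containsKey(nextSeqno);
        }
    }
}
```

Note that every call to receive() occupies a pool thread and takes the table lock, even when it ends up delivering nothing.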
The problem here is that more than one thread may be passing up messages from a given
sender B; only at the NAKACK2 or UNICAST3 level will a single thread be selected to
deliver the messages to the application.
This causes higher thread pool usage than required, with all of its drawbacks, e.g. more
context switching, higher contention on adding messages to the table for B, and possibly
exhaustion of the thread pool.
An example of where service is denied or delayed:
* We have a cluster of \{A,B,C,D\}
* A receives 10 messages from B, 4 from C and 1 from D
* The thread pool's max size is 20
* The 10 messages from B are processed; all 10 threads add their messages to the table,
but only 1 delivers them to the application and the other 9 return to the pool
* 4 messages from C are added to C's table, 1 thread delivers them and 3 return
* The 1 message from D is added to D's table and the same thread is used to deliver
the message up the stack to the application
So while we receive 15 messages, effectively only 3 threads are needed to deliver them to
the application: as these are regular messages, they need to be delivered in _sender
order_.
The 9 threads which process messages from B only add them to B's table and
then return immediately. This causes increased context switching, plus more contention on
B's table (which is synchronized), and possibly exhaustion of the thread pool. For
example, if the pool's max size were only 10, then processing the first 10 messages
from B would exhaust the pool, and the other messages from C and D would be processed in
newly spawned threads.
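The arithmetic of this example can be checked with a few lines of Java. The class ThreadAccounting and its method names are made up for this illustration; they only count threads under the behavior described above (one pool thread per received message, at most one delivering thread per sender).

```java
import java.util.Map;

/** Hypothetical helper that counts pool threads for the example above. */
class ThreadAccounting {
    /** One pool thread is taken for every received message. */
    static int threadsTaken(Map<String, Integer> messagesPerSender) {
        return messagesPerSender.values().stream().mapToInt(Integer::intValue).sum();
    }

    /** At most one thread per sender delivers to the application. */
    static int deliveringThreads(Map<String, Integer> messagesPerSender) {
        return (int) messagesPerSender.values().stream().filter(n -> n > 0).count();
    }

    /** Threads that only add to a table and then return to the pool. */
    static int addOnlyThreads(Map<String, Integer> messagesPerSender) {
        return threadsTaken(messagesPerSender) - deliveringThreads(messagesPerSender);
    }
}
```

For 10 messages from B, 4 from C and 1 from D, this gives 15 threads taken, 3 delivering and 12 that merely add to a table.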
SOLUTION
* (Only applicable to _regular_ messages)
* When a message (or batch) from sender P is received, we check if another thread is
already passing up messages from P. If not, we pass the message up by grabbing a thread
from the thread pool. This will add the message to P's table and deliver as many
messages (from the table) as possible to the application.
* If there's currently a thread delivering P's messages, we simply add the message
(or batch) to a queue for P and return.
* When the delivery thread returns, it checks the queue for P and delivers all queued
messages, or returns if the queue is empty.
* (The queue is actually a MessageBatch, and new messages are simply appended to it. On
delivery, the batch is cleared)
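A minimal sketch of the per-sender queue described above, under some assumptions: SenderQueue, submit() and drain() are hypothetical names, a plain list stands in for the MessageBatch, and the caller is expected to run the actual pass-up on a pool thread when submit() returns true. It is not the code in TP.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-sender queue; a List stands in for the MessageBatch. */
class SenderQueue<M> {
    private final List<M> batch = new ArrayList<>();
    private boolean delivering; // true while a thread is passing up this sender's messages

    /**
     * Returns true if the caller must pass the message up itself (on a pool thread),
     * false if the message was queued for the thread that is already delivering.
     */
    synchronized boolean submit(M msg) {
        if (delivering) {
            batch.add(msg);     // cheap append, no pool thread needed
            return false;
        }
        delivering = true;
        return true;
    }

    /**
     * Called by the delivery thread when it has finished passing messages up.
     * Returns the queued messages and clears the batch; if nothing is queued,
     * gives up the delivery role and returns an empty list.
     */
    synchronized List<M> drain() {
        if (batch.isEmpty()) {
            delivering = false;
            return List.of();
        }
        List<M> queued = new ArrayList<>(batch);
        batch.clear();
        return queued;
    }
}
```

The delivery thread would call drain() in a loop and pass up each returned list until it gets an empty one. If ten messages from one sender arrive back to back, the first submit() returns true and the other nine are appended to the batch, so a single pool thread handles all ten.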
The effects of this for regular messages are
* Fewer threads: the thread pool only has a max of <cluster-members> threads for
regular messages where <cluster-members> is the number of members in the cluster
from whom we are concurrently receiving messages. E.g. for a cluster \{A,B,C,D\}, if
we're receiving messages at the same time from all members, then the max size is 4.
** Of course, OOB and internal messages, plus timer tasks will add to this number.
* Less contention on the table for a given member: instead of 10 threads all adding their
messages to B's table (contention on the table lock) and then CASing a boolean, only 1
thread ever adds and removes messages to/from the table. This means uncontended (= fast)
lock acquisition for regular messages (of course, if we use OOB messages, then we do have
contention).
* Appending to a batch is much faster than adding to a table
* The downside is that we're actually storing messages twice: once in the batch for P
and once in P's table. But these are arrays of pointers, so not a lot of memory is
required.
Example: for the 10 messages from B above, 1 thread is grabbed from the pool to deliver
the first message, and the other 9 messages are appended to a batch in B's queue. When
the thread is done, it will grab the message batch of 9, add it to the table and deliver it.