[ https://issues.jboss.org/browse/JGRP-1675?page=com.atlassian.jira.plugin.... ]
Radim Vansa edited comment on JGRP-1675 at 11/6/13 3:16 AM:
------------------------------------------------------------
I'd prefer JGroups to always work, rather than work with a probability that depends on the
INT:OOB thread pool ratio, message frequency, or dark magic. The INT thread pool is worth
much less if regular messages, which are non-blocking from the application's perspective,
can get a thread stuck and so lead to thread pool depletion. I agree with Dan that listing
messages according to flags would make the system more reliable.
[~belaban]: Could you specify what the fix is (since this is marked as fixed in 3.5)?
Marking the FlowControl messages as DONT_BUNDLE, telling ISPN developers to fix GET_FIRST
flooding, or something else? Does the comment about the two tests passing mean "passing as
long as these send the get with GET_ALL" (the number of targets doesn't matter)?
Threads stuck in FlowControl.decrementIfEnoughCredits
-----------------------------------------------------
Key: JGRP-1675
URL: https://issues.jboss.org/browse/JGRP-1675
Project: JGroups
Issue Type: Bug
Affects Versions: 3.4
Reporter: Radim Vansa
Assignee: Bela Ban
Fix For: 3.5
Attachments: jgroups-udp-radim.xml, RemoteGetStressTest.java, UPerf2.java
I have recently and repeatedly observed a situation where many (or all) threads were stuck
waiting for credits in the FlowControl protocol.
The credit request was not handled on the other node because it is a non-OOB message and
some messages sent before it (actually many of them, cause unknown) had been lost, so the
request was waiting for them to be re-sent.
However, these were not re-sent properly because the retransmission request was not
received: all OOB threads were stuck in the FlowControl protocol, since each had handled
some other request and tried to send a response, but the response could not be sent until
FlowControl got the credits.
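
The starvation described above can be illustrated with a minimal sketch that uses only
java.util.concurrent, not JGroups classes. The class and method names are hypothetical,
and a Semaphore stands in for the FlowControl credits (an assumption made for brevity).
Every thread of a fixed pool blocks waiting for a credit, and the task that would grant
the credits is queued behind them in the same pool, so it never gets a thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

public class CreditStarvationDemo {

    /** Returns true if the replenish task got a thread before waitMillis elapsed. */
    static boolean replenishRuns(int poolSize, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Semaphore credits = new Semaphore(0);            // the sender has no credits left
        AtomicBoolean replenished = new AtomicBoolean(false);

        // Each pool thread handles a request and then blocks while sending the response.
        for (int i = 0; i < poolSize; i++) {
            pool.execute(() -> {
                try {
                    credits.acquire();                   // stands in for the wait for credits
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // only shutdownNow() ends the wait
                }
            });
        }

        // The task that would unblock them is queued behind the blocked tasks.
        pool.execute(() -> {
            replenished.set(true);
            credits.release(poolSize);
        });

        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean result = replenished.get();
        pool.shutdownNow();                              // interrupt the stuck threads
        return result;
    }

    public static void main(String[] args) {
        // prints "replenish task ran: false"
        System.out.println("replenish task ran: " + replenishRuns(4, 200));
    }
}
```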
The probability of this situation could be lowered by tagging the credit request as OOB,
so that it would be handled immediately. If the credit replenish message were then
processed in the regular OOB pool, that pool could already be depleted by many requests,
but setting up the internal thread pool would solve the problem.
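
As a rough illustration of why a separate pool helps, the sketch below again uses plain
java.util.concurrent with hypothetical names (oobPool, internalPool) and a Semaphore as a
stand-in for credits. It is not how JGroups dispatches messages internally. The replenish
task runs on its own single-thread executor, so it still gets a thread even when every
thread of the other pool is blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class DedicatedPoolDemo {

    /** Returns true if all blocked senders obtained a credit within waitMillis. */
    static boolean sendersFinish(int poolSize, long waitMillis) {
        ExecutorService oobPool = Executors.newFixedThreadPool(poolSize);
        ExecutorService internalPool = Executors.newSingleThreadExecutor();
        Semaphore credits = new Semaphore(0);
        CountDownLatch done = new CountDownLatch(poolSize);

        // Deplete the request pool: every thread blocks waiting for a credit.
        for (int i = 0; i < poolSize; i++) {
            oobPool.execute(() -> {
                try {
                    credits.acquire();
                    done.countDown();                    // the response could be sent
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The replenish task does not compete for the depleted pool's threads.
        internalPool.execute(() -> credits.release(poolSize));

        boolean finished = false;
        try {
            finished = done.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        oobPool.shutdownNow();
        internalPool.shutdownNow();
        return finished;
    }

    public static void main(String[] args) {
        System.out.println("senders finished: " + sendersFinish(4, 2000));
    }
}
```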
Another option would be to release a thread from FlowControl (let it send the message
even without credits) if it has waited there for too long.
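
A bounded wait of this kind might look like the sketch below. TimedCreditGate and
maxBlockMillis are hypothetical names, and credits are counted per message rather than
per byte to keep the example short. It shows the idea only, not the FlowControl code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class TimedCreditGate {
    private final Semaphore credits;
    private final long maxBlockMillis;

    TimedCreditGate(int initialCredits, long maxBlockMillis) {
        this.credits = new Semaphore(initialCredits);
        this.maxBlockMillis = maxBlockMillis;
    }

    /**
     * Waits up to maxBlockMillis for one credit.
     * Returns true if a credit was obtained, false if the wait timed out
     * (or was interrupted); in both cases the caller proceeds with the send.
     */
    boolean awaitCredit() {
        try {
            return credits.tryAcquire(maxBlockMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Called when the receiver grants more credits. */
    void replenish(int n) {
        credits.release(n);
    }

    public static void main(String[] args) {
        TimedCreditGate gate = new TimedCreditGate(1, 50);
        System.out.println(gate.awaitCredit());  // true: one credit was available
        System.out.println(gate.awaitCredit());  // false: timed out, thread released anyway
    }
}
```

The trade-off is that a sender released this way exceeds its credit limit, so the
receiver may be sent more data than flow control was meant to allow.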
h3. Workaround
It appears that setting MFC and UFC max_credits to 10M, or removing these protocols
altogether, is a workaround for this issue.