[jboss-jira] [JBoss JIRA] (JGRP-1675) CreditRequest in FlowControl is not OOB

Tue Sep 17 10:49:04 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805131#comment-12805131 ] 

Radim Vansa edited comment on JGRP-1675 at 9/17/13 10:48 AM:
-------------------------------------------------------------

A shouldn't need to send the request, because B sends a replenishment to A as soon as it sees that A has sent enough messages to B to push the credit count below the threshold. However, if B does not receive those messages (probably not because of a network failure but because the OOB thread pool is depleted), it won't find out that A is short of credits.
Eventually, B should send an XMIT_REQ to A and A should resend the messages, but if B's OOB pool is almost depleted, B will discard most of them because there won't be enough threads to process them.
Now, the question is why the OOB pool is depleted. Analyzing the stacks on B, I found that most of its OOB threads are stuck waiting for credits, so B is in the same position as A. Sometimes it is waiting for credits from another node, sometimes just from A, but there is a cycle.
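
To illustrate the depletion described above, here is a minimal sketch in plain Java. It does not use JGroups at all: the class name, the pool sizes and the use of a Semaphore as the credit counter are assumptions made only for this illustration. All workers of a fixed-size pool block waiting for credits; a replenishment task submitted to the same pool is queued behind them and never runs, while the same task submitted to a separate pool goes through.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model, not JGroups code: a Semaphore stands in for the credits.
public class PoolDepletionSketch {

    // Returns true if the replenishment task ran within timeoutMs.
    public static boolean replenishDelivered(boolean dedicatedPool, int poolSize, long timeoutMs) {
        ExecutorService oobPool = Executors.newFixedThreadPool(poolSize);
        // Either reuse the (soon to be exhausted) pool or use a separate one.
        ExecutorService replenishPool =
                dedicatedPool ? Executors.newSingleThreadExecutor() : oobPool;
        Semaphore credits = new Semaphore(0);            // no credits left
        CountDownLatch workersBusy = new CountDownLatch(poolSize);
        CountDownLatch replenished = new CountDownLatch(1);
        try {
            for (int i = 0; i < poolSize; i++) {
                oobPool.execute(() -> {
                    workersBusy.countDown();
                    try {
                        credits.acquire();               // blocks: waiting for credits
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            workersBusy.await();                         // every pool thread is now occupied
            replenishPool.execute(() -> {
                credits.release(poolSize);               // would unblock all waiting workers
                replenished.countDown();
            });
            return replenished.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            oobPool.shutdownNow();                       // interrupt any workers still blocked
            replenishPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool:    " + replenishDelivered(false, 4, 300));
        System.out.println("dedicated pool: " + replenishDelivered(true, 4, 2000));
    }
}
```

The model is much simpler than the real situation (one process, no retransmission, no network), so it only shows why a message that would break the cycle cannot be handled once every thread of the pool that should handle it is itself waiting for credits.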

I'm not claiming to have a full explanation of how the system got into this state (it sounds like petitio principii, right?), but it is happening.

Regarding the test: I understand that a minimal example is always better, but would a RadarGun benchmark plus the ispn and jgroups configs where this happens be sufficient?

> CreditRequest in FlowControl is not OOB
> ---------------------------------------
>
>                 Key: JGRP-1675
>                 URL: https://issues.jboss.org/browse/JGRP-1675
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.4
>            Reporter: Radim Vansa
>            Assignee: Bela Ban
>             Fix For: 3.4
>
>
> I have recently and repeatedly observed a situation where many (or all) threads were stuck waiting for credits in the FlowControl protocol.
> The credit request was not handled on the other node because it is a non-OOB message and some messages sent before it (actually many of them, cause unknown) had been lost; the request therefore had to wait for them to be re-sent.
> However, these were not re-sent properly because the retransmission request was not received: all OOB threads were stuck in the FlowControl protocol, since they had handled some other request and tried to send a response, but the response could not be sent until FlowControl got credits.
> The probability of this situation could be lowered by tagging the credit request as OOB, so that it would be handled immediately. If the credit replenish message were then processed in the regular OOB pool, that pool could already be depleted by many requests, but setting up the internal thread pool would solve the problem.
> Another option would be to release a thread from FlowControl (let it send the message even without credits) if it has waited there for too long.
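
The last suggestion in the description (releasing a thread that has waited too long) could look roughly like the sketch below. This is plain Java and not the JGroups FlowControl implementation; the class name, the Outcome enum and the maxBlockMs parameter are hypothetical names chosen for the example. The sender waits for credits for a bounded time and then proceeds without them instead of blocking forever.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a bounded credit wait; not the JGroups API.
public class BoundedCreditWait {

    public enum Outcome { SENT_WITH_CREDITS, SENT_WITHOUT_CREDITS }

    private final Semaphore credits;   // one permit per byte of credit
    private final long maxBlockMs;     // assumed upper bound on blocking time

    public BoundedCreditWait(int initialCredits, long maxBlockMs) {
        this.credits = new Semaphore(initialCredits);
        this.maxBlockMs = maxBlockMs;
    }

    // Waits up to maxBlockMs for enough credits, then sends anyway.
    public Outcome send(int bytes) {
        try {
            if (credits.tryAcquire(bytes, maxBlockMs, TimeUnit.MILLISECONDS)) {
                return Outcome.SENT_WITH_CREDITS;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Timed out (or interrupted): the thread is released, no credits are consumed.
        return Outcome.SENT_WITHOUT_CREDITS;
    }

    // Called when the receiver grants more credits.
    public void replenish(int bytes) {
        credits.release(bytes);
    }

    public static void main(String[] args) {
        BoundedCreditWait fc = new BoundedCreditWait(100, 50);
        System.out.println(fc.send(60));   // enough credits available
        System.out.println(fc.send(60));   // only 40 left: times out after 50 ms
        fc.replenish(60);
        System.out.println(fc.send(60));   // credits are back
    }
}
```

The trade-off is that a message sent without credits may overrun a slow receiver, which is what flow control is meant to prevent, so the timeout would have to be chosen large enough that it only fires when the sender would otherwise be stuck.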
