[jboss-jira] [JBoss JIRA] (JGRP-1675) Threads stuck in FlowControl.decrementIfEnoughCredits

Wed Oct 30 12:02:03 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12825808#comment-12825808 ] 

Bela Ban edited comment on JGRP-1675 at 10/30/13 12:00 PM:
-----------------------------------------------------------

Tried to replicate the stress test in JGroups ({{RemoteGetStressTest}} is attached). With 500 threads, the test completes in 60 to 90 seconds. This is with a random delay of 1 to 10 ms when de-serializing the BigObject, and with a DISCARD up-rate of 20%.

If {{USE_SLEEPS}} is changed to false, the test finishes within a few seconds (between 2 and 5 secs).  This is with an OOB pool of min=1/max=5/no-queue.

The delaying and discarding certainly slow things down, but I never experienced a deadlock.
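
For readers unfamiliar with what a min=1/max=5/no-queue pool implies, here is a minimal sketch using only the JDK's {{ThreadPoolExecutor}}. It is not JGroups code: the class name {{NoQueuePoolSketch}}, the 30-second keep-alive and the use of the JDK's default rejection behaviour are assumptions made for illustration, and JGroups' own rejection policy is configurable and not modelled here. The point is only that once all 5 threads are blocked, a pool without a queue cannot accept further work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not part of JGroups or of RemoteGetStressTest.
public class NoQueuePoolSketch {

    /**
     * Submits `tasks` tasks that all block until released to a pool with
     * min=1, max=5 and no queue, and returns how many were rejected.
     */
    public static int countRejected(int tasks) {
        // A SynchronousQueue has no capacity, so this models "no-queue".
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 5, 30, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task blocks, standing in for a thread stuck waiting for credits.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // All 5 threads are busy and there is no queue to park the task in.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // unblock the workers so the pool can shut down
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected=" + countRejected(8));
    }
}
```

With 8 blocking tasks, the first 5 each get a thread and the remaining 3 are rejected.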

> Threads stuck in FlowControl.decrementIfEnoughCredits
> -----------------------------------------------------
>
>                 Key: JGRP-1675
>                 URL: https://issues.jboss.org/browse/JGRP-1675
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.4
>            Reporter: Radim Vansa
>            Assignee: Bela Ban
>             Fix For: 3.5
>
>         Attachments: RemoteGetStressTest.java
>
>
> I have recently and repeatedly observed a situation where many (or all) threads were stuck waiting for credits in the FlowControl protocol.
> The credit request was not handled on the other node because it is a non-OOB message, and some messages sent before it (actually many of them, cause unknown) had been lost; the request was therefore waiting for them to be retransmitted.
> However, these were not retransmitted because the retransmission request was never received: all OOB threads were stuck in the FlowControl protocol, as they had handled some other request and were trying to send a response, but the response could not be sent until FlowControl got credits.
> The probability of such a situation could be lowered by tagging the credit request as OOB, so that it would be handled immediately. If the credit replenish message were then processed in the regular OOB pool, that pool could already be depleted by many requests, but setting up the internal thread pool would solve the problem.
> Another option would be to release a thread from FlowControl (let it send the message even without credits) if it waits there for too long.
> h3. Workaround
> It appears that setting MFC and UFC max_credits to 10M, or removing these protocols altogether, is a workaround for this issue.
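
The reporter's suggestion of releasing a sender that has waited too long for credits can be sketched as follows. This is a simplified, hypothetical model and not the JGroups FlowControl implementation: the class {{CreditSketch}}, the method names {{decrementOrTimeOut}} and {{replenish}}, and the {{maxBlockMs}} parameter are all invented for illustration. A {{false}} return means the wait timed out, at which point a caller could decide to send the message without credits.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of credit-based flow control with a bounded wait.
public class CreditSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private long credits;

    public CreditSketch(long initialCredits) {
        this.credits = initialCredits;
    }

    /**
     * Waits up to maxBlockMs for `bytes` credits. Returns true if the credits
     * were deducted, false if the wait timed out or was interrupted, in which
     * case no credits are deducted and the caller is released.
     */
    public boolean decrementOrTimeOut(long bytes, long maxBlockMs) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(maxBlockMs);
        lock.lock();
        try {
            while (credits < bytes) {
                if (remainingNanos <= 0)
                    return false; // waited too long: release the thread
                try {
                    // awaitNanos returns the time left, so spurious wake-ups are handled.
                    remainingNanos = creditsAvailable.awaitNanos(remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            credits -= bytes;
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Called when a credit replenishment arrives from the receiver. */
    public void replenish(long bytes) {
        lock.lock();
        try {
            credits += bytes;
            creditsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public long credits() {
        lock.lock();
        try {
            return credits;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        CreditSketch fc = new CreditSketch(1000);
        System.out.println(fc.decrementOrTimeOut(800, 50)); // enough credits available
        System.out.println(fc.decrementOrTimeOut(800, 50)); // only 200 left: times out
        fc.replenish(1000);
        System.out.println(fc.decrementOrTimeOut(800, 50)); // succeeds after replenishment
    }
}
```

The bounded wait trades strict flow control for liveness: a sender that times out may overrun the receiver, but it can no longer hold a pool thread indefinitely.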
