[jboss-jira] [JBoss JIRA] Commented: (JGRP-464) Credit request storms from flow control protocols

Mon Apr 16 16:09:56 EDT 2007

    [ http://jira.jboss.com/jira/browse/JGRP-464?page=comments#action_12359491 ] 

Brian Stansberry commented on JGRP-464:
---------------------------------------

A couple notes on this:

1) The tests I mentioned were of HTTP session replication where the client drivers were capable of sending 500 concurrent requests per server, and the Tomcat thread pool was large enough to accommodate that.  There were 6 servers in the cluster, so overall the cluster would be servicing 3000 clients. The intent of the test was to deliberately overload the system for sustained periods.

The effect of this is that you could have hundreds of threads blocking in FC.handleDownMessage().  This was tested with Branch_2_4, which uses concurrent.jar's ReentrantLock, which is not fair.  So, it's possible for a thread to be waiting for credit, and credit is received, but other more recently arrived threads consume all the credits.  So, under sustained overload, it was not uncommon for threads to time out of the blocking wait. I would see dozens of these threads time out and then, one after the other and in short order, reacquire the lock and send a credit request.
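
To make the pattern above concrete, here is a minimal sketch of it. This is not the FC code: the class and method names (CreditStormSketch, send, run) are made up for illustration, it uses java.util.concurrent's ReentrantLock rather than concurrent.jar's, and it simplifies things so that a sender gives up after a single timeout instead of continuing to wait. No credit ever arrives in this sketch, so every blocked sender times out and issues its own credit request.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not JGroups code: senders block waiting for credit and,
// on timeout, each one independently "sends" a credit request.
class CreditStormSketch {
    // The no-arg constructor creates a non-fair lock.
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private long credits = 0; // assumption: the receiver never replenishes credit here
    final AtomicInteger creditRequestsSent = new AtomicInteger();

    // Blocks up to maxBlockMs for credit; on timeout, counts one credit request.
    void send(long maxBlockMs) {
        lock.lock();
        try {
            long remaining = TimeUnit.MILLISECONDS.toNanos(maxBlockMs);
            while (credits <= 0 && remaining > 0) {
                remaining = creditsAvailable.awaitNanos(remaining);
            }
            if (credits <= 0) {
                // Timed out: no coordination with other waiters, so every
                // timed-out thread ends up here and asks for credit.
                creditRequestsSent.incrementAndGet();
                return;
            }
            credits--;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
    }

    // Starts the given number of sender threads and returns how many
    // credit requests were issued in total.
    static int run(int senders, long maxBlockMs) {
        CreditStormSketch fc = new CreditStormSketch();
        Thread[] threads = new Thread[senders];
        for (int i = 0; i < senders; i++) {
            threads[i] = new Thread(() -> fc.send(maxBlockMs));
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return fc.creditRequestsSent.get();
    }

    public static void main(String[] args) {
        System.out.println("credit requests sent: " + run(50, 100));
    }
}
```

With N blocked senders and no incoming credit, the sketch issues N credit requests within roughly one timeout period, which is the "storm" shape described above.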

2) I was wrong about NAKACK ignoring headers in its max_xmit_size check. TBH, I don't know why I saw the UDP errors. :(  The error occurred very consistently with Branch_2_4 code.  The error message was an IOException indicating a packet was too large.  In any case, once I prevented the "credit request storms" I no longer saw the errors.

> Credit request storms from flow control protocols
> -------------------------------------------------
>
>                 Key: JGRP-464
>                 URL: http://jira.jboss.com/jira/browse/JGRP-464
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: Brian Stansberry
>         Assigned To: Bela Ban
>            Priority: Critical
>             Fix For: 2.5
>
>
> If an application has a set of threads that are trying to send more messages than the consumers can handle, request threads block in FC or SFC waiting for credits.  If they wait too long, they wake up and send a message requesting credit.  With the tests I'm running, large numbers of threads would block, and then one after another time out, wake up and ask for credit, all within a very short period.  Basically spamming the cluster asking for credit (and getting credit back for each request).
> That seems inefficient, but with SFC it was leading to error conditions.  Seems some other server's NAKACK requested retransmission of a set of messages, some or all of which were a large number of these "spam" credit requests.  The credit requests basically had no message body, just headers, so the NAKACK.max_xmit_size check wasn't assigning them any weight.  Effect was the resulting retransmission message was > 64K and couldn't be sent, resulting in an ERROR in UDP.
> Possible solution is to add a min_credit_request_interval such that a thread that wakes up from blocking will not request credit if another thread has already done so within the configured time.  For reference, my port of SFC to Branch_4_2 has an implementation of that concept.
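
As an illustration of the min_credit_request_interval idea, here is a minimal sketch. It is not the SFC port mentioned above: the class name CreditRequestThrottle and the method shouldRequest are hypothetical, and the caller passes in the current time (for example System.currentTimeMillis()) so the behaviour is easy to check.

```java
// Hypothetical sketch of a min_credit_request_interval gate, not JGroups code.
class CreditRequestThrottle {
    private final long minIntervalMs;  // plays the role of min_credit_request_interval
    private boolean requestedBefore = false;
    private long lastRequestMs = 0;

    CreditRequestThrottle(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    // Returns true if the calling thread should send a credit request now.
    // Returns false if some thread already sent one within the last
    // minIntervalMs milliseconds.
    synchronized boolean shouldRequest(long nowMs) {
        if (requestedBefore && nowMs - lastRequestMs < minIntervalMs) {
            return false;
        }
        requestedBefore = true;
        lastRequestMs = nowMs;
        return true;
    }

    public static void main(String[] args) {
        CreditRequestThrottle throttle = new CreditRequestThrottle(500);
        int sent = 0;
        // Simulate 50 threads timing out 1 ms apart, as in a request storm.
        for (long now = 1000; now < 1050; now++) {
            if (throttle.shouldRequest(now)) {
                sent++;
            }
        }
        System.out.println("credit requests sent: " + sent);
    }
}
```

A thread that wakes up from the blocking wait would call shouldRequest() before sending, and simply go back to waiting if it returns false. The interval probably needs to be shorter than the blocking timeout, otherwise a lost credit request could go unrepeated for longer than intended.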
