<div dir="auto">Connections stuck in CLOSE_WAIT are a symptom of the client side not shutting down properly [1]. I assume you have no control over your client traffic. Thus, the best you can do is (a) raise the fd limit to an unreasonably high number and make sure you have plenty of RAM, or (b) put a load balancer in front of Undertow.<div dir="auto"><br></div><div dir="auto">Truth be told, we faced the same issue in GCP; unfortunately, GLB still doesn't support a max connection limit per backend, afaik. The next best thing is implementing a custom netfilter using iptables or eBPF. The level of effort is rather high. Note that we're essentially talking about mitigations for a low-level DDoS (perhaps a non-malicious one).</div><div dir="auto"><br></div><div dir="auto">If all of the above fail, yet another option is to use HTTP/2. I believe Undertow supports it, although I have no experience using it.</div><div dir="auto"><br></div><div dir="auto">Best,</div><div dir="auto"><br></div><div dir="auto">stan<br><div dir="auto"><div dir="auto"><br></div><div dir="auto">[1] <a href="https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/" target="_blank" rel="noreferrer">https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/</a></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 2, 2020, 8:13 AM Nishant Kumar <<a href="mailto:nishantkumar35@gmail.com" target="_blank" rel="noreferrer">nishantkumar35@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto">It feels like CLOSE_WAIT connections are not getting closed properly, although I am not sure. If we reduce NO_REQUEST_TIMEOUT to a small value, I can see that <span style="font-family:sans-serif">the CLOSE_WAIT count is comparatively lower (the number decreases faster) but still very high overall. 
</span></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 2, 2020, 2:29 PM Nishant Kumar <<a href="mailto:nishantkumar35@gmail.com" rel="noreferrer noreferrer" target="_blank">nishantkumar35@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Generally, apart from the usual fatal conditions, clients also close the connection after a few thousand requests. There might be other cases too, but I am not aware of them. They keep initiating new connections if we do not respond within the threshold time frame. This is a server-to-server communication system. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 2, 2020 at 10:26 AM Stuart Douglas <<a href="mailto:sdouglas@redhat.com" rel="noreferrer noreferrer noreferrer" target="_blank">sdouglas@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>This sounds like a bug: when the client closes the connection, it should wake up the read listener, which will read -1 and then cleanly close the socket.</div><div><br></div><div>Are the clients closing idle connections or connections processing a request?</div><div><br></div><div>Stuart</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 2 Mar 2020 at 14:31, Nishant Kumar <<a href="mailto:nishantkumar35@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">nishantkumar35@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I agree that it's a load-balancing issue, but we can't do much about it at the moment.<br><br>I still see issues after using the latest XNIO (3.7.7) with Undertow. 
What I have observed is that when there is a spike in requests and CONNECTION_HIGH_WATER is reached, the server stops accepting new connections, as expected. The clients then start closing their connections because of the delay (we have a strict low-latency requirement of < 100 ms) and try to create new connections (which will also not be accepted). However, the server has not yet closed the old connections (NO_REQUEST_TIMEOUT = 6000), so there is a high number of CLOSE_WAIT connections at that moment. My understanding is that the server counts CLOSE_WAIT + ESTABLISHED connections towards CONNECTION_HIGH_WATER. <div><br></div><div>Is there a way to close all CLOSE_WAIT connections at that moment so that the connection count drops below CONNECTION_HIGH_WATER and we start responding to newly established connections? Or are there any other suggestions? I have tried removing CONNECTION_HIGH_WATER and relying on the FD limit, but that didn't work.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Mar 1, 2020 at 7:47 AM Stan Rosenberg <<a href="mailto:stan.rosenberg@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">stan.rosenberg@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div style="font-size:small">On Sat, Feb 29, 2020 at 8:18 PM Nishant Kumar <<a href="mailto:nishantkumar35@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">nishantkumar35@gmail.com</a>> wrote:<br></div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Thanks for the reply. I am running it under supervisord, and I have updated the open file limit in the supervisord config. The problem seems to be the same as what @Carter mentioned. It happens mostly during a sudden traffic spike, followed by a sudden increase (~30k-300k) in TIME_WAIT sockets. 
</div></blockquote><div><br></div><div style="font-size:small">The changes in <a href="https://github.com/xnio/xnio/pull/206/files#diff-23a6a7997705ea72e4016c11bf9d214bR453" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/xnio/xnio/pull/206/files#diff-23a6a7997705ea72e4016c11bf9d214bR453</a> are likely to improve the exceptional case of exceeding the file descriptor limit. However, if you're already setting the limit very high (e.g., in our case it was 795588), then exceeding it is a symptom of not properly load-balancing your traffic; with that many connections, you'd better have a ton of free RAM available. </div></div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr">Nishant Kumar<br>Bangalore, India<br>Mob: +91 80088 42030<br>Email: <a href="mailto:nishantkumar35@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">nishantkumar35@gmail.com</a></div></div>
_______________________________________________<br>
undertow-dev mailing list<br>
<a href="mailto:undertow-dev@lists.jboss.org" rel="noreferrer noreferrer noreferrer" target="_blank">undertow-dev@lists.jboss.org</a><br>
<a href="https://lists.jboss.org/mailman/listinfo/undertow-dev" rel="noreferrer noreferrer noreferrer noreferrer" target="_blank">https://lists.jboss.org/mailman/listinfo/undertow-dev</a></blockquote></div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>
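<div dir="auto">The read-side behaviour described in the thread (the peer closes, the next read reports end-of-stream, and the server then closes its own socket) can be illustrated with a minimal sketch. This is not Undertow or XNIO code; it uses Python's standard <code>socket</code> module on the loopback interface, and the function names <code>read_until_eof_then_close</code> and <code>demo</code> are made up for illustration. Java reports end-of-stream as a read of -1, while Python's <code>recv</code> returns an empty bytes object. Until the server side calls <code>close()</code>, the kernel keeps its end of the connection in CLOSE_WAIT.</div>

```python
import socket


def read_until_eof_then_close(conn: socket.socket) -> int:
    """Drain the connection until the peer's FIN is seen, then close our side.

    Returns the number of payload bytes read before end-of-stream.
    """
    total = 0
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                # Empty read: the peer has closed. Our socket is now in
                # CLOSE_WAIT and stays there until we close it ourselves.
                break
            total += len(chunk)
    finally:
        # Closing here is what moves the socket out of CLOSE_WAIT.
        conn.close()
    return total


def demo() -> int:
    """Open a loopback connection, let the client close first, drain the server side."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    client.sendall(b"ping")
    client.close()  # client sends FIN; server_side enters CLOSE_WAIT

    received = read_until_eof_then_close(server_side)
    listener.close()
    return received


if __name__ == "__main__":
    print(demo())
```

<div dir="auto">If the server never performs that final <code>close()</code>, for example because nothing reads from the socket after the peer has gone away, the descriptor stays allocated, which would be consistent with the fd pressure discussed above.</div>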