Re: [undertow-dev] File Serving Performance and Research Details

Monday, 10 September 2012

Jason,

    A question on the big memory buffer: can it be set up to use large page memory?
 That would have the added advantage of fewer TLB misses, and the memory would not be
swappable to disk.  Also, the newer Intel and AMD chips support 1GB HugeTLB pages, which
reduce TLB misses even further, according to testing done by Shak's team.  I have not yet
tried it on another workload, but we will be trying it on SPECjEnterprise2010 in
the near future.

Andy

----- Original Message -----
...
 From: "Jason Greene" <jason.greene(a)redhat.com>
 To: undertow-dev(a)lists.jboss.org
 Sent: Monday, September 10, 2012 3:47:32 PM
 Subject: [undertow-dev] File Serving Performance and Research Details

 Hello Everyone,

 On and off for the past couple of weeks I have been working on the
 file serving implementation in undertow. This led to lots and lots
 of benchmarking, which in turn led to a lot of bug and perf fixes
 in various areas of the web server and xnio.

 The outcome that seems to work the best for what we have available in
 Java NIO is a caching / sendfile mix approach. The maintenance of
 the cache is completely non-blocking and relies on a modified
 concurrent direct deque, which lets us delete in the middle. This
 allows an access list to be stored. In order to further reduce
 possible contention, we sample access only on every 5th request
 (requests % 5 == 0 = do stuff).
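
The sampling idea above can be sketched as follows. This is only an illustration, not the undertow code: the class name AccessSampler and its method are hypothetical, and the sketch assumes a single shared counter is acceptable. Only every 5th request returns true and would go on to touch the shared access list, so the other four avoid contending on it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of undertow: decides which requests
// are allowed to update the (contended) access list.
final class AccessSampler {
    // Interval taken from the message: sample every 5th request.
    private static final int SAMPLE_INTERVAL = 5;

    private final AtomicInteger requests = new AtomicInteger();

    /** Returns true for every 5th call; only those calls should update the access list. */
    boolean shouldSample() {
        // The remainder is still 0 for multiples of 5 after the counter wraps negative.
        return requests.incrementAndGet() % SAMPLE_INTERVAL == 0;
    }
}
```

A request handler would call shouldSample() once per request and move the entry in the access list only when it returns true.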

 Blocking transfer process (default)
 ------------------------------------
 1. If entry is cached jump to non-blocking cached transfer process
 2. Otherwise kick off the file operations to a task on a worker
 thread
 3. If this is a HEAD request, the task simply executes a stat call
 and returns the appropriate details. (Note that stat calls can block
 [metadata read], which is why it is handled the same way as a transfer
 [in a worker thread].)
 4. If the file has not been accessed at least 5 times recently, or
 there is no cache space, or the file is too big, it is
 transferred in blocking mode using FileChannel.transferTo, which
 under the hood uses sendfile or another OS-optimized file transfer
 call.
 5. Otherwise the file is buffered and cached and then transferred
 using gathering writes. The caching process will attempt to reclaim
 "older" cache space following an LRU-like approach.
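
A minimal sketch of the decision in steps 4 and 5, and of the blocking transferTo path. The class BlockingTransfer, the enum, and the parameters maxCacheableSize and freeCacheBytes are assumptions made for illustration; the message does not say how "too big" or "no cache space" are measured, and the real code also tries to reclaim older entries before giving up on caching.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the blocking path; names are not undertow's.
final class BlockingTransfer {
    enum Mode { SENDFILE, CACHE_THEN_WRITE }

    /** Picks the transfer mode using the three conditions from step 4. */
    static Mode choose(int recentHits, long fileSize, long maxCacheableSize, long freeCacheBytes) {
        if (recentHits < 5 || fileSize > maxCacheableSize || fileSize > freeCacheBytes) {
            return Mode.SENDFILE;
        }
        return Mode.CACHE_THEN_WRITE;
    }

    /** Copies the whole file to a blocking target; meant to run on a worker thread. */
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                long moved = in.transferTo(position, size - position, target);
                if (moved <= 0) {
                    break; // nothing more can be moved, e.g. the file was truncated
                }
                position += moved;
            }
            return position;
        }
    }
}
```

Whether transferTo actually maps to sendfile depends on the platform and on the target channel type; with a plain in-memory target it falls back to an ordinary copy.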

 Non-blocking transfer process (when cached)
 -------------------------------------------
 1. All cache entries are stored in blocks (slices) within a big
 direct memory buffer. This uses native memory outside of the Xmx
 settings of Java, and has the advantage that it can be written
 directly to a socket without copying.
 2. When they are retrieved they are reference counted as a group to
 prevent reclamation from corrupting the to-be-transferred state.
 3. An attempt is made to write all the buffers in one gathering
 write, which completes unless the socket send buffer fills.
 4. If the send buffer is full, an event listener is registered, and
 will be executed in async non-blocking fashion later.
 5. The remaining portion, if any, is then transferred, and the ref
 counts are restored.
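
The cached path above could look roughly like this. It is a sketch under assumptions, not the undertow implementation: CachedEntry, the poolOffset parameter, and the fixed slice size are hypothetical, and the reference count starts at 1 to stand for the cache's own hold on the entry. In Java NIO terms, writing several buffers in one call is a gathering write (GatheringByteChannel.write(ByteBuffer[])).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one cache entry stored as slices of a big direct buffer.
final class CachedEntry {
    private final ByteBuffer[] slices;
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = held by the cache itself

    private CachedEntry(ByteBuffer[] slices) {
        this.slices = slices;
    }

    /** Copies data into fixed-size slices of the pool, starting at poolOffset. */
    static CachedEntry store(ByteBuffer pool, int poolOffset, byte[] data, int sliceSize) {
        int count = (data.length + sliceSize - 1) / sliceSize;
        ByteBuffer[] slices = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            int dataOffset = i * sliceSize;
            int len = Math.min(sliceSize, data.length - dataOffset);
            ByteBuffer region = pool.duplicate();
            region.limit(poolOffset + dataOffset + len);
            region.position(poolOffset + dataOffset);
            ByteBuffer slice = region.slice(); // shares the pool's native memory
            slice.put(data, dataOffset, len);
            slice.flip();
            slices[i] = slice;
        }
        return new CachedEntry(slices);
    }

    /** Takes one reference for the whole group; returns null if already reclaimed. */
    ByteBuffer[] acquire() {
        int current;
        do {
            current = refs.get();
            if (current == 0) {
                return null;
            }
        } while (!refs.compareAndSet(current, current + 1));
        // Duplicates give each transfer its own position and limit.
        ByteBuffer[] views = new ByteBuffer[slices.length];
        for (int i = 0; i < slices.length; i++) {
            views[i] = slices[i].duplicate();
        }
        return views;
    }

    /** Drops one reference; returns true when the slices may be reclaimed. */
    boolean release() {
        return refs.decrementAndGet() == 0;
    }

    /** Issues one gathering write; returns true if every buffer was fully written. */
    static boolean writeOnce(GatheringByteChannel channel, ByteBuffer[] views) throws IOException {
        channel.write(views);
        for (ByteBuffer view : views) {
            if (view.hasRemaining()) {
                return false; // send buffer filled; retry when the channel is writable
            }
        }
        return true;
    }
}
```

When writeOnce returns false on a non-blocking socket, the caller would register a write listener and call it again with the same views, then call release() once the transfer finishes or fails.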

 Results
 -------
 On a dual-core Intel i7 system (Stuart's laptop), we easily get over
 100k requests per second on small files (808 bytes) using the
 loopback device. Testing a variety of sizes, we push around 1
 gigabyte per second overall. My older Core 2 Quad system (Q6700 CPU)
 does around 80k rps and around 700-800 MB/s. There are some limitations
 we run into with the load driver (currently using httperf). Httperf
 can only use one CPU, so HTTP pipelining (sending multiple requests
 on the same connection) is necessary to drive that level of load.
 Performance scales well with a large number of connections. I can
 drive close to the same traffic with 10k connections, but the
 connection setup time and maintenance add a bit of cost.

 Another interesting aspect is OS overhead. Tomaz was able to improve
 his results by using an Ethernet adapter instead of the loopback
 device, and multiple hosts. This is likely because the TCP stack was
 half as busy. Also, connection tracking in iptables has a big effect
 (around 5-6%), so disabling it helps quite a bit.

 Future Research Possibilities
 ----------------------------
 It appears we could support AIO and non-blocking logic across the
 board if we wrote native code that uses the Linux kernel interfaces.
 A big problem is that the filesystem must support non-blocking
 operations, and most don't across the board. XFS appears to, though,
 so it might be worth exploring AIO on XFS. We would still want to
 cache as above, though, because the interface only works with
 unbuffered direct I/O. The big thing we would be saving is the
 context switch for the hand-off to a worker thread.

 NIO also does some unnecessary locking due to its API design, and we
 have measured its impact under contention. At some point we could
 consider writing a simple portable native backend for XNIO that
 bypasses all of that. IMO we still need very good perf on standard
 NIO, so we should keep the focus on that for now.

 -Jason

 _______________________________________________
 undertow-dev mailing list
 undertow-dev(a)lists.jboss.org
 https://lists.jboss.org/mailman/listinfo/undertow-dev
