----- Original Message -----
From: "Jason Greene" <jgreene(a)redhat.com>
To: "Andrig Miller" <anmiller(a)redhat.com>
Cc: "Jason Greene" <jason.greene(a)redhat.com>,
Sent: Monday, September 10, 2012 4:38:24 PM
Subject: Re: [undertow-dev] File Serving Performance and Research Details
On Sep 10, 2012, at 5:01 PM, Andrig Miller <anmiller(a)redhat.com> wrote:
> A question on the big memory buffer. Can that be set up to
> leverage large page memory? That would have the added
> advantage of fewer TLB misses, and it wouldn't be swappable to
> disk. Also, the newer Intel and AMD chips support 1GB HugeTLB,
> which even further reduces the TLB misses, from testing that
> has been done by Shak's team. I have yet to take another
> workload and try it, but we will be trying it on
> SPECjEnterprise2010 in the near future.
That's a good question. I am pretty sure the large page setting
affects non-heap allocation in Java, but I need to look at the code
to be sure.
I know that it covers the permanent generation, as people configure it wrong all the time
and get errors because they didn't include the permanent generation in their
calculations for the number of pages (my very old JVM tuning blog posts still get
questions even though they are more than two years old).
I'm just not sure of the behavior for what you are doing here. My guess is that the buffer
would use large pages if they are configured, but it's just a guess at this point.
> ----- Original Message -----
>> From: "Jason Greene" <jason.greene(a)redhat.com>
>> To: undertow-dev(a)lists.jboss.org
>> Sent: Monday, September 10, 2012 3:47:32 PM
>> Subject: [undertow-dev] File Serving Performance and Research
>> Hello Everyone,
>> On and off for the past couple of weeks I have been working on the
>> file serving implementation in undertow. This led to lots and lots
>> of benchmarking, which in turn led to a lot of bug and perf fixes
>> in various areas of the web server and xnio.
>> The outcome that seems to work the best for what we have available
>> in Java NIO is a caching / sendfile mix approach. The maintenance of
>> the cache is completely non-blocking and relies on a modified
>> concurrent direct deque, which lets us delete in the middle. This
>> allows an access list to be stored. In order to further reduce
>> possible contention, we sample access at 5 request intervals
>> (requests % 5 = do stuff).
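To make the sampling idea concrete, here is a minimal sketch of "touch the access list only on every 5th request". It is not the undertow code: the class and method names are made up, the counter is global rather than per entry (the mail does not say which), and the stock `ConcurrentLinkedDeque` stands in for the modified deque, so removal from the middle is a linear scan here instead of a direct unlink.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: records only every Nth access in a shared access list. */
final class SampledAccessLog {
    private static final int SAMPLE_INTERVAL = 5; // "requests % 5" from the mail

    private final AtomicLong requests = new AtomicLong();
    // Stand-in for the modified concurrent direct deque. The JDK deque has no
    // direct removal from the middle, so remove(Object) below is a linear scan.
    private final ConcurrentLinkedDeque<String> accessList = new ConcurrentLinkedDeque<>();

    /** Returns true when this request was sampled and the entry moved to the head. */
    boolean recordAccess(String path) {
        if (requests.incrementAndGet() % SAMPLE_INTERVAL != 0) {
            return false; // unsampled request: only the counter is touched
        }
        accessList.remove(path);   // drop the old position, if any
        accessList.addFirst(path); // most recently sampled entry goes first
        return true;
    }

    /** Least recently sampled entry, i.e. the first eviction candidate (or null). */
    String evictionCandidate() {
        return accessList.peekLast();
    }

    int trackedEntries() {
        return accessList.size();
    }
}
```

With this shape, four out of five requests never write to the shared list, which is the contention saving the sampling is after.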
>> Blocking transfer process (default)
>> 1. If the entry is cached, jump to the non-blocking cached transfer process
>> 2. Otherwise kick off the file operations to a task on a worker
>> 3. If this is a HEAD operation, the task simply executes a stat
>> and returns the appropriate details. (Note that stat calls can
>> [metadata read], which is why it's done the same as a transfer [in
>> 4. If the file has not been accessed at least 5 times recently, or
>> there is no cache space, or it is too big of a file, then it is
>> transferred in a blocking mode using FileChannel.transferTo, which
>> under the hood uses sendfile, or other OS file transfer optimized
>> 5. Otherwise the file is buffered and cached and then transferred
>> using gathering writes. The caching process will attempt to
>> "older" cache base following an LRU like approach.
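The choice between steps 4 and 5, plus the blocking `FileChannel.transferTo` path, can be sketched roughly as below. The class name, the 1 MB size limit and the `shouldCache` signature are assumptions for illustration; the mail only gives "at least 5 times", "no cache space" and "too big". The target is assumed to be in blocking mode, as it would be on a worker task.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of the blocking transfer path; not the undertow code. */
final class BlockingTransferSketch {
    static final int MIN_RECENT_HITS = 5;             // "at least 5 times recently"
    static final long MAX_CACHEABLE_BYTES = 1L << 20; // assumed value for "too big"

    /** Step 4 vs. step 5: decide whether a file should be buffered and cached. */
    static boolean shouldCache(int recentHits, long fileSize, long freeCacheBytes) {
        return recentHits >= MIN_RECENT_HITS
                && fileSize <= MAX_CACHEABLE_BYTES
                && fileSize <= freeCacheBytes;
    }

    /** Step 4: transferTo may move fewer bytes than requested, so loop until done. */
    static long transferFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // Assumes a blocking target and a file that does not shrink mid-transfer.
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    /** Usage example: copies an 808-byte temp file into memory, returns the byte count. */
    static int demo() {
        try {
            Path tmp = Files.createTempFile("sketch", ".bin");
            try {
                Files.write(tmp, new byte[808]);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                // An in-memory target cannot use sendfile; the JDK copies instead.
                transferFile(tmp, Channels.newChannel(sink));
                return sink.size();
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether `transferTo` actually ends up in sendfile depends on the target channel and the OS, so the demo only exercises the loop, not the zero-copy path.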
>> Non-blocking transfer process (when cached)
>> 1. All cache entries are stored in blocks (slices) within a big
>> direct memory buffer. This uses native memory outside of the Xmx
>> settings of Java, and has the advantage that it can be written
>> directly to a socket without copying.
>> 2. When they are retrieved they are reference counted as a group to
>> prevent reclamation from corrupting the to-be-transferred state.
>> 3. The buffers are written in a single gathering write
>> unless the socket send buffer fills.
>> 4. If the send buffer is full, an event listener is registered,
>> will be executed in async non-blocking fashion later
>> 5. The remaining portion, if any, is transferred, and the ref counts
>> are restored.
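A rough sketch of the cached path follows: slices carved out of one direct buffer, a reference count that pins them during a transfer, and the multi-buffer write from step 3, which in Java NIO maps to `GatheringByteChannel.write(ByteBuffer[])`. All names are hypothetical, the slice size is arbitrary, and the ref counting is simplified (a real cache would also have to refuse `acquire()` on an entry that is already being reclaimed). A `FileChannel` stands in for the socket so the example runs on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.GatheringByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of one cache entry; not the undertow code. */
final class CachedEntrySketch {
    private final ByteBuffer[] slices;
    private final AtomicInteger refs = new AtomicInteger();

    /** Step 1: carve fixed-size slices out of one big direct buffer and fill them. */
    CachedEntrySketch(ByteBuffer pool, int sliceSize, byte[] data) {
        int count = (data.length + sliceSize - 1) / sliceSize;
        slices = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            int start = i * sliceSize;
            int len = Math.min(sliceSize, data.length - start);
            ByteBuffer window = pool.duplicate();
            window.limit(start + len);
            window.position(start);
            slices[i] = window.slice(); // shares memory with the pool, no copy
            slices[i].put(data, start, len).flip();
        }
    }

    /** Step 2: pin the slices and hand out independent views for one transfer. */
    ByteBuffer[] acquire() {
        refs.incrementAndGet();
        ByteBuffer[] views = new ByteBuffer[slices.length];
        for (int i = 0; i < slices.length; i++) {
            views[i] = slices[i].duplicate(); // own position/limit, same memory
        }
        return views;
    }

    /** Step 5: unpin once the transfer has finished. */
    void release() {
        refs.decrementAndGet();
    }

    boolean reclaimable() {
        return refs.get() == 0;
    }

    /** Step 3: one gathering write; true when every slice was fully written. */
    static boolean writeOnce(GatheringByteChannel ch, ByteBuffer[] views) throws IOException {
        ch.write(views);
        // Buffers are drained in order, so the last one tells us whether we are done.
        return views.length == 0 || !views[views.length - 1].hasRemaining();
    }

    /** Usage example: writes an 808-byte entry to a temp file and checks the result. */
    static boolean demo() {
        byte[] data = new byte[808];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        CachedEntrySketch entry =
                new CachedEntrySketch(ByteBuffer.allocateDirect(4096), 256, data);
        try {
            Path tmp = Files.createTempFile("sketch", ".bin");
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ByteBuffer[] views = entry.acquire();
                while (!writeOnce(ch, views)) {
                    // A real server would register a write listener here (step 4).
                }
                entry.release();
                return entry.reclaimable() && Arrays.equals(data, Files.readAllBytes(tmp));
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The retry loop in `demo()` is only there because a file channel never reports a full send buffer; on a non-blocking socket the partial-write case is where the listener from step 4 would take over and later resume with the same views.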
>> On a dual-core Intel i7 system (Stuart's laptop), we easily get
>> 100k requests per second on small files (808 bytes) using the
>> loopback device. Testing a variety of sizes we overall push around
>> gigabytes a second. My older Core 2 Quad system (Q6700 CPU) does
>> around 80k rps and around 700-800 MB/s. There are some limitations
>> we run into with the load driver (currently using httperf). It
>> can only use one CPU, so HTTP pipelining (sending multiple requests
>> on the same connection) is necessary to drive that level of load.
>> Performance scales well with a large number of connections. I can
>> drive close to the same traffic with 10k connections, but the
>> connection setup time and maintenance adds a bit of cost.
>> Another interesting aspect is OS overhead. Tomaz was able to
>> his results by using an ethernet adapter over a loopback, and
>> multiple hosts. This is likely because the TCP stack was half as
>> busy. Also, connection tracking in iptables has a big effect
>> (5-6%), so disabling it helps quite a bit.
>> Future Research Possibilities
>> It appears we could support AIO and non-blocking logic across the
>> board if we wrote native code that uses the Linux kernel
>> A big problem is that the filesystem must support non-blocking
>> operations, and most don't across the board. XFS appears to
>> so it might be worth exploring AIO on XFS. We would still want to
>> cache like above though, because the interface only works with
>> unbuffered direct i/o. The big thing we would be saving is that
>> context switch for the hand off.
>> NIO also does some unnecessary locking due to its API design, for which we
>> have measured an impact under contention. At some point we
>> consider writing a simple portable native backend for XNIO, which
>> bypassed all of that. IMO we still need very good perf on standard
>> NIO, so should keep the focus on that for now.