So just a quick update on what we have, and where I think we should be
going in the near future.
This currently provides basic HTTP/1.1 functionality, which is achieved
by chaining together various handlers. It has basic support for Cookies,
Sessions, file serving, form data parsing and a few other things.
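To make the handler-chaining design concrete, here is a rough sketch of the idea; the interface and class names below are illustrative stand-ins, not the actual Undertow API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the handler-chaining idea; these names are
// hypothetical and do not reflect the real Undertow API.
interface HttpHandler {
    void handleRequest(Exchange exchange);
}

// A minimal stand-in for the per-request exchange object.
final class Exchange {
    final Map<String, String> responseHeaders = new HashMap<>();
}

// Each handler does one small job and then delegates to the next handler
// in the chain, so features like cookies, sessions and file serving
// compose freely without a monolithic request pipeline.
final class ServerHeaderHandler implements HttpHandler {
    private final HttpHandler next;

    ServerHeaderHandler(HttpHandler next) {
        this.next = next;
    }

    @Override
    public void handleRequest(Exchange exchange) {
        exchange.responseHeaders.put("Server", "undertow");
        next.handleRequest(exchange);
    }
}
```

A chain is then just nested construction, e.g. `new ServerHeaderHandler(rootHandler)`, with the innermost handler producing the actual response.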
I think the short term priorities here should be:
- Add initial SSL support.
- Multipart upload handling - We need a MIME parser that can handle
Buffers and Channels rather than Streams.
- Improve session handling - The current implementation is very basic.
- Initial security implementation - This is a big topic; I will send an
email about it to the list shortly.
Some basic servlet functionality is implemented: Servlets, Filters,
Listeners and static resource serving. This implementation is far from
complete though; for example, a large number of the methods on
HttpServletRequest/Response are not functional yet. As much as possible
this functionality is just a lightweight wrapper around the underlying
functionality in the core server.
My basic plan here is to make servlet functionality, and the core server
functionality that servlet support requires, a priority. Without servlet
support we are unlikely to get any community involvement. Also, once we
have a decent level of servlet functionality we can start TCK testing,
which will likely show up a lot of issues.
Deployment is still in the early stages, but as most of the metadata
parsing and annotation handling code is the same, we can actually deploy
and run some of the AS7 quickstarts (at least helloworld-html5 and
helloworld-gwt).
I think this should also be a priority, as it is required to run Servlet
apps. As the core deployment code should not change much from what we
already have I think most of the work here is actually in the management
side of things.
JSP support has not started yet; however, as it is necessary for JSF, I
think we should attempt to get basic support up and running sooner
rather than later. I think the best approach here is to just take our
existing JSP
implementation from JBoss Web. I want to keep this in a separate
repository, and if possible not tie our servlet implementation to this,
so we have as little JSP code as possible in the servlet implementation.
On and off for the past couple of weeks I have been working on the file serving implementation in Undertow. This led to lots and lots of benchmarking, which in turn led to a lot of bug and performance fixes in various areas of the web server and XNIO.
The outcome that seems to work best with what is available in Java NIO is a mixed caching/sendfile approach. Cache maintenance is completely non-blocking and relies on a modified concurrent direct deque that supports deletion in the middle, which allows an access list to be stored. To further reduce possible contention, we sample accesses at five-request intervals (only every fifth request updates the access list).
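The sampling idea can be sketched in a few lines; the class and constant names here are hypothetical, chosen just to illustrate the "requests % 5" check described above:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of sampled access tracking: only every 5th request
// to a cache entry pays the cost of touching the shared access list,
// which keeps contention on the concurrent structure low.
final class SampledCacheEntry {
    private static final int SAMPLE_INTERVAL = 5;
    private final AtomicLong hits = new AtomicLong();

    /** Returns true when this access should update the shared access list. */
    boolean recordAccess() {
        return hits.incrementAndGet() % SAMPLE_INTERVAL == 0;
    }

    long hitCount() {
        return hits.get();
    }
}
```

The trade-off is that the access list is only approximately LRU-ordered, which is acceptable for cache eviction and much cheaper under load.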
Blocking transfer process (default)
1. If entry is cached jump to non-blocking cached transfer process
2. Otherwise kick off the file operations to a task on a worker thread
3. If this is a HEAD request, the task simply executes a stat call and returns the appropriate details. (Note that stat calls can block [metadata read], which is why they are handled the same way as a transfer, on a worker thread.)
4. If the file has not been accessed at least 5 times recently, or there is no cache space, or the file is too big, it is transferred in blocking mode using FileChannel.transferTo, which under the hood uses sendfile or other OS-optimized file transfer calls.
5. Otherwise the file is buffered, cached and then transferred using gathering writes. The caching process will attempt to reclaim space from older cache entries following an LRU-like approach.
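The decision logic of the blocking path above can be sketched as follows; the thresholds, helper names and the empty caching branch are all illustrative assumptions, not Undertow's actual code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hedged sketch of the blocking transfer path: hot, small files go to the
// cache, everything else uses FileChannel.transferTo, which maps to
// sendfile (or an equivalent zero-copy primitive) on most platforms.
final class BlockingFileTransfer {
    private static final long MAX_CACHEABLE_SIZE = 1024 * 1024; // assumed limit
    private static final int MIN_HITS_TO_CACHE = 5;

    /** Runs on a worker thread, since file I/O and stat calls can block. */
    static void serve(Path file, WritableByteChannel socket, int recentHits,
                      boolean cacheHasSpace) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            boolean cacheable = recentHits >= MIN_HITS_TO_CACHE
                    && cacheHasSpace
                    && size <= MAX_CACHEABLE_SIZE;
            if (!cacheable) {
                // Blocking zero-copy transfer; loop because transferTo may
                // transfer fewer bytes than requested.
                long pos = 0;
                while (pos < size) {
                    pos += ch.transferTo(pos, size - pos, socket);
                }
            } else {
                // Step 5: buffer into cache slices and use gathering writes
                // (omitted in this sketch).
            }
        }
    }
}
```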
Non-blocking transfer process (when cached)
1. All cache entries are stored in blocks (slices) within a big direct memory buffer. This uses native memory outside of the Xmx settings of Java, and has the advantage that it can be written directly to a socket without copying.
2. When entries are retrieved they are reference counted as a group, to prevent reclamation from corrupting the in-flight transfer state.
3. The buffers are written in a single gathering write, unless the socket send buffer fills.
4. If the send buffer is full, an event listener is registered and will be executed asynchronously, in a non-blocking fashion, later.
5. The remaining portion, if any, is transferred, and the reference counts are restored.
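Steps 2-5 above can be sketched roughly like this; the class is a hypothetical simplification (real reclamation and listener registration are omitted), but the vectored write itself uses the standard NIO GatheringByteChannel API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not Undertow's real classes): the cached slices are
// pinned via a group reference count while being written to the socket in
// one vectored write, so the cache cannot reclaim them mid-transfer.
final class CachedEntryWriter {
    private final ByteBuffer[] slices;  // views of the cached direct-memory slices
    private final AtomicInteger refs;   // group reference count for the entry

    CachedEntryWriter(ByteBuffer[] slices, AtomicInteger refs) {
        this.slices = slices;
        this.refs = refs;
    }

    /**
     * Attempts the whole transfer in one gathering write. Returns true when
     * done; false means the socket send buffer filled, and the caller should
     * register a write listener and call this again once writable.
     */
    boolean writeSome(GatheringByteChannel socket) throws IOException {
        refs.incrementAndGet(); // pin the slices while we use them
        try {
            socket.write(slices);
            // Gathering writes drain buffers in order, so if the last slice
            // is empty the whole entry has been sent.
            return !slices[slices.length - 1].hasRemaining();
        } finally {
            refs.decrementAndGet(); // allow reclamation once we are done
        }
    }
}
```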
On a dual-core Intel i7 system (Stuart's laptop), we easily get over 100k requests per second on small files (808 bytes) using the loopback device. Testing a variety of sizes, we push around 1 gigabyte per second overall. My older Core 2 Quad system (Q6700 CPU) does around 80k requests per second and around 700-800 MB/s. There are some limitations we run into with the load driver (currently httperf). Httperf can only use one CPU, so HTTP pipelining (sending multiple requests on the same connection) is necessary to drive that level of load. Performance scales well with a large number of connections: I can drive close to the same traffic with 10k connections, but connection setup time and maintenance adds a bit of cost.
Another interesting aspect is OS overhead. Tomaz was able to improve his results by using an ethernet adapter and multiple hosts instead of the loopback device, likely because the TCP stack was only half as busy. Connection tracking in iptables also has a big effect (almost 5-6%), so disabling it helps quite a bit.
Future Research Possibilities
It appears we could support AIO and non-blocking logic across the board if we wrote native code against the Linux kernel interfaces. A big problem is that the filesystem must support non-blocking operations, and most don't. XFS appears to, so it might be worth exploring AIO on XFS. We would still want to cache as above though, because the interface only works with unbuffered direct I/O. The big thing we would save is the context switch for the hand-off to a worker thread.
NIO also does some unnecessary locking due to its API design, which we have measured to have an impact under contention. At some point we could consider writing a simple, portable native backend for XNIO that bypasses all of that. IMO we still need very good performance on standard NIO, so we should keep the focus on that for now.