Non-deterministic packet stream corruption
Yang Zhang
yanghatespam at gmail.com
Thu Jun 18 03:10:21 EDT 2009
Yang Zhang wrote:
> Hi, we have an application that uses Netty with the following pipeline:
>
> ChannelPipeline pipeline = Channels.pipeline();
>
> pipeline.addLast("lengthbaseddecoder",
>     new LengthFieldBasedFrameDecoder(
>         cfg.getMaximumMessageSize(), 0, 4, 0, 4));
> pipeline.addLast("lengthprepender", new LengthFieldPrepender(4));
>
> pipeline.addLast("protobufdecoder", new ProtobufDecoder(
>     PubSubProtocol.PubSubRequest.getDefaultInstance()));
> pipeline.addLast("protobufencoder", new ProtobufEncoder());
>
> pipeline.addLast("executor", new ExecutionHandler(
>     new OrderedMemoryAwareThreadPoolExecutor(MAX_WORKER_THREADS,
>         MAX_CHANNEL_MEMORY_SIZE, MAX_TOTAL_MEMORY_SIZE)));
>
> // Dependency injection.
> pipeline.addLast("umbrellahandler", uh);
>
> In our application, a client connects to a server and starts sending a
> stream of messages to which the server replies with app-level acks.
> However, if we pump messages quickly enough, then often (but not all the
> time), we see a situation where the server is receiving a bad
> (truncated) packet. The stream consists of a series of app-level
> frames: each frame should have a few bytes of header data (including a
> 4-byte length field inserted by the LengthFieldPrepender) and a payload
> of 1024 'a' characters, but one frame somehow ends up with fewer 'a'
> characters (even though the length field is 1024), and so the next
> decoded length ends up being smack in the middle of the payload of the
> next packet, which translates into an intolerably large length ("aaaa" =
> 0x61616161).
>
> This error happens whether or not we specify 1 as the third parameter to
> the constructor of NioClientSocketChannelFactory (the number of worker
> threads). It smells like some sort of race condition - imagine one
> frame, mid-write to some low-level buffer, being overwritten partway
> through by another frame. The fact that the length field is correct
> (1024) suggests that the LengthFieldPrepender and ProtobufEncoder are
> both working properly, and that something deeper down the stack (in
> Netty?) is misbehaving.
>
> We determined that the problem probably originates on the client, since
> tcpdump/tcpflow shows that the data stream on the wire is indeed
> corrupted. (Also, a separate C++ implementation of the client doesn't
> trigger this behavior.)
>
> We're new to using Netty and we were wondering if we were possibly doing
> something wrong along the way. We connect with:
>
> private static ChannelFactory f = new NioClientSocketChannelFactory(
>     Executors.newCachedThreadPool(),
>     Executors.newCachedThreadPool());
>
> ...
> public void connect() {
>     ClientBootstrap bootstrap = new ClientBootstrap(f);
>     bootstrap.setPipelineFactory(ClientChannelPipelineFactory.instance());
>     bootstrap.setOption("tcpNoDelay", true);
>     bootstrap.setOption("keepAlive", true);
>     ChannelFuture fut = bootstrap.connect();
>     fut.addListener(...);
> }
>
> Let me know if there's any other information which may be useful. Thanks
> in advance for any guesses.
So the more I thought about it, the more convinced I was that this was
happening somewhere below us in Netty: the LengthFieldPrepender (the
lowest thing on our stack) was still prepending a correct length field,
so something underneath it was truncating the message (likely a race
condition somewhere).
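That truncation theory can be simulated with plain java.nio (a sketch only, no Netty involved): build two well-formed frames the way LengthFieldPrepender would, drop some payload bytes from the first, and the next length the decoder reads lands in the middle of the second frame's payload - four 'a' bytes, i.e. 0x61616161.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TruncationDemo {
    // A frame as LengthFieldPrepender(4) would emit it: a 4-byte big-endian
    // length field followed by the payload (payloadLen 'a' characters).
    static byte[] frame(int payloadLen) {
        ByteBuffer f = ByteBuffer.allocate(4 + payloadLen);
        f.putInt(payloadLen);
        byte[] payload = new byte[payloadLen];
        Arrays.fill(payload, (byte) 'a');
        f.put(payload);
        return f.array();
    }

    public static void main(String[] args) {
        byte[] f1 = frame(1024);
        byte[] f2 = frame(1024);

        // Simulated corruption: frame 1 loses 100 payload bytes below the
        // encoders, so its length field still claims 1024.
        ByteBuffer stream = ByteBuffer.allocate(f1.length - 100 + f2.length);
        stream.put(f1, 0, f1.length - 100);
        stream.put(f2);
        stream.flip();

        int len1 = stream.getInt();                // 1024: the header is intact
        stream.position(stream.position() + len1); // "payload" read overruns
                                                   // 100 bytes into frame 2
        int len2 = stream.getInt();                // reads 'a','a','a','a'
        System.out.println(len1 + " 0x" + Integer.toHexString(len2));
        // prints: 1024 0x61616161
    }
}
```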
That's when I checked the version of Netty we were using and found that
we were still on ALPHA3. I upgraded to BETA3 and the bug seems to have
gone away. However, I'm still not sure what was fixed in between. Here
are the changelogs - I didn't spot anything that seemed even remotely
related:
https://jira.jboss.org/jira/secure/BrowseProject.jspa?id=12310721&subset=-1
Any ideas? Perhaps this one? We also had very few connections (just
one), and it seemed like a race condition, though it's unclear how this
particular symptom could have manifested from a failed selector wakeup
(since subsequent bytes in the stream apparently get written without
issue).
https://jira.jboss.org/jira/browse/NETTY-114
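For what it's worth, the "aaaa" arithmetic in the original report checks out: 'a' is 0x61, so four payload bytes misread big-endian as a 32-bit length give 0x61616161, i.e. 1,633,771,873 - far beyond any sane maximum message size, which is presumably what made the decoder reject the frame. A quick check:

```java
import java.nio.ByteBuffer;

public class BadLengthCheck {
    public static void main(String[] args) {
        // Four 'a' payload bytes misread as a big-endian 32-bit length field.
        int len = ByteBuffer.wrap("aaaa".getBytes()).getInt();
        System.out.println("0x" + Integer.toHexString(len) + " = " + len);
        // prints: 0x61616161 = 1633771873
    }
}
```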
--
Yang Zhang
http://www.mit.edu/~y_z/
More information about the netty-users mailing list