Non-deterministic packet stream corruption

Yang Zhang yanghatespam at gmail.com
Thu Jun 18 03:10:21 EDT 2009


Yang Zhang wrote:
> Hi, we have an application that uses Netty with the following pipeline:
> 
> ChannelPipeline pipeline = Channels.pipeline();
> 
> pipeline.addLast("lengthbaseddecoder",
>         new LengthFieldBasedFrameDecoder(cfg
>                 .getMaximumMessageSize(), 0, 4, 0, 4));
> pipeline.addLast("lengthprepender", new LengthFieldPrepender(4));
> 
> pipeline.addLast("protobufdecoder", new ProtobufDecoder(
>         PubSubProtocol.PubSubRequest.getDefaultInstance()));
> pipeline.addLast("protobufencoder", new ProtobufEncoder());
> 
> pipeline.addLast("executor", new ExecutionHandler(
>         new OrderedMemoryAwareThreadPoolExecutor(MAX_WORKER_THREADS,
>                 MAX_CHANNEL_MEMORY_SIZE, MAX_TOTAL_MEMORY_SIZE)));
> 
> // Dependency injection.
> pipeline.addLast("umbrellahandler", uh);
> 
> In our application, a client connects to a server and starts sending a 
> stream of messages to which the server replies with app-level acks. 
> However, if we pump messages quickly enough, then often (but not all the 
> time), we see a situation where the server is receiving a bad 
> (truncated) packet.  The stream consists of a series of app-level 
> frames: each frame should have a few bytes of header data (including a 
> 4-byte length field inserted by the LengthFieldPrepender) and a payload 
> of 1024 'a' characters, but one frame somehow ends up with fewer 'a' 
> characters (even though the length field is 1024), and so the next 
> decoded length ends up being smack in the middle of the payload of the 
> next packet, which translates into an intolerably large length ("aaaa" = 
> 0x61616161).
> 
> This error happens whether or not we specify 1 as the third parameter to 
> the constructor of NioClientSocketChannelFactory (the number of worker 
> threads).  It smells like some sort of race condition - imagine that one 
> frame being written to some low-level buffer is overwritten starting in 
> the middle by another frame.  The fact that the length field is correct 
> (1024) suggests that the LengthFieldPrepender and ProtobufEncoder are 
> both working properly, and that there's something deeper down the stack 
> (in Netty?) that is misbehaving.
> 
> We determined that the problem is probably originating on the client 
> since tcpdump/tcpflow is showing that the actual data stream is indeed 
> corrupted.  (Also, a separate C++ implementation of the client doesn't 
> trigger this behavior.)
> 
> We're new to using Netty and we were wondering if we were possibly doing 
> something wrong along the way.  We connect with:
> 
> private static ChannelFactory f = new NioClientSocketChannelFactory(
>         Executors.newCachedThreadPool(),
>         Executors.newCachedThreadPool());
> 
> ...
> public void connect() {
>     ClientBootstrap bootstrap = new ClientBootstrap(f);
>     bootstrap.setPipelineFactory(ClientChannelPipelineFactory.instance());
>     bootstrap.setOption("tcpNoDelay", true);
>     bootstrap.setOption("keepAlive", true);
>     ChannelFuture fut = bootstrap.connect();
>     fut.addListener(...);
> }
> 
> Let me know if there's any other information which may be useful. Thanks 
> in advance for any guesses.

So the more I thought about it, the more convinced I was that this was 
happening somewhere below us in Netty, since the LengthFieldPrepender 
(the lowest thing on the stack) was still prepending a correct length 
field - something underneath that was truncating the message (race 
condition somewhere, likely).
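
To make the symptom concrete, here is a minimal sketch in plain Java (no 
Netty involved) of how a single truncated frame desynchronizes a 
length-field decoder. It assumes a simplified frame of just a 4-byte 
big-endian length followed by 'a' bytes (our real frames also carry 
protobuf header bytes), and the class and method names are made up for 
illustration. It only reproduces the decoding symptom, not the 
underlying bug.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.OptionalInt;

public class FrameDesyncDemo {
    // Builds one frame: a 4-byte big-endian length field followed by
    // actualPayload bytes of 'a'. Passing actualPayload < declaredLength
    // simulates the truncated frame seen on the wire.
    static byte[] frame(int declaredLength, int actualPayload) {
        byte[] out = new byte[4 + actualPayload];
        Arrays.fill(out, (byte) 'a');
        ByteBuffer.wrap(out).putInt(declaredLength); // overwrites bytes 0..3
        return out;
    }

    // Concatenates frames into one contiguous byte stream.
    static byte[] concat(byte[]... parts) {
        int total = 0;
        for (byte[] p : parts) total += p.length;
        ByteBuffer buf = ByteBuffer.allocate(total);
        for (byte[] p : parts) buf.put(p);
        return buf.array();
    }

    // Walks the stream the way a length-field decoder would and returns
    // the first length that is negative or larger than maxFrame, if any.
    static OptionalInt firstBadLength(byte[] stream, int maxFrame) {
        ByteBuffer buf = ByteBuffer.wrap(stream);
        while (buf.remaining() >= 4) {
            int len = buf.getInt();
            if (len < 0 || len > maxFrame) return OptionalInt.of(len);
            if (buf.remaining() < len) break; // incomplete final frame
            buf.position(buf.position() + len); // skip the payload
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        byte[] good = frame(1024, 1024);
        byte[] truncated = frame(1024, 1000); // length says 1024, 24 bytes missing
        byte[] stream = concat(good, truncated, good, good);
        OptionalInt bad = firstBadLength(stream, 4096);
        // prints "first bad length: 0x61616161"
        System.out.printf("first bad length: 0x%08x%n", bad.getAsInt());
    }
}
```

The decoder trusts the 1024 in the truncated frame, swallows the next 
frame's length field as payload, and then reads four 'a' bytes as the 
following length.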

That's when I checked the version of Netty we were using and found that 
we were still on ALPHA3. I upgraded to BETA3 and the bug seems to have 
gone away. However, I'm still not sure what was fixed since then. Here 
are the changelogs - I didn't spot anything that seemed even indirectly 
related:

https://jira.jboss.org/jira/secure/BrowseProject.jspa?id=12310721&subset=-1

Any ideas? Perhaps NETTY-114 (linked below)? We also had few connections 
(just 1) and it seemed like a race condition, though it's unclear how 
this particular symptom could've manifested from a failed selector 
wakeup (since subsequent bytes in the stream apparently get written 
without issue).

https://jira.jboss.org/jira/browse/NETTY-114
-- 
Yang Zhang
http://www.mit.edu/~y_z/