Non-deterministic packet stream corruption

"이희승 (Trustin Lee)" trustin at gmail.com
Thu Jun 18 03:59:01 EDT 2009


On 2009-06-18 4:10 PM, Yang Zhang wrote:
> Yang Zhang wrote:
>> Hi, we have an application that uses Netty with the following pipeline:
>>
>> ChannelPipeline pipeline = Channels.pipeline();
>>
>> pipeline.addLast("lengthbaseddecoder",
>>         new LengthFieldBasedFrameDecoder(cfg
>>                 .getMaximumMessageSize(), 0, 4, 0, 4));
>> pipeline.addLast("lengthprepender", new LengthFieldPrepender(4));
>>
>> pipeline.addLast("protobufdecoder", new ProtobufDecoder(
>>         PubSubProtocol.PubSubRequest.getDefaultInstance()));
>> pipeline.addLast("protobufencoder", new ProtobufEncoder());
>>
>> pipeline.addLast("executor", new ExecutionHandler(
>>         new OrderedMemoryAwareThreadPoolExecutor(MAX_WORKER_THREADS,
>>                 MAX_CHANNEL_MEMORY_SIZE, MAX_TOTAL_MEMORY_SIZE)));
>>
>> // Dependency injection.
>> pipeline.addLast("umbrellahandler", uh);
>>
>> In our application, a client connects to a server and starts sending a
>> stream of messages to which the server replies with app-level acks.
>> However, if we pump messages quickly enough, then often (but not all
>> the time), we see a situation where the server is receiving a bad
>> (truncated) packet.  The stream consists of a series of app-level
>> frames: each frame should have a few bytes of header data (including a
>> 4-byte length field inserted by the LengthFieldPrepender) and a
>> payload of 1024 'a' characters, but one frame somehow ends up with
>> fewer 'a' characters (even though the length field is 1024), and so
>> the next decoded length ends up being smack in the middle of the
>> payload of the next packet, which translates into an intolerably large
>> length ("aaaa" = 0x61616161).
>>
>> This error happens whether or not we specify 1 as the third parameter
>> to the constructor of NioClientSocketChannelFactory (the number of
>> worker threads).  It smells like some sort of race condition - imagine
>> that one frame being written to some low-level buffer is overwritten
>> starting in the middle by another frame.  The fact that the length
>> field is correct (1024) suggests that the LengthFieldPrepender and
>> ProtobufEncoder are both working properly, and that there's something
>> deeper down the stack (in Netty?) that is misbehaving.
>>
>> We determined that the problem is probably originating on the client
>> since tcpdump/tcpflow is showing that the actual data stream is indeed
>> corrupted.  (Also, a separate C++ implementation of the client doesn't
>> trigger this behavior.)
>>
>> We're new to using Netty and we were wondering if we were possibly
>> doing something wrong along the way.  We connect with:
>>
>> private static ChannelFactory f = new NioClientSocketChannelFactory(
>>         Executors.newCachedThreadPool(),
>>         Executors.newCachedThreadPool());
>>
>> ...
>> public void connect() {
>>     ClientBootstrap bootstrap = new ClientBootstrap(f);
>>     bootstrap.setPipelineFactory(ClientChannelPipelineFactory.instance());
>>     bootstrap.setOption("tcpNoDelay", true);
>>     bootstrap.setOption("keepAlive", true);
>>     ChannelFuture fut = bootstrap.connect();
>>     fut.addListener(...);
>> }
>>
>> Let me know if there's any other information which may be useful.
>> Thanks in advance for any guesses.
> 
> So the more I thought about it, the more convinced I was that this was
> happening somewhere below us in Netty, since the LengthFieldPrepender
> (the lowest thing on the stack) was still prepending a correct length
> field - something underneath that was truncating the message (race
> condition somewhere, likely).
> 
> That's when I checked the version of netty we were using and found that
> we were still on ALPHA3. I upgraded to BETA3 and the bug seems to have
> gone away. However, I'm still not sure what was fixed since then. Here
> are the changelogs - I didn't spot anything that seemed directly
> related:
> 
> https://jira.jboss.org/jira/secure/BrowseProject.jspa?id=12310721&subset=-1
> 
> Any ideas? Perhaps this one? We also had few connections (just 1) and it
> seemed like a race condition, though it's unclear how this particular
> symptom could've manifested from a failed selector wakeup (since
> subsequent bytes in the stream apparently get written without issue).
> 
> https://jira.jboss.org/jira/browse/NETTY-114

I made various changes inside the core, but some of them were not
properly documented (I was too lazy at the time).  So I guess the fix
might have been applied without an assigned JIRA issue.

Anyway, good to hear that the bug is gone now.  Phew!
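
For anyone reading this thread later, the desynchronization described in
the quoted message can be reproduced without Netty at all.  The sketch
below is only an illustration: it imitates 4-byte length-prefixed framing
with plain java.nio, and the class name, the helper methods, the 1000-byte
truncation point and the 1 MiB frame limit are all made up for the example
rather than taken from the application or from Netty.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it does not use or reproduce any Netty code.
class FrameDesyncDemo {

    /** Prepends a 4-byte big-endian length to the payload. */
    static byte[] frame(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    /**
     * Builds three frames of 1024 'a' bytes each, where the middle frame
     * keeps its length field (1024) but loses its last 24 payload bytes.
     * The truncation point is an assumption chosen for the example.
     */
    static byte[] corruptedStream() {
        byte[] payload = new byte[1024];
        Arrays.fill(payload, (byte) 'a');
        byte[] good = frame(payload);
        byte[] truncated = Arrays.copyOf(good, 4 + 1000);

        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        wire.write(good, 0, good.length);
        wire.write(truncated, 0, truncated.length);
        wire.write(good, 0, good.length);
        return wire.toByteArray();
    }

    /**
     * Walks the stream the way a length-prefixed reader would and returns
     * every length field it sees, stopping at the first implausible one.
     */
    static List<Integer> readLengths(byte[] stream, int maxFrameSize) {
        List<Integer> lengths = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(stream);
        while (in.remaining() >= 4) {
            int length = in.getInt();
            lengths.add(length);
            if (length < 0 || length > maxFrameSize || length > in.remaining()) {
                break; // the reader has lost frame alignment and cannot go on
            }
            in.position(in.position() + length); // skip over the payload
        }
        return lengths;
    }

    public static void main(String[] args) {
        for (int length : readLengths(corruptedStream(), 1 << 20)) {
            System.out.printf("length field = %d (0x%08x)%n", length, length);
        }
    }
}
```

With this input the reader sees 1024, 1024 and then 1633771873
(0x61616161): after the short frame it swallows the next frame's header
and the first 20 bytes of its payload, then reads four 'a' bytes as a
length field.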

-- 
— Trustin Lee, http://gleamynode.net/
