[jboss-jira] [JBoss JIRA] (JGRP-1718) Message corruption under heavy load

Thu Oct 17 17:34:01 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12823049#comment-12823049 ] 

Rich DiCroce commented on JGRP-1718:
------------------------------------

I spent the whole day today trying to nail this down, and I have some more information now. Unfortunately, this information only makes the issue more baffling.

To answer your questions:
- As part of my investigation, I added some logging of the byte arrays being processed. I have confirmed that, when an error occurs, the byte array I get in my RequestHandler does not match the byte array I passed to the Message constructor. I have also determined that when a byte array gets corrupted, I can send a second Message with the exact same byte array and request options, and the byte array in the second Message is readable. I have also made sure I can deserialize an object immediately after serializing it (before sending the Message), and I have also tried deserializing a second time after the first time failed (when receiving the Message; the second time also fails).
- I tried using normal Java serialization, and that does not seem to exhibit the problem. This does not necessarily implicate Kryo or exonerate JGroups, however. See below.
- Kryo has input/output buffers, specifically, the Input and Output classes. I am not reusing instances of those classes. I am reusing instances of the Kryo class itself, which handles the actual serialization/deserialization. Kryo is not thread-safe, so I am obtaining the instances from a ThreadLocal to prevent concurrent use.
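
For reference, the sent-versus-received comparison described in the first point can be sketched roughly as follows. This is only an illustration using the plain JDK, not the logging code actually used in the project; the class name `ByteArrayDiff` and the method `diffBits` are made up for this example, and bit 7 denotes the most significant bit of a byte.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of JGroups or Kryo): reports every bit that
// differs between the array handed to the Message constructor and the array
// seen on the receiving side.
public class ByteArrayDiff {

    public static List<String> diffBits(byte[] sent, byte[] received) {
        if (sent.length != received.length) {
            throw new IllegalArgumentException(
                    "length mismatch: " + sent.length + " vs " + received.length);
        }
        List<String> diffs = new ArrayList<>();
        for (int i = 0; i < sent.length; i++) {
            // XOR leaves a 1 in every bit position where the two bytes differ.
            int xor = (sent[i] ^ received[i]) & 0xFF;
            for (int bit = 7; bit >= 0; bit--) {
                if ((xor & (1 << bit)) != 0) {
                    int from = (sent[i] >> bit) & 1;
                    diffs.add("byte " + i + ", bit " + bit + ": " + from + " -> " + (from ^ 1));
                }
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Made-up sample data: one byte differs in its most significant bit.
        byte[] sent = {0x10, 0x01, (byte) 0xF9, 0x20};
        byte[] received = {0x10, 0x01, 0x79, 0x20};
        for (String d : diffBits(sent, received)) {
            System.out.println(d); // prints "byte 2, bit 7: 1 -> 0"
        }
    }
}
```

Logging the output of such a helper on the receiving side, next to a hex dump of the original array, makes it easy to see whether every failure involves the same kind of change.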

I have determined that the byte arrays are being corrupted in a way that's too consistent to be random chance. It seems to be triggered by a particular bit sequence, specifically:

{code}00000001{code} followed by {code}11111001{code} or {code}11100101{code}

These two bytes can be anywhere in the array. The corruption that occurs is always the same: the first bit of the second byte is flipped from 1 to 0. Unfortunately, these bit sequences only seem to be part of the cause. I believe a race condition of some sort is also at work here, because I've been unable to create a simple test that reproduces the problem.
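
To make the pattern concrete, here is a rough sketch that scans a byte array for the two sequences above and shows what the described corruption does to the second byte. It assumes "first bit" means the leftmost (most significant) bit as the bytes are written above; the class and method names are invented for this example and are not JGroups or Kryo APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for the suspect two-byte sequences.
public class SuspectPatternScanner {

    // Second bytes seen after 00000001 in corrupted arrays:
    // 11111001 = 0xF9 and 11100101 = 0xE5.
    private static final int[] SUSPECT_SECOND_BYTES = {0xF9, 0xE5};

    /** Returns the offset of each 0x01 byte that is followed by a suspect byte. */
    public static List<Integer> findSuspectOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) != 0x01) {
                continue;
            }
            int second = data[i + 1] & 0xFF;
            for (int suspect : SUSPECT_SECOND_BYTES) {
                if (second == suspect) {
                    offsets.add(i);
                }
            }
        }
        return offsets;
    }

    /** Mimics the observed corruption by clearing the most significant bit. */
    public static byte clearHighBit(byte b) {
        return (byte) (b & 0x7F);
    }

    public static void main(String[] args) {
        // Made-up sample data containing both sequences.
        byte[] data = {0x00, 0x01, (byte) 0xF9, 0x01, (byte) 0xE5, 0x01, 0x02};
        System.out.println(findSuspectOffsets(data));
        System.out.printf("0xF9 -> 0x%02X%n", clearHighBit((byte) 0xF9) & 0xFF);
        System.out.printf("0xE5 -> 0x%02X%n", clearHighBit((byte) 0xE5) & 0xFF);
    }
}
```

Under that assumption, 0xF9 would become 0x79 and 0xE5 would become 0x65. Running a scan like this on the sender before each send would show how often the sequences occur in messages that arrive intact, which is one way to confirm that the bit pattern alone is not sufficient to trigger the problem.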

Another twist that only makes the issue weirder: changing the loopback setting of the UDP transport affects the frequency of the corruption. With loopback = false, the problem occurs much more rarely, to the point that I initially thought it had eliminated the problem altogether. I only observed one corruption during my testing with loopback = false, and curiously, a third bit pattern that I have not observed with loopback = true got corrupted:

{code}00000001 11000100{code}

Again, the first bit of the second byte was flipped from 1 to 0. I also logged the number of nodes that were unable to read the message, and there was only one error despite the cluster having two nodes and the message being a multicast to all nodes. This implies that the corruption occurs when the message is received, not when it is sent.

Having looked at the JGroups source code, it seems like there would be less concurrency with loopback = false since a multicast message would not be immediately sent back up the stack, potentially in parallel with messages being received from other nodes. Hence the speculation about a race condition.

Does any of this make any sense to you? Because my brain feels like it's about to melt. :-)

> Message corruption under heavy load
> -----------------------------------
>
>                 Key: JGRP-1718
>                 URL: https://issues.jboss.org/browse/JGRP-1718
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.2.12
>         Environment: JBoss AS 7.2.0.Final (includes JGroups 3.2.7.Final, upgrading to 3.2.12.Final had no effect)
>            Reporter: Rich DiCroce
>            Assignee: Bela Ban
>
> In my project, I'm using the Kryo serialization library to serialize/deserialize objects to/from byte[]. Some of these byte arrays appear to be getting corrupted during transmission.
> The problem happens very infrequently and only under load. I have to send tens of thousands of messages from hundreds of threads just to get two or three errors. The exception occurs when Kryo tries to deserialize the byte[] on the receiving node. So far, all of the errors I've seen happen when Kryo tries to find a class to deserialize an object, and the stack traces end with something like:
> Caused by: java.lang.ClassNotFoundException: java.util.Date[SOH][GS]
> Note that [SOH] and [GS] are not literally those strings; they stand for the unprintable ASCII characters 0x01 and 0x1D.
> I'm confident that Kryo itself is not the source of the problem. I added some code to my project to deserialize the message on the sending node, immediately after serializing it, and log any exception thrown from the deserialization. The sending node did not log an exception, but the receiving node still did. My code passes the byte[] to the Message constructor without doing anything to it.
> It also appears that I am not the first person to encounter this issue. See https://code.google.com/p/kryo/issues/detail?id=102

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira