[jbossws-issues] [JBoss JIRA] Commented: (JBWS-1716) Erroneous UTF-8 character encoding when marshalling on machines with non-UTF-8 default encoding

Thu Jun 21 16:31:51 EDT 2007

    [ http://jira.jboss.com/jira/browse/JBWS-1716?page=comments#action_12366374 ] 

floe fliep commented on JBWS-1716:
----------------------------------

streamSource.getInputStream() appears to be a UTF-8 stream only when the XML was serialized by the JAXBSerializer (in my case delegating to an instance of com.sun.xml.bind.v2.runtime.MarshallerImpl), that is, when sending complex XML Schema structures.

However, when serialization goes through the SimpleSerializer, as in the case of a simple RPC string parameter, it creates a BufferedStreamResult which uses the String.getBytes() conversion method, so the machine's default encoding is used. Code snippet:

   public BufferedStreamResult(String xmlFragment)
   {     ....
         IOUtils.copyStream(getOutputStream(), new ByteArrayInputStream(xmlFragment.getBytes()));

So the two serializers output in different encodings, yet the result is ultimately processed by XMLFragment.writeSourceInternal(Writer writer) in one single encoding, so whichever output does not match that encoding gets corrupted ...
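
To illustrate the idea of pinning the charset on both sides, here is a minimal sketch. It is not JBossWS code: the class and method names (ExplicitCharsetSketch, toUtf8Stream, readUtf8) are made up for this example, and it only assumes that both the String-to-bytes step and the bytes-to-Reader step can be given an explicit UTF-8 charset instead of relying on the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JBossWS.
public class ExplicitCharsetSketch {

    // Producer side: encode with an explicit charset instead of String.getBytes(),
    // which would silently use the machine's default encoding.
    public static InputStream toUtf8Stream(String xmlFragment) {
        return new ByteArrayInputStream(xmlFragment.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: decode with the same explicit charset instead of
    // new InputStreamReader(in), which would also use the default encoding.
    public static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<lang>Fran\u00E7ais</lang>";
        String roundTripped = readUtf8(toUtf8Stream(original));
        // The round trip is lossless regardless of file.encoding.
        System.out.println(original.equals(roundTripped));
    }
}
```

With both ends fixed to the same charset, the result no longer depends on file.encoding. Whether this is the right place to fix it inside JBossWS is a separate question.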

> Erroneous UTF-8 character encoding when marshalling on machines with non-UTF-8 default encoding
> -----------------------------------------------------------------------------------------------
>
>                 Key: JBWS-1716
>                 URL: http://jira.jboss.com/jira/browse/JBWS-1716
>             Project: JBoss Web Services
>          Issue Type: Bug
>      Security Level: Public(Everyone can see) 
>    Affects Versions: jbossws-1.2.1
>            Reporter: floe fliep
>
> When sending a client request which includes a non-ASCII character such as the "ç" in "Français" on a machine whose default character encoding is set to something other than UTF-8, the character is encoded incorrectly. For example, the "ç" above is marshalled onto the network stream as 0xC3 0x83 0xC2 0xA7 instead of the legal UTF-8 sequence 0xC3 0xA7 when the machine's default character set is MS1252 (Windows).
> A workaround for this is setting the system property file.encoding=utf-8, but this causes as many problems elsewhere as it fixes (especially in the case of legacy platform-specific file reading) ...
> A forum post very likely describes the same phenomenon: http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4030510#4030510
> After some good hours of stepping through the JBossWS code, I discovered what I guess must be the culprit in the method XMLFragment.writeSourceInternal(Writer writer):
>             ....
>             if (reader == null)
>                   reader = new InputStreamReader(streamSource.getInputStream());
> Here streamSource.getInputStream() is an already UTF-8 encoded stream. However, when a new instance of InputStreamReader is created around it without an explicit charset, it uses the machine's default character encoding, thus interpreting the bytes of the UTF-8 stream in a different encoding scheme, which results in corrupted data.
> Each time data passes through the marshalling, more corruption is added, so the number of wrong characters grows as data is passed back and forth.
> I would suggest attaching a reader to the StreamSource source instance variable so that it keeps track of its encoding, but that might break things elsewhere ...
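
The byte sequence quoted in the issue description can be reproduced outside JBossWS. The following is a minimal sketch, not JBossWS code. The class and method names (DoubleEncodingDemo, toHex, misdecodeAndReencode) are invented for illustration, and it assumes the JVM provides the "windows-1252" charset, which is the standard name for what the issue calls MS1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JBossWS.
public class DoubleEncodingDemo {

    // Formats bytes as space-separated upper-case hex.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text as UTF-8, misreads those bytes in wrongCharset
    // (as a default-charset InputStreamReader would), then writes the
    // misread text out as UTF-8 again.
    public static byte[] misdecodeAndReencode(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, wrongCharset);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String cedilla = "\u00E7"; // the "ç" from "Français"

        // Correct UTF-8 encoding: C3 A7
        System.out.println(toHex(cedilla.getBytes(StandardCharsets.UTF_8)));
        // After one misdecode/re-encode pass: C3 83 C2 A7
        System.out.println(toHex(misdecodeAndReencode(cedilla, cp1252)));
    }
}
```

The two UTF-8 bytes C3 A7 are read as the two windows-1252 characters "Ã" and "§", and each of those becomes a two-byte UTF-8 sequence when written out again. This matches the four bytes reported in the issue.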
