[jbossws-issues] [JBoss JIRA] Commented: (JBWS-1716) Erroneous UTF-8 character encoding when marshalling on machines with non-UTF-8 default encoding

Thursday, 21 June 2007

    [ http://jira.jboss.com/jira/browse/JBWS-1716?page=comments#action_12366374 ] 

floe fliep commented on JBWS-1716:
----------------------------------

streamSource.getInputStream() appears to be a UTF-8 stream only when the XML was
serialized by the JAXBSerializer (in my case delegating to an instance of
com.sun.xml.bind.v2.runtime.MarshallerImpl), that is, when sending complex XMLSchema
structures.

However, when serialization goes through the SimpleSerializer, as in the case of a simple
RPC string parameter, it creates a BufferedStreamResult which uses the String.getBytes()
conversion method, so the machine's default encoding is used. Code snippet:

   public BufferedStreamResult(String xmlFragment)
   {     ....
         IOUtils.copyStream(getOutputStream(), new ByteArrayInputStream(xmlFragment.getBytes()));

So the two serializers output in different encodings, yet the result is ultimately
processed by XMLFragment.writeSourceInternal(Writer writer) in one single encoding,
so whichever output does not match that encoding ends up corrupted ...
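
A minimal sketch of the difference, not the actual JBossWS code: the class and method names below are hypothetical, and StandardCharsets assumes Java 7 or later (on older JVMs the charset name "UTF-8" can be passed as a string instead). Passing an explicit charset to String.getBytes makes the resulting bytes independent of the machine's default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of JBossWS.
public class FragmentBytes {

    // Encodes the fragment as UTF-8 regardless of the platform default charset.
    static byte[] toUtf8(String xmlFragment) {
        return xmlFragment.getBytes(StandardCharsets.UTF_8);
    }

    // Wraps the UTF-8 bytes in a stream, similar in spirit to the
    // ByteArrayInputStream used in the snippet above.
    static InputStream toUtf8Stream(String xmlFragment) {
        return new ByteArrayInputStream(toUtf8(xmlFragment));
    }

    public static void main(String[] args) {
        // "\u00E7" is the c-cedilla; the escape avoids depending on the
        // encoding of this source file.
        byte[] bytes = toUtf8("Fran\u00E7ais");
        System.out.println(bytes.length); // 9: seven ASCII bytes plus two for the c-cedilla
    }
}
```

With the no-argument getBytes() the same call would yield 8 bytes on a machine whose default is MS1252, which is the mismatch described above.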

...
 Erroneous UTF-8 character encoding when marshalling on machines with
non-UTF-8 default encoding

-----------------------------------------------------------------------------------------------

                 Key: JBWS-1716
                 URL: http://jira.jboss.com/jira/browse/JBWS-1716
             Project: JBoss Web Services
          Issue Type: Bug
      Security Level: Public(Everyone can see) 
    Affects Versions: jbossws-1.2.1
            Reporter: floe fliep

 When sending a client request which includes a non-ASCII character such as the
"ç" in "Français" on a machine whose default character
encoding is set to something other than UTF-8, the character is encoded incorrectly. For
example, the "ç" above is marshalled onto the network stream as 0xC3 0x83
0xC2 0xA7 instead of the legal UTF-8 sequence 0xC3 0xA7 when the machine's
default character set is MS1252 (Windows).
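
 The byte sequence above can be reproduced outside JBossWS. The following is a sketch under assumptions: the class and method names are hypothetical, and it assumes the JVM provides the "windows-1252" charset (the MS1252 mentioned above). It encodes the character as UTF-8, misreads those bytes as windows-1252, and encodes the misread text as UTF-8 again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the double-encoding effect.
public class DoubleEncodingDemo {

    // Encode as UTF-8, decode with the wrong charset, then encode as UTF-8 again.
    static byte[] doubleEncode(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, wrongCharset);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated hex, e.g. "0xC3 0xA7".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String cedilla = "\u00E7"; // c-cedilla, escaped to avoid source-encoding issues
        System.out.println(hex(cedilla.getBytes(StandardCharsets.UTF_8))); // 0xC3 0xA7
        System.out.println(hex(doubleEncode(cedilla, cp1252)));            // 0xC3 0x83 0xC2 0xA7
    }
}
```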
 A workaround is to set the system property file.encoding=utf-8, but this causes as
many problems elsewhere as it fixes (especially in the case of legacy platform-specific
file reading) ... .
 A forum post very likely describes the same phenomenon:
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4030510#...
 After a good few hours of stepping through the JBossWS code, I discovered what I guess
must be the culprit in the method XMLFragment.writeSourceInternal(Writer writer):
             ....
             if (reader == null)
                   reader = new InputStreamReader(streamSource.getInputStream());
 Here streamSource.getInputStream() is an already UTF-8 encoded stream. However, when a
new instance of InputStreamReader is created around it without an explicit charset, the
reader uses the machine's default character encoding, thus interpreting the bytes of the
UTF-8 stream in a different encoding scheme, resulting in corrupted data.
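
 To illustrate the effect of the reader's charset, here is a sketch that is not a proposed patch: the class and method names are hypothetical, and it assumes Java 8 or later and that "windows-1252" is available. It decodes the same two UTF-8 bytes once with an explicit UTF-8 charset and once with windows-1252, which stands in for the machine default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JBossWS.
public class ExplicitCharsetReader {

    // Reads the whole stream as text using the given charset.
    static String decode(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA7}; // UTF-8 bytes of the c-cedilla
        String right = decode(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = decode(new ByteArrayInputStream(utf8), Charset.forName("windows-1252"));
        System.out.println(right.length()); // 1: a single c-cedilla
        System.out.println(wrong.length()); // 2: two unrelated characters
    }
}
```

 The one-argument InputStreamReader constructor behaves like the second call on a machine whose default encoding is MS1252.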
 Each time data passes through the marshalling, more corruption is added, so the
character count grows increasingly wrong as data is passed back and forth.
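
 The growth can be sketched as follows, assuming each pass misreads UTF-8 bytes as windows-1252 and writes the result back as UTF-8. The class and method names are hypothetical, and real traffic may not follow exactly this pattern.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it models repeated mis-decoding, not the JBossWS code path.
public class RoundTripGrowth {

    // Returns the UTF-8 byte length of the text after the given number of
    // misread-and-re-encode passes.
    static int byteLengthAfter(String text, Charset wrongCharset, int passes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < passes; i++) {
            String misread = new String(bytes, wrongCharset);
            bytes = misread.getBytes(StandardCharsets.UTF_8);
        }
        return bytes.length;
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        for (int passes = 0; passes <= 2; passes++) {
            System.out.println(passes + " passes: "
                    + byteLengthAfter("\u00E7", cp1252, passes) + " bytes");
        }
    }
}
```

 For the single character "ç" this gives 2, 4 and 8 bytes after zero, one and two passes.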
 I would suggest attaching a reader to the StreamSource source instance var so that it
keeps track of its encoding, but that might break things elsewhere ... 
-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
