New subject: [JBoss JIRA] Commented: (JBWS-1716) Erroneous UTF-8 character encoding when marshalling on machines with non-UTF-8 default encoding

Thursday, 21 June 2007

Erroneous UTF-8 character encoding when marshalling on machines with non-UTF-8 default
encoding
-----------------------------------------------------------------------------------------------

                 Key: JBWS-1716
                 URL: http://jira.jboss.com/jira/browse/JBWS-1716
             Project: JBoss Web Services
          Issue Type: Bug
      Security Level: Public (Everyone can see)
    Affects Versions: jbossws-1.2.1
            Reporter: floe fliep

When a client request includes a non-ASCII character such as the "ç" in "Français"
and is sent from a machine whose default character encoding is not UTF-8, the
character is encoded incorrectly. For example, when the machine's default character set
is MS1252 (Windows), the "ç" is marshalled onto the network stream as 0xC3 0x83
0xC2 0xA7 instead of the valid UTF-8 sequence 0xC3 0xA7.
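The byte sequences reported above can be reproduced outside JBossWS. The following is a minimal sketch, not JBossWS code: it assumes the corruption comes from decoding valid UTF-8 bytes as windows-1252 and then encoding the result as UTF-8 again. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Decode UTF-8 bytes with the wrong charset, then re-encode as UTF-8.
    // This mimics a Reader that uses the platform default on a UTF-8 stream.
    static byte[] misdecodeAndReencode(byte[] utf8Bytes, Charset wrongCharset) {
        String misread = new String(utf8Bytes, wrongCharset);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as "0xC3 0xA7" for comparison with the report.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] correct = "\u00E7".getBytes(StandardCharsets.UTF_8); // "ç"
        byte[] corrupted =
                misdecodeAndReencode(correct, Charset.forName("windows-1252"));
        System.out.println(toHex(correct));   // 0xC3 0xA7
        System.out.println(toHex(corrupted)); // 0xC3 0x83 0xC2 0xA7
    }
}
```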

A workaround is to set the system property file.encoding=utf-8, but this causes as many
problems elsewhere as it fixes (especially in the case of legacy platform-specific file
reading).

A forum post very likely describes the same phenomenon:
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4030510#...

After some good hours of stepping through the JBossWS code, I discovered what I guess must
be the culprit in the method XMLFragment.writeSourceInternal(Writer writer):
            ....
            if (reader == null)
                  reader = new InputStreamReader(streamSource.getInputStream());

Here streamSource.getInputStream() is an already UTF-8 encoded stream. However, an
InputStreamReader created around it without an explicit charset uses the machine's
default character encoding, so the bytes of the UTF-8 stream are interpreted in a
different encoding, resulting in corrupted data.
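One way to avoid depending on the platform default is to pass the charset to the InputStreamReader constructor explicitly. The sketch below is an illustration only, not the actual JBossWS fix: it assumes the stream is known to be UTF-8 (real code would have to take the encoding from the message or the XML declaration), and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {

    // Read the whole stream as text using the given charset,
    // never the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "Fran\u00E7ais".getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8: the text survives.
        String ok = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        // Simulates a machine whose default encoding is windows-1252.
        String bad = readAll(new ByteArrayInputStream(utf8),
                Charset.forName("windows-1252"));

        System.out.println(ok.equals("Fran\u00E7ais"));  // true
        System.out.println(bad.equals("Fran\u00E7ais")); // false
    }
}
```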

Each time the data passes through marshalling, more corruption is added, so the number
of wrong characters grows as data is passed back and forth.
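The growth can be illustrated by repeating the mis-decoding step several times and recording the byte length after each pass. This is a sketch under the assumption that every pass decodes the UTF-8 bytes as windows-1252 and re-encodes them as UTF-8; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CompoundingCorruption {

    // Returns the encoded length before any pass (index 0)
    // and after each of the given number of faulty passes.
    static int[] lengthsAfterPasses(String text, int passes) {
        Charset wrong = Charset.forName("windows-1252");
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] lengths = new int[passes + 1];
        lengths[0] = bytes.length;
        for (int i = 1; i <= passes; i++) {
            // Misread UTF-8 bytes as windows-1252, then write UTF-8 again.
            bytes = new String(bytes, wrong).getBytes(StandardCharsets.UTF_8);
            lengths[i] = bytes.length;
        }
        return lengths;
    }

    public static void main(String[] args) {
        for (int len : lengthsAfterPasses("\u00E7", 3)) {
            System.out.println(len);
        }
    }
}
```

For a single "ç", the first two faulty passes grow the encoded form from 2 bytes to 4 and then to 8.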

I would suggest attaching a reader to the StreamSource source instance variable so that
it keeps track of its encoding, but that might break things elsewhere ...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
