[jboss-jira] [JBoss JIRA] (JBRULES-2936) Importing decision table from Excel: Non Ascii chars should not be corrupted
Jesper S. Møller (Issue Comment Edited) (JIRA)
jira-events at lists.jboss.org
Mon Jan 2 21:46:09 EST 2012
[ https://issues.jboss.org/browse/JBRULES-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653417#comment-12653417 ]
Jesper S. Møller edited comment on JBRULES-2936 at 1/2/12 9:44 PM:
-------------------------------------------------------------------
I was bitten by this too, but a bit of digging solved the mystery:
JXL gets this wrong, at least for BIFF8 files (binary Excel 97 files). The strings in these files are not environment-dependent at all.
Excel stores all strings in the Shared String Table (SST) record, and always as Unicode. For strings where all the high-order bytes are 0x00, Excel uses a compressed format that leaves the high-order bytes out. JXL wrongly decodes those strings as though they were MBCS, using file.encoding (or jxl.encoding, if set). This is very wrong. (POI gets it right, BTW.)
Hardcoding "jxl.encoding" to ISO-8859-1 fixes JXL for you, for BIFF8 and up. This works because the Unicode code points U+0000 - U+00FF are exactly ISO-8859-1.
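To illustrate why ISO-8859-1 is the right choice here, the following is a minimal sketch (plain JDK, not JXL or POI code) that expands a "compressed" string by restoring the implied 0x00 high-order bytes and compares the result with two decodings. The class and method names are made up for this example, and the byte array is a hand-written sample, not data read from a real SST record.

```java
import java.nio.charset.StandardCharsets;

class CompressedSstDemo {
    // Expand a "compressed" string: one byte per character,
    // with the high-order byte of each UTF-16 code unit implied to be 0x00.
    static String expandCompressed(byte[] lowBytes) {
        char[] chars = new char[lowBytes.length];
        for (int i = 0; i < lowBytes.length; i++) {
            chars[i] = (char) (lowBytes[i] & 0xFF);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Hand-written sample: "Møller" with 'ø' (U+00F8) stored as the single byte 0xF8.
        byte[] raw = {0x4D, (byte) 0xF8, 0x6C, 0x6C, 0x65, 0x72};

        String expected = expandCompressed(raw);
        String viaLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String viaUtf8 = new String(raw, StandardCharsets.UTF_8);

        // ISO-8859-1 maps each byte to the code point with the same value.
        System.out.println("ISO-8859-1 matches: " + expected.equals(viaLatin1));
        // 0xF8 is not a valid UTF-8 sequence here, so the character is corrupted.
        System.out.println("UTF-8 matches: " + expected.equals(viaUtf8));
    }
}
```

UTF-8 stands in for "whatever file.encoding happens to be"; any default encoding that disagrees with ISO-8859-1 on the bytes in question would corrupt the text in a similar way.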
For versions prior to Excel 97 (BIFF8), string values were stored in LABEL records, encoded according to the CODEPAGE record (which is in fact stored in the file). I don't see either POI or JXL doing this quite right when reading an Excel 95 file, but it would require a "codepage number -> Java encoding" table, which is hard work.
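As a rough idea of what such a "codepage number -> Java encoding" table could look like, here is a small sketch. It is not taken from JXL or POI; the class name, the method name and the handful of entries are assumptions chosen for illustration, and a real table would need many more code pages. Since not every JDK ships every charset, the lookup falls back to a caller-supplied default.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

class CodepageTable {
    // Illustrative subset only: code page number -> Java charset name.
    private static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        NAMES.put(367, "US-ASCII");
        NAMES.put(437, "IBM437");
        NAMES.put(850, "IBM850");
        NAMES.put(1200, "UTF-16LE");
        NAMES.put(1250, "windows-1250");
        NAMES.put(1251, "windows-1251");
        NAMES.put(1252, "windows-1252");
        NAMES.put(10000, "x-MacRoman");
    }

    // Returns the charset for a code page number, or the fallback when the
    // number is unknown or the running JDK does not provide that charset.
    static Charset forCodepage(int codepage, Charset fallback) {
        String name = NAMES.get(codepage);
        if (name != null && Charset.isSupported(name)) {
            return Charset.forName(name);
        }
        return fallback;
    }
}
```

A reader would call something like `CodepageTable.forCodepage(value, StandardCharsets.ISO_8859_1)` with the value from the CODEPAGE record and use the result to decode the LABEL bytes.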
Additionally, the JXL version I tried had a different bug in reading Excel95 files (containing WRITEACCESS records), so that's likely not a very popular file format.
Yeah, yeah, one day I'll do "The OSS Right Thing" and produce a proper patch for JXL and perhaps POI.
For now, set the system property (System.setProperty("jxl.encoding", "ISO-8859-1");) and enjoy your decision tables!
Note how this problem is unlikely to hit you if you
A) Primarily use English text
B) Use an 8-bit "file.encoding" which has all the characters you need in the same place as ISO-8859-1 (i.e. Windows for most Europeans)
Diversity matters! (I say from Denmark running Mac OS X)
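Conditions A and B above can be checked for a given text and encoding with a few lines of plain Java. This is only a sketch: the class and method names are hypothetical, and it simulates the misdecoding by taking the ISO-8859-1 bytes of the text (which equal the compressed low bytes) and decoding them with the charset under test.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Agreement {
    // True if decoding the compressed bytes of `text` with `cs` gives back `text`,
    // i.e. `cs` has every character of `text` in the same place as ISO-8859-1.
    // Text with characters above U+00FF always yields false; such strings
    // would not be stored in the compressed form in the first place.
    static boolean agreesWithLatin1(Charset cs, String text) {
        byte[] compressed = text.getBytes(StandardCharsets.ISO_8859_1);
        String decoded = new String(compressed, cs);
        return decoded.equals(text);
    }

    public static void main(String[] args) {
        Charset defaultCs = Charset.defaultCharset();
        // Pure ASCII text survives under most default encodings (condition A).
        System.out.println(agreesWithLatin1(defaultCs, "Decision table"));
        // Danish text only survives if the default encoding matches ISO-8859-1 here (condition B).
        System.out.println(agreesWithLatin1(defaultCs, "R\u00F8dgr\u00F8d med fl\u00F8de"));
    }
}
```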
was (Author: jespersm):
I was bitten by this too, but a bit of digging solved the mystery:
JXL gets this wrong, at least for BIFF8 files (binary Excel 97 files). They store all strings in the Shared String Table (SST), and always as Unicode. For strings where all the high-order-bytes are 0x00, they use a compressed format, where they leave all the high-order bytes out. JXL wrongly tries to tackle those as though they were MBCS using the system.encoding (or jxl.encoding). This is very wrong. POI gets it right. Hardcoding jxl.encoding to ISO-8859-1 (a.k.a. Windows-1252) fixes JXL for you - for BIFF8 and up. This is because code points U+00 - U+FF is exactly ISO-8859-1.
For versions prior to Excel 97 (BIFF8), string values were stored in LABEL followed the codepage record which is stored in the file. I don't see neither POI or JXL doing this right when reading an Excel 95 file, but it requires a "codepage number -> Java encoding" table to get right, which is hard work.
But the JXL version I tried had a different bug in reading Excel95 files (containing WRITEACCESS records, which I guess are common), so that's likely not a problem at all.
Yeah, yeah, one day I'll do "The OSS Right Thing" and produce a proper patch for JXL and POI.
For now, set the system properties (System.setProperty("jxl.encoding", "ISO-8859-1");) and enjoy your decision tables!
Note how this problem is unlikely to hit you if you
A) Primarily use English text
B) Use an 8-bit "file.encoding" which has all the characters you need in the same place as ISO-8859-1 (i.e. Windows for most Europeans)
Diversity matters! (I say from Denmark running Mac OS X)
> Importing decision table from Excel: Non Ascii chars should not be corrupted
> ----------------------------------------------------------------------------
>
> Key: JBRULES-2936
> URL: https://issues.jboss.org/browse/JBRULES-2936
> Project: Drools
> Issue Type: Bug
> Security Level: Public(Everyone can see)
> Reporter: Geoffrey De Smet
> Fix For: 5.4.0.Beta2
>
>
> see
> http://stackoverflow.com/questions/5298748/guvnor-rules-encoding
> Excel (like Windows) probably has crappy encoding standardization (as in none at all), so I suspect that we'll need to ask the Excel document what encoding (or even what locale) it is in and read the data in that encoding.