[jboss-jira] [JBoss JIRA] (JBRULES-2936) Importing decision table from Excel: Non Ascii chars should not be corrupted

Jesper S. Møller (Commented) (JIRA) jira-events at lists.jboss.org
Mon Jan 2 21:40:09 EST 2012


    [ https://issues.jboss.org/browse/JBRULES-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653417#comment-12653417 ] 

Jesper S. Møller commented on JBRULES-2936:
-------------------------------------------

I was bitten by this too, but a bit of digging solved the mystery:

JXL gets this wrong, at least for BIFF8 files (binary Excel 97 files). BIFF8 stores all strings in the Shared String Table (SST), and always as Unicode. For strings where all the high-order bytes are 0x00, it uses a compressed format that leaves the high-order bytes out. JXL wrongly treats those strings as though they were MBCS, decoding them with the platform default encoding (file.encoding) or with jxl.encoding. This is very wrong. POI gets it right. Hardcoding jxl.encoding to ISO-8859-1 fixes JXL for you, for BIFF8 and up. This works because code points U+0000 to U+00FF correspond exactly to the ISO-8859-1 byte values 0x00 to 0xFF. (Note that ISO-8859-1 is not quite the same thing as Windows-1252, which assigns different characters to 0x80-0x9F.)
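
To make the compressed-string point concrete, here is a minimal sketch in plain Java (no JXL or POI involved). The class and method names are made up for illustration, and UTF-8 only stands in for "some default encoding that is not ISO-8859-1". It expands one-byte-per-character data by restoring the omitted 0x00 high bytes, and compares that with decoding the same bytes as ISO-8859-1 and as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not code taken from JXL or POI.
final class Biff8CompressedStrings {

    // Expand a "compressed" string: each byte is the low byte of a UTF-16
    // code unit whose high byte (0x00) was left out of the file.
    static String decodeCompressed(byte[] lowBytes) {
        char[] chars = new char[lowBytes.length];
        for (int i = 0; i < lowBytes.length; i++) {
            chars[i] = (char) (lowBytes[i] & 0xFF); // mask to avoid sign extension
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // "Møller" with the high bytes stripped; ø is U+00F8.
        byte[] raw = {0x4D, (byte) 0xF8, 0x6C, 0x6C, 0x65, 0x72};

        String expanded = decodeCompressed(raw);
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);

        // ISO-8859-1 decoding agrees with the proper expansion...
        System.out.println("ISO-8859-1 matches: " + expanded.equals(asLatin1));
        // ...while decoding with another charset does not.
        System.out.println("UTF-8 matches:      " + expanded.equals(asUtf8));
    }
}
```

The same mismatch is what you would expect with any default encoding that disagrees with ISO-8859-1 in the 0x80-0xFF range.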

For versions prior to Excel 97 (BIFF8), string values were stored in LABEL records, encoded according to the CODEPAGE record stored in the file. I don't see either POI or JXL doing this right when reading an Excel 95 file, but getting it right requires a "codepage number -> Java encoding" table, which is hard work.
But the JXL version I tried had a different bug in reading Excel 95 files (those containing WRITEACCESS records, which I guess are common), so the codepage issue is likely moot in practice.
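
For what it's worth, the "codepage number -> Java encoding" table could start out something like the sketch below. Everything here is an assumption: the class name is hypothetical, the handful of entries are common Windows/DOS/Mac code page numbers paired with the Java charset names I believe correspond to them, and the list is nowhere near complete. It would need checking against the Excel file format documentation before anyone relied on it.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.Optional;

// Illustration only: a partial, unverified mapping; not part of JXL or POI.
final class CodepageCharsets {

    // Assumed code page number -> Java charset name pairs (incomplete).
    private static final Map<Integer, String> NAMES = Map.of(
            437, "IBM437",
            850, "IBM850",
            1200, "UTF-16LE",
            1250, "windows-1250",
            1251, "windows-1251",
            1252, "windows-1252",
            10000, "x-MacRoman");

    // Returns the charset for a code page number, or empty if the number is
    // not in the table or this JVM does not ship the charset.
    static Optional<Charset> forCodepage(int codepage) {
        String name = NAMES.get(codepage);
        if (name == null || !Charset.isSupported(name)) {
            return Optional.empty();
        }
        return Optional.of(Charset.forName(name));
    }

    public static void main(String[] args) {
        System.out.println(forCodepage(1252)); // depends on the JVM's installed charsets
        System.out.println(forCodepage(-1));   // not in the table
    }
}
```

Returning an Optional rather than falling back to a default keeps the caller from silently decoding with the wrong charset, which is the bug this whole issue is about.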

Yeah, yeah, one day I'll do "The OSS Right Thing" and produce a proper patch for JXL and POI.
For now, set the system property (System.setProperty("jxl.encoding", "ISO-8859-1");) and enjoy your decision tables!
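
A minimal sketch of where that call might go, assuming it has to run before JXL first opens the spreadsheet (I have not checked exactly when JXL reads the property). The class name is hypothetical and the actual decision table loading is left out.

```java
// Illustration only: hypothetical wrapper around the workaround above.
final class JxlEncodingWorkaround {

    static final String PROPERTY = "jxl.encoding";
    static final String LATIN1 = "ISO-8859-1";

    // Sets the property and returns the previous value (null if it was unset),
    // so a caller can restore it afterwards if it wants to.
    static String apply() {
        return System.setProperty(PROPERTY, LATIN1);
    }

    public static void main(String[] args) {
        String previous = apply();
        System.out.println(PROPERTY + " was " + previous
                + ", now " + System.getProperty(PROPERTY));
        // ... load the decision table here, after the property is set ...
    }
}
```

Passing -Djxl.encoding=ISO-8859-1 on the JVM command line should have the same effect without touching any code.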

Note how this problem is unlikely to hit you if you:
A) Primarily use English text, or
B) Use an 8-bit "file.encoding" which has all the characters you need in the same place as ISO-8859-1 (i.e. Windows for most Europeans)

Diversity matters! (I say from Denmark running Mac OS X)
                
> Importing decision table from Excel: Non Ascii chars should not be corrupted
> ----------------------------------------------------------------------------
>
>                 Key: JBRULES-2936
>                 URL: https://issues.jboss.org/browse/JBRULES-2936
>             Project: Drools
>          Issue Type: Bug
>      Security Level: Public(Everyone can see) 
>            Reporter: Geoffrey De Smet
>             Fix For: 5.4.0.Beta2
>
>
> see
> http://stackoverflow.com/questions/5298748/guvnor-rules-encoding
> Excel (like Windows) probably has crappy encoding standardization (as in none at all), so I suspect that we'll need to ask the Excel document what encoding (or even what locale) it is and read the data in that encoding.
