[Design of Messaging on JBoss (Messaging/JBoss)] - To UTF-8 or not to UTF-8 that is the question - jboss-dev-forums

Thursday, 8 May 2008

JBM 1.4 uses UTF-8 encoding for all strings sent in messages, e.g. properties, text
message bodies etc.

This provides good compression if you use higher Unicode characters a lot (e.g. Chinese),
however the Java UTF-8 encoding is *really slow*.

For JBM 2.0 we're currently using the SimpleString class I wrote (which doesn't copy
itself at the drop of a hat like String) and we marshal it as a simple sequence of
bytes.
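
To make "a simple sequence of bytes" concrete, here is a minimal sketch of writing each Java char as two bytes and reading it back. The class and method names (TwoByteMarshalSketch, marshal, unmarshal) are hypothetical and are not the actual SimpleString API; the byte order (high byte first) is also just an assumption for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not the real JBM SimpleString class.
final class TwoByteMarshalSketch {

    // Write each Java char (one UTF-16 code unit) as two bytes, high byte first.
    static byte[] marshal(CharSequence s) {
        byte[] out = new byte[s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >>> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }

    // Rebuild the string by pairing the bytes back into chars.
    static String unmarshal(byte[] data) {
        char[] chars = new char[data.length / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (char) (((data[2 * i] & 0xFF) << 8) | (data[2 * i + 1] & 0xFF));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] wire = marshal("JBM");
        System.out.println(Arrays.toString(wire));
        System.out.println(unmarshal(wire));
    }
}
```

There is no per-character branching here, which is presumably where the speed comes from compared with a variable-length encoder.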

In my tests this is about 40 times faster than UTF-8 encoding the same string. :)

Problem is SimpleString currently only stores each character as two bytes, which is fine
for the vast majority of Unicode characters but won't encode the far reaches of
Unicode which require 4 bytes.
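
For reference, a small sketch of how Java itself treats such characters: a code point above U+FFFF does not fit in one char and is held as two chars (a surrogate pair). The class name SupplementaryCharSketch and the sample code points are only illustrative. Whether a two-byte-per-character store copes with these depends on whether it treats its two-byte units as UTF-16 code units, which is an assumption I am not making about SimpleString here.

```java
// Illustration only: shows how many Java chars a single code point occupies.
final class SupplementaryCharSketch {

    // Number of Java chars (UTF-16 code units) needed for one code point.
    static int codeUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int bmp = 0x4E2D;     // a CJK character, inside the 16-bit range
        int beyond = 0x1F600; // a code point above U+FFFF
        String s = new String(Character.toChars(beyond));

        System.out.println(codeUnits(bmp));    // chars needed for the CJK character
        System.out.println(codeUnits(beyond)); // chars needed for the higher code point
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code point");
    }
}
```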

I can change SimpleString to use 4 bytes per character but this is going to make the
marshalled form big - especially in the case of standard Latin or European characters -
about 4 times the size of the UTF-8 encoded form!
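
The size trade-off can be checked with a quick sketch. EncodedSizeSketch and its method names are hypothetical; the two-byte and four-byte figures are simply computed from the string length rather than from any real marshaller.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: compares marshalled sizes under three fixed or variable schemes.
final class EncodedSizeSketch {

    // Size when encoded with Java's built-in UTF-8 encoder.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size when every Java char is written as two bytes.
    static int twoByteSize(String s) {
        return s.length() * 2;
    }

    // Size when every code point is written as four bytes.
    static int fourByteSize(String s) {
        return s.codePointCount(0, s.length()) * 4;
    }

    public static void main(String[] args) {
        String latin = "hello";
        String chinese = "\u4e2d\u6587";
        for (String s : new String[] {latin, chinese}) {
            System.out.println(utf8Size(s) + " / " + twoByteSize(s) + " / " + fourByteSize(s));
        }
    }
}
```

For "hello" this gives 5, 10 and 20 bytes respectively, which is the factor of 4 mentioned above. For the two Chinese characters it gives 6, 4 and 8 bytes.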

How do you think we should deal with this?

One possibility is we write our own UTF-like encoding implementation...
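
As a rough idea of what a home-grown UTF-like encoding could look like, here is one possible sketch: each code point is written in 7-bit groups, low group first, with the top bit of a byte meaning "more bytes follow". This is just one hypothetical scheme (the name VarLengthCodecSketch and the format are made up for this post), it does no validation of malformed input, and I have not measured whether it is any faster than Java's UTF-8 encoder.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical variable-length codec for illustration; not a proposal for the JBM wire format.
final class VarLengthCodecSketch {

    // Encode each code point as 1 to 3 bytes: 7 payload bits per byte,
    // top bit set on every byte except the last one of a code point.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            while (cp >= 0x80) {
                out.write((cp & 0x7F) | 0x80);
                cp >>>= 7;
            }
            out.write(cp);
        }
        return out.toByteArray();
    }

    // Reverse of encode: accumulate 7-bit groups until a byte without the top bit.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int cp = 0;
        int shift = 0;
        for (byte b : data) {
            cp |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
            } else {
                sb.appendCodePoint(cp);
                cp = 0;
                shift = 0;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "JBM \u4e2d " + new String(Character.toChars(0x1F600));
        byte[] wire = encode(s);
        System.out.println(wire.length + " bytes, round trip ok: " + decode(wire).equals(s));
    }
}
```

With this scheme plain ASCII stays at one byte per character and nothing needs more than three bytes, at the cost of a branch per byte on both encode and decode.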

View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4149433#...

Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&a...
