On 10 October 2010 23:41, Michael Neale <michael.neale(a)gmail.com> wrote:
> I think you should clean-room implement it (or reuse some old code of yours
> if it is safe to do so). From what I have seen of the algorithm, it isn't
> huge, and it would make sense to have it re-implemented. As an alternative,
> consider taking a look at the MVEL soundex code and rewriting that, and
> we will see if we can get it upstream.
I have just re-implemented this following the algorithm described at
http://en.wikipedia.org/wiki/Soundex
I also consulted a CPAN module to work out which variant the MVEL
implementation was meant to follow, but that cannot be determined
(possibly due to omissions or bugs).
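
For reference, a minimal Java sketch of the American Soundex rules as described on that Wikipedia page might look like the following. The class name SoundexSketch is hypothetical, and skipping non-ASCII letters and returning "" for input without letters are assumptions made for illustration; this is not the code under discussion.

```java
// Hypothetical sketch of American Soundex; not the Drools or MVEL code.
final class SoundexSketch {
    // Codes for 'A'..'Z': '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    private SoundexSketch() { }

    static String encode(String word) {
        if (word == null) {
            return "";               // assumption: never return null
        }
        StringBuilder out = new StringBuilder(4);
        char last = 0;               // code of the previous significant letter
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue;            // assumption: ignore anything outside A-Z
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);       // the first letter is kept verbatim
                last = code;
            } else if (code != '-') {
                // Vowels reset 'last' so that equal codes on both sides of a
                // vowel are both emitted; H and W are skipped entirely, so
                // equal codes around them collapse into one.
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code;
            }
        }
        if (out.length() == 0) {
            return "";               // no encodable letters at all
        }
        while (out.length() < 4) {
            out.append('0');         // pad to one letter plus three digits
        }
        return out.toString();
    }
}
```

With this sketch, SoundexSketch.encode("Robert") and SoundexSketch.encode("Rupert") both give "R163", and an empty word gives "" rather than null.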
> I would say it is just slightly neglected - it's not well known that it
> lives there. Using the MVEL one was just opportunistic for Drools.
> I didn't know that it could return null; that is bad. I guess if it is
> null, that would mean that you just do a literal case-insensitive compare?
A correct implementation never returns null. For an empty word it might,
but for our purposes "" would be preferable.
> Also - AFAIK - soundex is only for English, right?
Certainly.
> Is there an equivalent for other languages?
Soundex is coarse even for English. I've found the atrocious example that
the Soundex for "Britney Spears" is the same as for
"bewährten Superzicke" (~ "proven super-b*").
NYSIIS <http://en.wikipedia.org/wiki/New_York_State_Identification_and_...>
is supposed to be better.
For German, there is an equivalent: "Kölner Phonetik". It might
make sense to provide this for an operator "soundex[de]". (All of
/M[ae][iy]e?r/ sound alike in German, and all exist as proper names.)
I have also found one link to an implementation adapted for French.
Soundex is aimed at the pronunciation of proper names. There might be some
leeway for that even in a language like Hungarian, which is pronounced
exactly as written.
I think Drools should drop the MVEL version and go for a flexible approach,
possibly even something better than Soundex/NARA for English. I'll research
this some more, and report back before I commit anything ;-)
-W
> If so, perhaps having it in the drools codebase makes sense
> and opens the way for people to plug in their own soundex.
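
To make the "plug in your own" idea concrete, here is a rough Java sketch of a registry that selects a phonetic encoder by language tag, in the spirit of an operator like "soundex[de]". Everything in it is hypothetical (the PhoneticRegistry class, its method names, and the fallback policy); it is not an existing Drools or MVEL API. It assumes that an encoder may return null or "" for a word it cannot encode, in which case the comparison falls back to a literal case-insensitive compare.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry of pluggable phonetic encoders, keyed by language tag.
final class PhoneticRegistry {
    private final Map<String, Function<String, String>> encoders =
            new ConcurrentHashMap<>();

    // Register an encoder under a tag such as "en" or "de".
    void register(String languageTag, Function<String, String> encoder) {
        encoders.put(languageTag.toLowerCase(Locale.ROOT), encoder);
    }

    // True if both words map to the same phonetic code for the given language.
    boolean soundsLike(String languageTag, String left, String right) {
        Function<String, String> encoder =
                encoders.get(languageTag.toLowerCase(Locale.ROOT));
        if (encoder == null) {
            throw new IllegalArgumentException("no encoder for " + languageTag);
        }
        String a = codeOf(encoder, left);
        String b = codeOf(encoder, right);
        if (a.isEmpty() || b.isEmpty()) {
            // Assumed policy: without a usable code, compare the words literally.
            return left != null && left.equalsIgnoreCase(right);
        }
        return a.equals(b);
    }

    // Normalise null input and null results to "" so callers never see null.
    private static String codeOf(Function<String, String> encoder, String word) {
        if (word == null) {
            return "";
        }
        String code = encoder.apply(word);
        return code == null ? "" : code;
    }
}
```

A caller would register one encoder per language, for example a Soundex function under "en" and a Kölner Phonetik function under "de", and the operator would only ever see a non-null code or the literal fallback.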
On Mon, Oct 11, 2010 at 2:54 AM, Wolfgang Laun <wolfgang.laun(a)gmail.com>
wrote:
>
> The implementation of "soundslike" is broken in more than one respect.
> The conversion of a word to a Soundex string is provided by
> org.mvel2.util.Soundex.
> (.) There are words where Soundex.soundex returns null, so that the
> calling code, in Drools, crashes with an NPE.
> (.) The algorithm implemented in Soundex is erroneous. I'm not sure which
> Soundex algorithm it is supposed to implement, but it just doesn't meet
> the basic requirements.
>
> I have written a correct implementation of the National Archives and
> Records Administration (NARA) rule set, the official variant of
> Soundex used by the U.S. Government.
>
> Do we wait for MVEL to correct this bug, or do we just replace it with a
> correct implementation?
>
> Regards
> Wolfgang