On 10 October 2010 23:41, Michael Neale <michael.neale@gmail.com> wrote:
> I think you should clean room implement it (or reuse some old code of yours
> if it is safe to do so). From what I have seen of the algorithm - it isn't
> huge - and it would make sense to have it re-implemented. As an alternative
> - consider taking a look at the MVEL soundex code and rewriting that - and
> we will see if we can make it upstream.
I just re-implemented this according to the algorithm I found in
http://en.wikipedia.org/wiki/Soundex
I've also consulted a CPAN module, to learn what was intended by the
MVEL implementation, but it's undecidable (possibly due to omissions or
bugs).
> I would say it is just slightly
> neglected - it's not well known that it lives there. Using the MVEL one was
> just opportunistic for drools.
> I didn't know that it could return null, that is bad. I guess if it is null
> - that would mean that you just do a literal case insensitive compare?
A correct implementation never returns null. An empty word might, but for
our purpose "" would be preferable.
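
For reference, a minimal sketch of the American Soundex rules as described on the Wikipedia page above. This is not the code I intend to commit; the class name SoundexSketch and the choice to skip non-ASCII letters are assumptions made for illustration. It returns "" rather than null when the input contains no letters.

```java
public final class SoundexSketch {
    // Codes for 'A'..'Z': digits for consonants, '0' for vowels and Y,
    // '-' for H and W (which do not separate equal codes).
    private static final String CODES = "0123012-02245501262301-202";

    private SoundexSketch() {}

    /** Returns a 4-character Soundex key, or "" if the word has no ASCII letters. */
    public static String soundex(String word) {
        if (word == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(4);
        char last = '0'; // code of the most recent letter that was not H or W
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // assumption: anything outside A-Z is ignored
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);   // the first letter is kept as is
                last = code;     // but its code still suppresses a repeat
            } else if (code == '-') {
                // H or W: skipped, and 'last' is left untouched
            } else if (code == '0') {
                last = '0';      // a vowel separates equal codes
            } else {
                if (code != last) {
                    out.append(code);
                }
                last = code;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');     // pad short keys with zeros
        }
        return out.toString();
    }
}
```

With this sketch, "Robert" and "Rupert" both map to R163, and "Ashcraft" maps to A261 because the H does not separate the S from the C.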
> Also - AFAIK - soundex is only for english right?
Certainly.
> Is there an equivalent for other languages?
Soundex is coarse even for English. I've found the atrocious example that
the Soundex for "Britney Spears" is the same as for
"bewährten Superzicke" (~ "proven super-b*"). NYSIIS is supposed
to be better.
For German, there is an equivalent: "Kölner Phonetik". It might
make sense to provide this for an operator "soundex[de]". (All of
/M[ae][iy]e?r/ sound alike in German, and all exist as proper names.)
I have also found one link to an implementation adapted for French.
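
To make the name pattern quoted above concrete, here is a small sketch using java.util.regex. The class name MeierVariants is made up for this example, and the check only illustrates the set of spellings; it is not an implementation of Kölner Phonetik.

```java
import java.util.regex.Pattern;

public final class MeierVariants {
    // Expands to eight spellings: Mair, Maier, Mayr, Mayer, Meir, Meier, Meyr, Meyer.
    private static final Pattern MEIER = Pattern.compile("M[ae][iy]e?r");

    private MeierVariants() {}

    /** True if the whole name is one of the spellings covered by the pattern. */
    public static boolean isVariant(String name) {
        return name != null && MEIER.matcher(name).matches();
    }
}
```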
Soundex is aimed at the pronunciation of proper names. There might be some
leeway for that even in a language like Hungarian, which is pronounced exactly
as written.
I think Drools should drop the MVEL version and go for a flexible approach,
possibly even something better than Soundex/NARA for English. I'll research this
some more, and report back before I commit anything ;-)
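
Just to illustrate what "flexible" could mean, here is a rough sketch of a registry of encoders keyed by language code. Nothing like this exists in Drools or MVEL today; PhoneticRegistry, register and soundsLike are hypothetical names. The fallback to a case-insensitive literal compare when an encoder yields an empty key follows Michael's suggestion above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

public final class PhoneticRegistry {
    // Hypothetical: one encoder per language code, e.g. "en" or "de".
    private final Map<String, UnaryOperator<String>> encoders =
            new ConcurrentHashMap<>();

    public void register(String lang, UnaryOperator<String> encoder) {
        encoders.put(lang.toLowerCase(Locale.ROOT), encoder);
    }

    public boolean soundsLike(String lang, String a, String b) {
        UnaryOperator<String> enc = encoders.get(lang.toLowerCase(Locale.ROOT));
        if (enc == null) {
            throw new IllegalArgumentException("no encoder for: " + lang);
        }
        if (a == null || b == null) {
            return false;
        }
        String ka = enc.apply(a);
        String kb = enc.apply(b);
        // An empty key means the encoder could not code the word;
        // fall back to a literal, case-insensitive comparison.
        if (ka == null || kb == null || ka.isEmpty() || kb.isEmpty()) {
            return a.equalsIgnoreCase(b);
        }
        return ka.equals(kb);
    }
}
```

An operator such as "soundex[de]" would then only need to look up the encoder registered under "de".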
-W
> If so, perhaps having it in the drools codebase makes sense
> and opens the way for people to plug in their own soundex.
> On Mon, Oct 11, 2010 at 2:54 AM, Wolfgang Laun <wolfgang.laun@gmail.com>
> wrote:
>>
>> The implementation of "soundslike" is broken in more than one respect.
>> The conversion of a word to a Soundex string is provided by
>> org.mvel2.util.Soundex.
>> (.) There are words where Soundex.soundex returns null, so that the
>> calling code, in Drools, crashes with an NPE.
>> (.) The algorithm implemented in Soundex is erroneous. I'm not sure which
>> Soundex algorithm it is supposed to implement, but it just doesn't meet the
>> basic requirements.
>>
>> I have implemented, correctly, the version for the National Archives and
>> Records Administration (NARA) rule set for the official implementation of
>> Soundex used by the U.S. Government.
>>
>> Do we wait for MVEL to correct this bug, or do we just replace it with a
>> correct implementation?
>>
>> Regards
>> Wolfgang