[rules-dev] soundslike broken

Michael Neale michael.neale at gmail.com
Sun Oct 10 23:57:54 EDT 2010


Great. If it shifts away from the "fixed" Soundex, we should probably try
to find out who is using it, to ensure there are no surprises. I can't
imagine it is widely used.

On Mon, Oct 11, 2010 at 2:43 PM, Wolfgang Laun <wolfgang.laun at gmail.com> wrote:

> On 10 October 2010 23:41, Michael Neale <michael.neale at gmail.com> wrote:
> > I think you should clean room implement it (or reuse some old code of
> > yours if it is safe to do so). From what I have seen of the algorithm -
> > it isn't huge - and it would make sense to have it re-implemented. As an
> > alternative - consider taking a look at the MVEL soundex code and
> > rewriting that - and we will see if we can make it upstream.
>
> I just re-implemented this according to the algorithm I found at
> http://en.wikipedia.org/wiki/Soundex
> I've also consulted a CPAN module, to learn what the MVEL implementation
> was intended to do, but that cannot be determined (possibly due to
> omissions or bugs).
>
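
For illustration, here is a minimal sketch of the American Soundex rules as
described on the Wikipedia page cited above. It is not the code discussed in
this thread; the class and method names are hypothetical, and the handling of
non-ASCII letters (treated like vowels) and of empty input (returns "") are
assumptions made for the sketch.

```java
import java.util.Locale;

final class SoundexSketch {
    // Code for each letter A..Z: '0' = vowel or Y (separates equal codes),
    // '-' = H or W (skipped, does not separate equal codes).
    private static final String CODES = "0123012-02245501262301-202";

    /** Returns a 4-character code, or "" (never null) if there is no letter. */
    static String encode(String word) {
        if (word == null) {
            return "";
        }
        String s = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // assumption: punctuation and digits are ignored
            }
            // Assumption: letters outside A..Z behave like vowels.
            char code = (c >= 'A' && c <= 'Z') ? CODES.charAt(c - 'A') : '0';
            if (out.length() == 0) {
                out.append(c);      // the first letter is kept as is
                last = code;
            } else if (code != '-') {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code;        // a vowel resets, a consonant is remembered
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');        // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"}) {
            System.out.println(w + " -> " + encode(w));
        }
    }
}
```

With these rules "Robert" and "Rupert" share a code, which is the behaviour
the Wikipedia article uses as its basic example.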
>
> > I would say it is just slightly neglected - it's not well known that it
> > lives there. Using the MVEL one was just opportunistic for drools.
> > I didn't know that it could return null, that is bad. I guess if it is
> > null - that would mean that you just do a literal case insensitive
> > compare?
>
> A correct implementation never returns null for a non-empty word. For an
> empty word it might, but for our purpose "" would be preferable.
>
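
To make the null handling concrete, here is a rough sketch of a comparison
that never dereferences a null code. It combines the two ideas above: a null
code is normalised to "", and when either code is empty the comparison falls
back to a literal case-insensitive compare. The class name, the method and
the encoder parameter are hypothetical; this is not the Drools evaluator.

```java
import java.util.function.Function;

final class SoundsLikeSketch {
    /**
     * Compares two words by phonetic code. The encoder may be any
     * word-to-code function, including one that returns null.
     */
    static boolean soundsLike(String left, String right,
                              Function<String, String> encoder) {
        if (left == null || right == null) {
            return false; // assumption: a null operand never matches
        }
        String l = normalise(encoder.apply(left));
        String r = normalise(encoder.apply(right));
        if (l.isEmpty() || r.isEmpty()) {
            // No usable code: fall back to a literal, case-insensitive compare.
            return left.equalsIgnoreCase(right);
        }
        return l.equals(r);
    }

    // Treat a null code like the empty code so callers never see an NPE.
    private static String normalise(String code) {
        return code == null ? "" : code;
    }
}
```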
>
> > Also - AFAIK - Soundex is only for English, right?
> Certainly.
>
>
> > Is there an equivalent for other languages?
> Soundex is coarse even for English. I've found the atrocious example that
> the Soundex for "Britney Spears" is the same as for
> "bewährten Superzicke" (~ "proven super-b*"). NYSIIS
> <http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System>
> is supposed to be better.
>
> For German, there is an equivalent: "Kölner Phonetik". It might
> make sense to provide this for an operator "soundex[de]". (All of
> /M[ae][iy]e?r/ sound alike in German, and all exist as proper names.)
>
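
As a small worked example, the pattern quoted above can be tried directly
with java.util.regex. This only illustrates which spellings the pattern
covers; it is not an implementation of "Kölner Phonetik", and the class and
method names are made up for the example.

```java
import java.util.regex.Pattern;

final class MeierVariantsSketch {
    // The pattern from the text: M, then a or e, then i or y, optional e, then r.
    private static final Pattern MEIER = Pattern.compile("M[ae][iy]e?r");

    static boolean isMeierVariant(String name) {
        return MEIER.matcher(name).matches(); // whole-string match
    }

    public static void main(String[] args) {
        String[] names = {"Maier", "Mayer", "Meier", "Meyer",
                          "Mair", "Mayr", "Meir", "Meyr", "Maurer"};
        for (String n : names) {
            System.out.println(n + " -> " + isMeierVariant(n));
        }
    }
}
```

The pattern admits eight spellings (two choices, two choices, optional "e").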
> I have also found one link to an implementation adapted for French.
>
> Soundex is aimed at the pronunciation of proper names. There might be some
> leeway for that even in a language like Hungarian, which is pronounced
> exactly as written.
>
> I think Drools should drop the MVEL version and go for a flexible approach,
> possibly even something better than Soundex/NARA for English. I'll research
> this some more, and report back before I commit anything ;-)
>
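
One way such a flexible approach might look is a registry of phonetic
encoders keyed by language, so that something like "soundex[de]" could pick
a German encoder. The sketch below is purely hypothetical: none of these
names exist in Drools or MVEL, and the encoders are supplied by the caller.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class PhoneticRegistrySketch {
    // language tag (lower case) -> function mapping a word to its phonetic code
    private final Map<String, Function<String, String>> encoders = new HashMap<>();

    void register(String language, Function<String, String> encoder) {
        encoders.put(language.toLowerCase(Locale.ROOT), encoder);
    }

    boolean soundsLike(String language, String left, String right) {
        Function<String, String> encoder =
                encoders.get(language.toLowerCase(Locale.ROOT));
        if (encoder == null) {
            throw new IllegalArgumentException(
                    "no phonetic encoder registered for: " + language);
        }
        // Assumes the registered encoder never returns null.
        return encoder.apply(left).equals(encoder.apply(right));
    }
}
```

Users could then register their own encoder per language without touching
the operator itself.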
> -W
>
>
>
> > If so, perhaps having it in the drools codebase makes sense
> > and opens the way for people to plug in their own soundex.
> > On Mon, Oct 11, 2010 at 2:54 AM, Wolfgang Laun <wolfgang.laun at gmail.com>
> > wrote:
> >>
> >> The implementation of "soundslike" is broken in more than one respect.
> >> The conversion of a word to a Soundex string is provided by
> >> org.mvel2.util.Soundex.
> >> (.) There are words where Soundex.soundex returns null, so that the
> >> calling code, in Drools, crashes with an NPE.
> >> (.) The algorithm implemented in Soundex is erroneous. I'm not sure
> >> which Soundex algorithm it is supposed to implement, but it just doesn't
> >> meet the basic requirements.
> >>
> >> I have implemented, correctly, the version for the National Archives and
> >> Records Administration (NARA) rule set for the official implementation
> >> of Soundex used by the U.S. Government.
> >>
> >> Do we wait for MVEL to correct this bug, or do we just replace it with a
> >> correct implementation?
> >>
> >> Regards
> >> Wolfgang
>


-- 
Michael D Neale
home: www.michaelneale.net
blog: michaelneale.blogspot.com