[rules-dev] soundslike broken

Wolfgang Laun wolfgang.laun at gmail.com
Mon Oct 11 00:07:43 EDT 2010


On 11 October 2010 05:57, Michael Neale <michael.neale at gmail.com> wrote:

> great - I guess if it shifts away from "fixed" soundex - probably should
> try and find out who is using it to ensure there are no surprises. I can't
> imagine it is widely used.
>

Neither do I - you should have seen some complaints, then.
-W



>
> On Mon, Oct 11, 2010 at 2:43 PM, Wolfgang Laun <wolfgang.laun at gmail.com> wrote:
>
>> On 10 October 2010 23:41, Michael Neale <michael.neale at gmail.com> wrote:
>> > I think you should clean room implement it (or reuse some old code of yours
>> > if it is safe to do so). From what I have seen of the algorithm - it isn't
>> > huge - and it would make sense to have it re-implemented. As an alternative
>> > - consider taking a look at the MVEL soundex code and rewriting that - and
>> > we will see if we can make it upstream.
>>
>> I just re-implemented this according to the algorithm I found in
>>  http://en.wikipedia.org/wiki/Soundex
>> I've also consulted a CPAN module, to learn what was intended by the
>> MVEL implementation, but it's undecidable (possibly due to omissions or
>> bugs).
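
For reference, a minimal sketch of American Soundex along the lines of the
Wikipedia description cited above. It is not the code discussed in this
thread, neither the MVEL version nor the re-implementation. The class name
SoundexSketch is hypothetical, and skipping characters outside A-Z as well
as returning "" for null or empty input are assumptions made for illustration.
H and W are treated as transparent between letters with equal codes, while
vowels and Y act as separators.

```java
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that get no digit
    // (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    // Returns a letter followed by three digits, or "" if the input
    // contains no ASCII letter (assumption: never return null).
    static String encode(String word) {
        if (word == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(4);
        char last = '0'; // code of the previous significant letter
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // assumption: ignore anything outside A-Z
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);   // the first letter is kept as a letter
                last = code;     // but its code still suppresses a repeat
            } else if (c == 'H' || c == 'W') {
                // transparent: does not separate equal codes
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code;     // a vowel resets 'last' to '0'
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');     // pad short codes with zeros
        }
        return out.toString();
    }
}
```

With this sketch, encode("Robert") and encode("Rupert") both give R163,
and encode("Ashcraft") gives A261 because the H does not separate S and C.
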
>>
>>
>> > I would say it is just slightly
>> > neglected - it's not well known that it lives there. Using the MVEL one was
>> > just opportunistic for Drools.
>> > I didn't know that it could return null, that is bad. I guess if it is null
>> > - that would mean that you just do a literal case insensitive compare?
>>
>> A correct implementation never returns null. An empty word might, but for
>> our purpose "" would be preferable.
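
A small sketch of how a caller could guard against an encoder that returns
null, mapping both a null word and a null code to "". The class and method
names are hypothetical, and the encoder is passed in as a plain function so
that no particular Soundex class is assumed. Note one consequence of this
choice: two words that both map to "" compare as sounding alike.

```java
import java.util.function.Function;

final class NullSafeSoundsLike {
    // Maps a null word or a null code to "" so callers never see null.
    static String safeCode(Function<String, String> encoder, String word) {
        if (word == null) {
            return "";
        }
        String code = encoder.apply(word);
        return code == null ? "" : code;
    }

    // Two words sound alike when their (null-safe) codes are equal.
    static boolean soundsLike(Function<String, String> encoder,
                              String a, String b) {
        return safeCode(encoder, a).equals(safeCode(encoder, b));
    }
}
```
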
>>
>>
>> > Also - AFAIK - soundex is only for english right?
>> Certainly.
>>
>>
>> > Is there an equivalent for other languages?
>> Soundex is coarse even for English. I've found the atrocious example that
>> the Soundex for "Britney Spears" is the same as for
>> "bewährten Superzicke" (~ "proven super-b*"). NYSIIS
>> <http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System>
>> is supposed to be better.
>>
>> For German, there is an equivalent: "Kölner Phonetik". It might
>> make sense to provide this for an operator "soundex[de]". (All of
>> /M[ae][iy]e?r/ sound alike in German, and all exist as proper names.)
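
To make the pattern above concrete, here is a short sketch that checks
names against it with java.util.regex. The class name MeierVariants is
hypothetical. It only shows which spellings the pattern covers and is not
an implementation of Kölner Phonetik.

```java
import java.util.regex.Pattern;

final class MeierVariants {
    // The pattern quoted above: M, then a or e, then i or y,
    // then an optional e, then r.
    private static final Pattern MEIER = Pattern.compile("M[ae][iy]e?r");

    // True only if the whole name matches the pattern.
    static boolean isVariant(String name) {
        return MEIER.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"Maier", "Mayer", "Meier", "Meyer",
                          "Mair", "Mayr", "Meir", "Meyr", "Mueller"};
        for (String name : names) {
            System.out.println(name + " -> " + isVariant(name));
        }
    }
}
```

The first eight names match; "Mueller" does not.
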
>>
>> I have also found one link to an implementation adapted for French.
>>
>> Soundex is aimed at the pronunciation of proper names. There might be some
>> leeway for that even in a language like Hungarian, which is pronounced exactly
>> as written.
>>
>> I think Drools should drop the MVEL version and go for a flexible approach,
>> possibly even something better than Soundex/NARA for English. I'll research
>> this some more, and report back before I commit anything ;-)
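
One way such a flexible approach could look, purely as a sketch: a registry
that maps a language tag such as "en" or "de" to a phonetic encoder, so
that something like "soundex[de]" could pick the German one. All names here
are hypothetical and nothing below reflects an actual Drools or MVEL API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

final class PhoneticRegistry {
    // Language tag (lower case) -> function turning a word into its code.
    private final Map<String, Function<String, String>> encoders =
            new HashMap<>();

    void register(String language, Function<String, String> encoder) {
        encoders.put(language.toLowerCase(Locale.ROOT), encoder);
    }

    // Compares two words with the encoder registered for the language.
    boolean soundsLike(String language, String a, String b) {
        Function<String, String> encoder =
                encoders.get(language.toLowerCase(Locale.ROOT));
        if (encoder == null) {
            throw new IllegalArgumentException(
                    "no phonetic encoder registered for: " + language);
        }
        // Objects.equals tolerates encoders that return null.
        return Objects.equals(encoder.apply(a), encoder.apply(b));
    }
}
```

A caller would register one encoder per language up front and fail loudly
for a language that has none, rather than silently falling back to English.
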
>>
>> -W
>>
>>
>>
>> > If so, perhaps having it in the drools codebase makes sense
>> > and opens the way for people to plug in their own soundex.
>> > On Mon, Oct 11, 2010 at 2:54 AM, Wolfgang Laun <wolfgang.laun at gmail.com>
>> > wrote:
>> >>
>> >> The implementation of "soundslike" is broken in more than one respect.
>> >> The conversion of a word to a Soundex string is provided by
>> >> org.mvel2.util.Soundex.
>> >> (.) There are words where Soundex.soundex returns null, so that the
>> >> calling code, in Drools, crashes with an NPE.
>> >> (.) The algorithm implemented in Soundex is erroneous. I'm not sure which
>> >> Soundex algorithm it is supposed to implement, but it just doesn't meet the
>> >> basic requirements.
>> >>
>> >> I have implemented, correctly, the version for the National Archives and
>> >> Records Administration (NARA) rule set for the official implementation of
>> >> Soundex used by the U.S. Government.
>> >>
>> >> Do we wait for MVEL to correct this bug, or do we just replace it with a
>> >> correct implementation?
>> >>
>> >> Regards
>> >> Wolfgang
>>
>>
>> _______________________________________________
>> rules-dev mailing list
>> rules-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/rules-dev
>>
>>
>
>
> --
> Michael D Neale
> home: www.michaelneale.net
> blog: michaelneale.blogspot.com
>