Great. I guess if it shifts away from the "fixed" soundex, we should probably try to find out who is using it to ensure there are no surprises. I can't imagine it is widely used.

On Mon, Oct 11, 2010 at 2:43 PM, Wolfgang Laun <wolfgang.laun@gmail.com> wrote:
On 10 October 2010 23:41, Michael Neale <michael.neale@gmail.com> wrote:
> I think you should clean room implement it (or reuse some old code of yours
> if it is safe to do so). From what I have seen of the algorithm - it isn't
> huge - and it would make sense to have it re-implemented. As an alternative
> - consider taking a look at the MVEL soundex code and rewriting that - and
> we will see if we can make it upstream.

I just re-implemented this according to the algorithm I found in
 http://en.wikipedia.org/wiki/Soundex
I've also consulted a CPAN module to learn what the MVEL implementation
was intended to do, but that cannot be determined (possibly due to
omissions or bugs).
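
For reference, the rules described on that Wikipedia page (the NARA variant) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that went into Drools: the class name SoundexSketch is made up, characters outside A-Z are simply skipped, and a word without any letters yields "" rather than null.

```java
final class SoundexSketch {
    // One code per letter A..Z: digits for consonant groups,
    // '0' for vowels and Y, '-' for H and W (which do not separate duplicates).
    private static final String CODES = "0123012-02245501262301-202";

    static String soundex(String word) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        if (word != null) {
            for (int i = 0; i < word.length() && out.length() < 4; i++) {
                char c = Character.toUpperCase(word.charAt(i));
                if (c < 'A' || c > 'Z') {
                    continue; // assumption: ignore anything that is not A-Z
                }
                char code = CODES.charAt(c - 'A');
                if (out.length() == 0) {
                    out.append(c); // keep the first letter as is
                    last = code;   // its code still suppresses a following duplicate
                } else if (code != '-') {
                    // Vowels reset 'last', so equal codes around a vowel are coded twice;
                    // H and W are skipped entirely, so equal codes around them collapse.
                    if (code != '0' && code != last) {
                        out.append(code);
                    }
                    last = code;
                }
            }
        }
        if (out.length() == 0) {
            return ""; // never null, even for empty input
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter plus three digits
        }
        return out.toString();
    }
}
```

With these rules, "Robert" and "Rupert" both map to R163, "Tymczak" maps to T522, and "Ashcraft" maps to A261.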


> I would say it is just slightly
> neglected - it's not well known that it lives there. Using the MVEL one was
> just opportunistic for Drools.
> I didn't know that it could return null; that is bad. I guess if it is null
> - that would mean that you just do a literal case-insensitive compare?

A correct implementation never returns null. An empty word might, but for
our purpose "" would be preferable.


> Also - AFAIK - Soundex is only for English, right?
Certainly.


> Is there an equivalent for other languages?
Soundex is coarse even for English. I've found the atrocious example that
the Soundex for "Britney Spears" is the same as for
"bewährten Superzicke" (~ "proven super-b*"). NYSIIS is supposed
to be better.

For German, there is an equivalent: "Kölner Phonetik". It might
make sense to provide this for an operator "soundex[de]". (All of
/M[ae][iy]e?r/ sound alike in German, and all exist as proper names.)
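
The name pattern above can be tried directly with java.util.regex. A minimal sketch; the class and method names are made up for illustration, and it only tests spellings, not pronunciation:

```java
import java.util.regex.Pattern;

final class MeierPattern {
    // The pattern quoted above: M, then a or e, then i or y, an optional e, then r.
    private static final Pattern MEIER = Pattern.compile("M[ae][iy]e?r");

    // True if the whole name is one of the spellings covered by the pattern.
    static boolean isMeierVariant(String name) {
        return MEIER.matcher(name).matches();
    }
}
```

This accepts Meier, Maier, Meyer, Mayer, Mair, and Mayr, among others, and rejects unrelated names such as Miller.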

I have also found one link to an implementation adapted for French.

Soundex is aimed at the pronunciation of proper names. There might be some
leeway for that even in a language like Hungarian, which is pronounced exactly
as written.

I think Drools should drop the MVEL version and go for a flexible approach,
possibly even something better than Soundex/NARA for English. I'll research this
some more, and report back before I commit anything ;-)

-W



> If so, perhaps having it in the drools codebase makes sense
> and opens the way for people to plug in their own soundex. 
> On Mon, Oct 11, 2010 at 2:54 AM, Wolfgang Laun <wolfgang.laun@gmail.com>
> wrote:
>>
>> The implementation of "soundslike" is broken in more than one respect.
>> The conversion of a word to a Soundex string is provided by
>> org.mvel2.util.Soundex.
>> (.) There are words where Soundex.soundex returns null, so that the
>> calling code, in Drools, crashes with a NPE.
>> (.) The algorithm implemented in Soundex is erroneous. I'm not sure which
>> Soundex algorithm it is supposed to implement, but it just doesn't meet the
>> basic requirements.
>>
>> I have implemented, correctly, the version for the National Archives and
>> Records Administration (NARA) rule set for the official implementation of
>> Soundex used by the U.S. Government.
>>
>> Do we wait for MVEL to correct this bug, or do we just replace it with a
>> correct implementation?
>>
>> Regards
>> Wolfgang


_______________________________________________
rules-dev mailing list
rules-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/rules-dev




--
Michael D Neale
home: www.michaelneale.net
blog: michaelneale.blogspot.com