[rules-dev] soundslike: report on phonetic matching

Mark Proctor mproctor at codehaus.org
Mon Oct 18 14:39:30 EDT 2010


  Sorry, the team is away at a conference at the moment, so responses are slow.

Evaluators are pluggable, so even if something isn't added "out of the 
box" with the core product, there is no reason why it can't be put up in 
an incubator project, say at Google Code, for people to download as a 
jar and have it "just work". We can then let it incubate over time and 
eventually include it by default.

Wolfgang, seems you have a lot of ideas. It's common for people to 
create "personal" project areas on Google Code, where they upload 
fragments of documents and code, as a sort of landing pad/scratch pad 
for their ideas. That way everyone can go to one place to get access to 
your ideas and code, and that makes it easier for us to incubate, monitor 
and include over time. Especially as sometimes the core team will move 
slower than the productivity output of a community member, such as 
yourself :)

The other aspect is to fix all of this at the MVEL level.

Mark
On 15/10/2010 13:33, Wolfgang Laun wrote:
> As promised, here's my report on investigating Soundex and related 
> algorithms.
>
> (1) MVEL2 has a utility returning a "key" String from a "word" String 
> that is close to what the original Soundex algorithm is supposed to 
> return. Not being exactly the same doesn't matter much as long as you 
> compare results from the same algorithm.
>
> (2) The National Archives and Records Administration has issued a 
> modified Soundex algorithm, which is supposedly slightly better.
>
> (3) Then, there is the New York State Identification and Intelligence 
> System (NYSIIS) Phonetic Encoder, which is reported to be 2.7% better 
> than Soundex.
>
> (4) A modified version of NYSIIS has also been defined, for both of 
> them see http://www.dropby.com/NYSIIS.html. (I have some doubts 
> whether this page reflects a correct implementation of the original 
> NYSIIS algorithm; e.g., "Bahr" returns "B", which can't be correct.)
>
> All of the above are only useful for English pronunciations of proper 
> names.
>
> (5) For German, there is something called "Kölner Phonetik".
>
> I have implemented (1), (2), (3) and (5) in a class 
> SoundsLikeEvaluatorDefinition implements EvaluatorDefinition, 
> implementing the operator soundsLike (note the capital 'L') in the 
> variants soundsLike or soundsLike[orig] for (1), soundsLike[us] for 
> (2), soundsLike[ny] for (3) and soundsLike[de] for (5).
>
> All of this has been an interesting (for me) exercise, but I really 
> don't know whether any of this should go into Drools. (There is the 
> issue of fixing an NPE with the current implementation that calls the 
> MVEL2 code, though.)
>
> It's up to you, Team, to vote on this; I can contribute the 
> aforementioned class, with "soundsLike" replaced by "soundslike" as a 
> replacement for the current implementation (not requiring the MVEL2 
> utility) with the option of using the various operator parameters.
>
> Cheers
> Wolfgang
>
>
> _______________________________________________
> rules-dev mailing list
> rules-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/rules-dev
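
For anyone following the thread, items (1) and (2) of the quoted report can be illustrated with a minimal Java sketch of a Soundex key. This is not the MVEL2 utility and not Wolfgang's SoundsLikeEvaluatorDefinition; the names SoundexSketch, soundex, soundsLike and hwTransparent are made up for illustration. It assumes that "orig" means the simplified variant where H and W break a run of equal codes just like vowels do, and that "us" means the NARA rule as I understand it, where H and W do not break such a run. The "ny" (NYSIIS) and "de" (Kölner Phonetik) variants are not covered here.

```java
import java.util.Locale;

// Illustrative sketch only; class and method names are hypothetical.
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that are never coded
    // (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    // Returns a 4-character key (letter plus three digits), or "" if the
    // word is null or contains no ASCII letters.
    static String soundex(String word, boolean hwTransparent) {
        if (word == null) {
            return ""; // null-safe on purpose, so callers cannot hit an NPE here
        }
        String s = word.toUpperCase(Locale.ROOT);
        StringBuilder key = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < s.length() && key.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // skip anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (key.length() == 0) {
                key.append(c); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code != '0' && code != last) {
                key.append(code);
            }
            // With hwTransparent, H and W leave the previous code in place,
            // so equal codes on both sides of them are coded only once.
            if (!(hwTransparent && (c == 'H' || c == 'W'))) {
                last = code;
            }
        }
        while (key.length() > 0 && key.length() < 4) {
            key.append('0'); // pad short keys with zeros
        }
        return key.toString();
    }

    // Compares two words by key; only "orig" and "us" are sketched here.
    static boolean soundsLike(String a, String b, String variant) {
        if (a == null || b == null) {
            return false;
        }
        boolean hw;
        switch (variant) {
            case "orig": hw = false; break;
            case "us":   hw = true;  break;
            default:
                throw new IllegalArgumentException(
                        "variant not covered by this sketch: " + variant);
        }
        String ka = soundex(a, hw);
        return !ka.isEmpty() && ka.equals(soundex(b, hw));
    }
}
```

With this sketch, "Robert" and "Rupert" both map to R163 in either variant, while "Ashcraft" gives A226 with hwTransparent off and A261 with it on. That difference is why keys should only be compared when they come from the same variant, as the report points out.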
