[rules-users] Text Mining and Drools

Daniel Souza danieldsouza15 at gmail.com
Sun Dec 23 14:54:42 EST 2012


Hi Drools Team,

I'm interested in combining Drools with text mining.

I have the following process in mind:

Step 1:
> Tokenization:
   * split the text into a vector of word tokens;
   * remove punctuation and split hyphenated words into two tokens.
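
A minimal sketch of Step 1 in plain Java, outside Drools. The class name
HitTokenizer is hypothetical, and treating every character that is not a
letter or digit as a separator is an assumption about what should count as
punctuation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Drools.
final class HitTokenizer {

    // Splits on any run of characters that is neither a letter nor a digit.
    // Punctuation is dropped and a hyphenated word yields two tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // a leading separator produces one empty piece
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

With this sketch, HitTokenizer.tokenize("Pre-mRNA splicing factor, cwf8")
gives the tokens Pre, mRNA, splicing, factor, cwf8.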

Step 2:
> Filtering:
   * Using a prebuilt list of the most common words, remove those words from
the vector of word tokens;
   * Assign a sub-vector of synonyms to each token, if possible.
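
Step 2 could be sketched as below (Java 9 or later). The class name
TokenFilter is hypothetical, and both the common-word list and the synonym
table are placeholders for illustration only; real ones would have to be
built from the annotation data:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Drools.
final class TokenFilter {

    // Placeholder list of common words (assumption, not a curated list).
    static final Set<String> COMMON = Set.of("factor", "protein", "like");

    // Placeholder synonym table keyed by lower-case token (assumption).
    static final Map<String, List<String>> SYNONYMS =
            Map.of("mrna", List.of("messenger RNA"));

    // Returns the tokens that are not in the common-word list, ignoring case.
    static List<String> removeCommon(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!COMMON.contains(t.toLowerCase(Locale.ROOT))) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Returns the synonym sub-vector of a token, or an empty list if none is known.
    static List<String> synonymsOf(String token) {
        return SYNONYMS.getOrDefault(token.toLowerCase(Locale.ROOT),
                Collections.emptyList());
    }
}
```

For example, removeCommon applied to the tokens transcription, factor, HAP3
keeps transcription and HAP3.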

Step 3:
> The knowledge of interest for text mining. (This is still hazy in my mind.)

*The problem:
What I want to do in this step is to match TEXT A against TEXT B for protein
function annotation.*
--------------------------------------------------------------------------------------------
Consider the four situations below:

sequence: >Contig737
hit1: Pre mRNA splicing factor cwf8
hit2: cell control protein cwf8

*Match: hit1: cwf8 <> hit2: cwf8*
Note that the least common word is the most important one: a match on it
should count as a match for the whole hit.

sequence: >Contig1170
hit1: splicing coactivator SRm300 like
hit2: pre mRNA splicing factor CWC21

*Match: hit1: splicing <> hit2: splicing*
In this case "splicing" may or may not be relevant; it depends on the
biologist's knowledge.

sequence: >Contig1431
hit1: transcription factor HAP3
hit2: pre mRNA splicing factor ATP dependent RNA helicase PRP43

*Match: hit1: factor <> hit2: factor*
In this case "factor" is a common word, so I should filter it out too.

sequence: >Contig56
hit1: Phosphoribosylpyrophosphate synthetase
hit2: ribose phosphate pyrophosphokinase

*Match: hit1: Phosphoribosylpyrophosphate <> substring-hit2: phosphate*
In this case "phosphate" matches as a substring of
Phosphoribosylpyrophosphate. This match is relevant enough to count as a
match for the whole hit.
-------------------------------------------------------------

Now I'm thinking about how to evaluate the tokens of hit1 against those of
hit2. What I did before was to match the vector with fewer tokens against the
vector with more tokens, ignoring case. Each token was matched either as a
whole string or as a substring.
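
The matching described above could look roughly like this. It is only a
sketch: the class name TokenMatcher and the "=" / "~" labels for whole and
substring matches are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Drools.
final class TokenMatcher {

    // Compares every token of the shorter vector with every token of the
    // longer one, ignoring case. "x = y" marks a whole-string match and
    // "x ~ y" marks a match where one token is a substring of the other.
    static List<String> match(List<String> a, List<String> b) {
        List<String> shorter = a.size() <= b.size() ? a : b;
        List<String> longer = a.size() <= b.size() ? b : a;
        List<String> matches = new ArrayList<>();
        for (String s : shorter) {
            String sl = s.toLowerCase(Locale.ROOT);
            for (String l : longer) {
                String ll = l.toLowerCase(Locale.ROOT);
                if (sl.equals(ll)) {
                    matches.add(s + " = " + l);
                } else if (sl.contains(ll) || ll.contains(sl)) {
                    matches.add(s + " ~ " + l);
                }
            }
        }
        return matches;
    }
}
```

For the Contig56 hits this reports the single match
"Phosphoribosylpyrophosphate ~ phosphate". Very short tokens would match as
substrings almost everywhere, so a minimum token length is probably needed.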

I searched for material on text mining in the biomedical field, and the
information I found is confusing.
One interesting thing I found is the Soundex algorithm, which offers another
way to measure similarity between two texts. I read somewhere in the Drools
docs that Soundex can be used with Drools through an external library or
class.
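
For reference, a self-contained sketch of classic American Soundex in Java.
The class name SimpleSoundex is hypothetical; a maintained implementation
(for example the Soundex class in Apache Commons Codec) would be the safer
choice in practice:

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Drools.
final class SimpleSoundex {

    // Codes for A..Z: '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    // Returns the four-character Soundex code, or "" if there are no letters.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // digits and other characters are ignored
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code == '-') {
                continue; // H and W do not separate equal codes
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeated code counts again
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Both "Robert" and "Rupert" encode to R163. Soundex was designed for English
surnames and ignores digits, so identifiers that differ only in their digits
get the same code; it may therefore suit protein names poorly.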

I don't know if there is something implemented in Drools to handle text
inside rules.

What I plan to do is build a model (still hazy) to evaluate the match
between hit1 and hit2: a model that scores matches. The most relevant
word can have a high weight.
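
One possible starting point for such a model, sketched in Java: weight each
token by how rare it is among the hits seen so far (an inverse document
frequency, which is my assumption about what "most relevant" could mean) and
score a pair of hits by the summed weights of their shared tokens. The class
name MatchScorer and its methods are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Drools.
final class MatchScorer {

    private final Map<String, Integer> hitsContaining = new HashMap<>();
    private int hits = 0;

    // Records one hit description so that token frequencies can be counted.
    void observe(List<String> tokens) {
        hits++;
        Set<String> seen = new HashSet<>();
        for (String t : tokens) {
            String key = t.toLowerCase(Locale.ROOT);
            if (seen.add(key)) { // count each token once per hit
                hitsContaining.merge(key, 1, Integer::sum);
            }
        }
    }

    // log(hits / hits containing the token): rare tokens weigh more, and a
    // token present in every hit weighs 0. Tokens never observed get 0 here.
    double weight(String token) {
        int df = hitsContaining.getOrDefault(token.toLowerCase(Locale.ROOT), 0);
        return df == 0 ? 0.0 : Math.log((double) hits / df);
    }

    // Sum of the weights of tokens present in both hits (whole-string,
    // case-insensitive); each shared token is counted once.
    double score(List<String> a, List<String> b) {
        Set<String> inB = new HashSet<>();
        for (String t : b) {
            inB.add(t.toLowerCase(Locale.ROOT));
        }
        Set<String> counted = new HashSet<>();
        double total = 0.0;
        for (String t : a) {
            String key = t.toLowerCase(Locale.ROOT);
            if (inB.contains(key) && counted.add(key)) {
                total += weight(key);
            }
        }
        return total;
    }
}
```

After observing the hits of Contig737 and Contig1431, the shared token cwf8
scores higher than the shared token factor, because cwf8 occurs in fewer
hits. Substring matches and biologist-supplied weights are not covered by
this sketch.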

If anyone has a suggestion, it would be much appreciated.

Regards,
Daniel Souza



--
View this message in context: http://drools.46999.n3.nabble.com/Text-Mining-and-Drools-tp4021290.html
Sent from the Drools: User forum mailing list archive at Nabble.com.

