[hibernate-dev] providing AnalyzerUtils in Hibernate Search

Sanne Grinovero sanne at hibernate.org
Thu May 12 11:31:32 EDT 2011


2011/5/12 Emmanuel Bernard <emmanuel at hibernate.org>:
> Sorry, I replied too fast; I see the use case.
> Do you think proper logging on our part (of the analyzer work) could solve the issue without people having to use AnalyzerUtils?

Yes, actually some people on the forum were asking how to enable
logging of tokenization, but currently we pass only the document and
the analyzer instances to the IndexWriter, and Lucene has no logging
(unless we want to use its infostream, redirecting it somehow to a
proper logger).

We could of course "fake it" by doing some work in the area, basically
re-tokenizing the input an additional time if log.isDebugEnabled; it
would be easier if we were going to use the "pre-analyze input"
approach I mentioned, even though my idea was mainly to offload work
from the master and reduce I/O to the backend.
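
A minimal sketch of the "fake it" idea, under loud assumptions: it uses
only the JDK, a plain function stands in for the real Lucene analyzer, and
every name in it (TokenizationDebugSketch, standInAnalyzer,
logTokensIfDebug) is hypothetical rather than existing Hibernate Search or
Lucene API. The only point is that the extra tokenization pass runs solely
when debug logging is enabled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: none of these names exist in Hibernate Search or Lucene.
final class TokenizationDebugSketch {

    static final Logger LOG = Logger.getLogger(TokenizationDebugSketch.class.getName());

    // Stand-in for a real analyzer: lowercases, then splits on anything
    // that is not a letter or a digit.
    static List<String> standInAnalyzer(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Meant to be called just before a field value is handed to the backend.
    // The analyzer is applied a second time only when debug (FINE) logging is
    // on, so the normal indexing path pays nothing for it.
    // Returns the logged tokens, or an empty list when debug logging is off.
    static List<String> logTokensIfDebug(String fieldName, String value,
                                         Function<String, List<String>> analyzer) {
        if (!LOG.isLoggable(Level.FINE)) {
            return Collections.emptyList();
        }
        List<String> tokens = analyzer.apply(value);
        LOG.fine("Field '" + fieldName + "' tokenized as " + tokens);
        return tokens;
    }
}
```

With a real analyzer the function would wrap the analyzer's token stream
instead of the string split used here.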

> For example I'm not a big fan of the fact that AnalyzerUtils plays at the string level whereas HSearch plays at the object property level.

Agreed, we could do better; at this point I'd only want to distribute
the tool to ease debugging of @AnalyzerDef, or just to help in
understanding how it works (you could try to figure out the output of
the AnalyzerDef posted below [1] - don't laugh).
But there's no strong need to distribute this, as the
hibernate-search-testing jar is available as well and contains it. We
could just add some directions to the documentation for now and think
of better logging with pre-analysis. I guess it could be a good
practice to have tests in any project doing some assertions on
expected analyzer output.
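
As a rough illustration of such a test, here is a JDK-only sketch. The
analyzer is again a stand-in (split, lowercase, drop a few stop words), and
the names AnalyzerOutputAssertionSketch, standInAnalyzer and assertTokens
are hypothetical; a real project would feed its own analyzer definition
through something like AnalyzerUtils instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: illustrative names, not Hibernate Search or Lucene API.
final class AnalyzerOutputAssertionSketch {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    // Stand-in for a configured analyzer: tokenize, lowercase, drop stop words.
    static List<String> standInAnalyzer(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split("[^\\p{L}\\p{N}]+")) {
            String token = part.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Fails with a readable message when the analyzer output differs
    // from the expected token sequence.
    static void assertTokens(Function<String, List<String>> analyzer,
                             String input, String... expected) {
        List<String> actual = analyzer.apply(input);
        if (!actual.equals(Arrays.asList(expected))) {
            throw new AssertionError("Input \"" + input + "\" produced " + actual
                    + " but expected " + Arrays.asList(expected));
        }
    }
}
```

A test would then contain lines such as
assertTokens(AnalyzerOutputAssertionSketch::standInAnalyzer, "The Art of Search", "art", "search"),
which breaks as soon as someone changes the filter chain in an unexpected way.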

Sanne

[1]
@AnalyzerDef(name = "entityAnalyser",
      tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
      filters = {
            @TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = StopFilterFactory.class),
            @TokenFilterDef(factory = SnowballPorterFilterFactory.class,
                  params = { @Parameter(name = "language", value = "French") }),
            @TokenFilterDef(factory = PhoneticFilterFactory.class,
                  params = { @Parameter(name = "encoder", value = "DoubleMetaphone") }),
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class,
                  params = { @Parameter(name = "minGramSize", value = "3"),
                             @Parameter(name = "maxGramSize", value = "3") })
      },
      charFilters = { @CharFilterDef(factory = HTMLStripCharFilterFactory.class) }
)


>
>
> On 12 mai 2011, at 16:54, Emmanuel Bernard wrote:
>
>> In which case do you recommend people to write something like that?
>> Is that related to any use of Hibernate Search?
>>
>> Also, it seems not all methods make sense as they are; for example, display pushing to the log is not likely to be something users want in this form.
>>
>> On 11 mai 2011, at 17:45, Sanne Grinovero wrote:
>>
>>> We're using AnalyzerUtils in some of the tests in Hibernate Search,
>>> but I'm finding myself recommending people to "write something like
>>> that" quite often lately; it seems that otherwise people have a hard
>>> time figuring out whether their analyzer definitions make any sense.
>>>
>>> What do you think about moving this class into the main jar, polishing
>>> the javadoc a bit and adding a note on how to check your analyzer in
>>> the docs?
>>>
>>> Sanne
>>> _______________________________________________
>>> hibernate-dev mailing list
>>> hibernate-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>
>
>


