[hibernate-commits] Hibernate SVN: r14945 - search/trunk/doc/reference/en/modules.

hibernate-commits at lists.jboss.org hibernate-commits at lists.jboss.org
Thu Jul 17 00:30:45 EDT 2008


Author: epbernard
Date: 2008-07-17 00:30:45 -0400 (Thu, 17 Jul 2008)
New Revision: 14945

Modified:
   search/trunk/doc/reference/en/modules/mapping.xml
Log:
Catch up on doc for HSearch 3.1.0.Beta1

Modified: search/trunk/doc/reference/en/modules/mapping.xml
===================================================================
--- search/trunk/doc/reference/en/modules/mapping.xml	2008-07-17 03:39:44 UTC (rev 14944)
+++ search/trunk/doc/reference/en/modules/mapping.xml	2008-07-17 04:30:45 UTC (rev 14945)
@@ -548,7 +548,223 @@
         the query for a given field.</para>
       </caution>
 
-      <para>analyzer searchFactory.getanalyzer()</para>
+      <section>
+        <title>Analyzer definitions</title>
+
+        <para>Analyzers can become quite complex to deal with. Hibernate
+        Search introduces the notion of analyzer definition. An analyzer
+        definition can be reused by many <classname>@Analyzer</classname>
+        declarations. An analyzer definition is composed of:</para>
+
+        <itemizedlist>
+          <listitem>
+            <para>a name: the unique string used to refer to the
+            definition</para>
+          </listitem>
+
+          <listitem>
+            <para>a tokenizer: a piece of code used to chunk the sentence into
+            individual words</para>
+          </listitem>
+
+          <listitem>
+            <para>a list of filters: each filter is responsible for removing,
+            modifying or sometimes adding words to the stream provided by the
+            tokenizer</para>
+          </listitem>
+        </itemizedlist>
+
+        <para>This separation of tasks (tokenizer, list of filters) allows
+        the reuse of each individual component and lets you build your ideal
+        analyzer in a very flexible way (just like Lego bricks). This
+        infrastructure is supported by the Solr analyzer framework. Make sure
+        to add <filename>apache-solr-*.jar</filename> to your classpath to use
+        analyzer definitions: this jar is included in the Hibernate Search
+        distribution and is a stripped down version of the Solr jar.</para>
+
+        <programlisting>@AnalyzerDef(name="customanalyzer",
+        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
+        filters = {
+                @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
+                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
+                @TokenFilterDef(factory = StopFilterFactory.class, params = {
+                    @Parameter(name="words", value= "org/hibernate/search/test/analyzer/solr/stoplist.properties" ),
+                    @Parameter(name="ignoreCase", value="true")
+                })
+})
+public class Team {
+    ...
+}</programlisting>
+
+        <para>A tokenizer is defined by its factory, which is responsible for
+        building the tokenizer and applying the optional list of parameters.
+        This example uses the standard tokenizer. A filter is likewise defined
+        by its factory, which is responsible for creating the filter instance
+        using the optional parameters. In our example, the StopFilter filter
+        is built by reading the dedicated stop words property file and is
+        expected to ignore case. The list of parameters depends on the
+        tokenizer or filter factory.</para>
+
+        <warning>
+          <para>Filters are applied in the order they are defined in the
+          <classname>@AnalyzerDef</classname> annotation. Make sure to think
+          twice about this order.</para>
+        </warning>
+
+        <para>Once defined, an analyzer definition can be reused by an
+        <classname>@Analyzer</classname> declaration using the definition name
+        rather than declaring an implementation class.</para>
+
+        <programlisting>@Entity
+@Indexed
+@AnalyzerDef(name="customanalyzer", ... )
+public class Team {
+    @Id
+    @DocumentId
+    @GeneratedValue
+    private Integer id;
+
+    @Field
+    private String name;
+
+    @Field
+    private String location;
+
+    @Field <emphasis role="bold">@Analyzer(definition = "customanalyzer")</emphasis>
+    private String description;
+}</programlisting>
+
+        <para>Analyzer instances declared by
+        <classname>@AnalyzerDef</classname> are available by their name in the
+        <classname>SearchFactory</classname>.</para>
+
+        <programlisting>Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer("customanalyzer");</programlisting>
+
+        <para>This is quite useful when building queries. Fields in queries
+        should be analyzed with the same analyzer used to index the field so
+        that they speak a common "language": the same tokens are reused
+        between the query and the indexing process. This rule has some
+        exceptions but holds true most of the time; respect it unless you
+        know what you are doing.</para>
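+
+        <para>For example, the retrieved analyzer can be passed to Lucene's
+        <classname>QueryParser</classname> when parsing the query string (an
+        illustrative sketch; the field name and query string are made
+        up):</para>
+
+        <programlisting>Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer("customanalyzer");
+QueryParser parser = new QueryParser("description", analyzer);
+org.apache.lucene.search.Query luceneQuery = parser.parse("Fast and flexible");</programlisting>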
+      </section>
+
+      <section>
+        <title>Available analyzers</title>
+
+        <para>Solr and Lucene come with a lot of useful default tokenizers and
+        filters. You can find a complete list of tokenizer factories and
+        filter factories at <ulink
+        url="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters">http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters</ulink>.
+        Let's check a few of them.</para>
+
+        <table>
+          <title>Some of the tokenizers available</title>
+
+          <tgroup cols="3">
+            <thead>
+              <row>
+                <entry align="center">Factory</entry>
+
+                <entry align="center">Description</entry>
+
+                <entry align="center">parameters</entry>
+              </row>
+            </thead>
+
+            <tbody>
+              <row>
+                <entry>StandardTokenizerFactory</entry>
+
+                <entry>Use the Lucene StandardTokenizer</entry>
+
+                <entry>none</entry>
+              </row>
+
+              <row>
+                <entry>HTMLStripStandardTokenizerFactory</entry>
+
+                <entry>Remove HTML tags, keep the text and pass it to a
+                StandardTokenizer</entry>
+
+                <entry>none</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+
+        <table>
+          <title>Some of the filters available</title>
+
+          <tgroup cols="3">
+            <thead>
+              <row>
+                <entry align="center">Factory</entry>
+
+                <entry align="center">Description</entry>
+
+                <entry align="center">parameters</entry>
+              </row>
+            </thead>
+
+            <tbody>
+              <row>
+                <entry>StandardFilterFactory</entry>
+
+                <entry>Remove dots from acronyms and 's from words</entry>
+
+                <entry>none</entry>
+              </row>
+
+              <row>
+                <entry>LowerCaseFilterFactory</entry>
+
+                <entry>Lowercase words</entry>
+
+                <entry>none</entry>
+              </row>
+
+              <row>
+                <entry>StopFilterFactory</entry>
+
+                <entry>Remove words (tokens) matching a list of stop
+                words</entry>
+
+                <entry><para><literal>words</literal>: points to a resource
+                file containing the stop words</para><para><literal>ignoreCase</literal>:
+                <literal>true</literal> if case should be ignored when
+                comparing stop words, <literal>false</literal>
+                otherwise</para></entry>
+              </row>
+
+              <row>
+                <entry>SnowballPorterFilterFactory</entry>
+
+                <entry>Reduces a word to its root in a given language (e.g.
+                protect, protects and protection share the same root). Using
+                such a filter allows searches to match related words.</entry>
+
+                <entry><para><literal>language</literal>: Danish, Dutch,
+                English, Finnish, French, German, Italian, Norwegian,
+                Portuguese, Russian, Spanish, Swedish and a few
+                more</para></entry>
+              </row>
+
+              <row>
+                <entry>ISOLatin1AccentFilterFactory</entry>
+
+                <entry>Remove accents for languages like French</entry>
+
+                <entry>none</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
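+
+        <para>As an illustration (a hypothetical definition, not part of the
+        example above), a stemming analyzer could combine the standard
+        tokenizer with <classname>SnowballPorterFilterFactory</classname>,
+        passing the language as a parameter:</para>
+
+        <programlisting>@AnalyzerDef(name="englishStemmer",
+        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
+        filters = {
+                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
+                @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = {
+                    @Parameter(name="language", value="English")
+                })
+})</programlisting>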
+
+        <para>Don't hesitate to check the implementations of
+        <classname>org.apache.solr.analysis.TokenizerFactory</classname> and
+        <classname>org.apache.solr.analysis.TokenFilterFactory</classname> in
+        your IDE to see which factories are available.</para>
+      </section>
     </section>
   </section>
 



