Author: hardy.ferentschik
Date: 2010-10-13 12:08:50 -0400 (Wed, 13 Oct 2010)
New Revision: 20828
Modified:
search/trunk/hibernate-search/src/main/docbook/en-US/modules/mapping.xml
Log:
HSEARCH-593 Updated documentation
Modified: search/trunk/hibernate-search/src/main/docbook/en-US/modules/mapping.xml
===================================================================
--- search/trunk/hibernate-search/src/main/docbook/en-US/modules/mapping.xml 2010-10-13
16:04:44 UTC (rev 20827)
+++ search/trunk/hibernate-search/src/main/docbook/en-US/modules/mapping.xml 2010-10-13
16:08:50 UTC (rev 20828)
@@ -25,7 +25,6 @@
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="search-mapping" revision="3">
-
<title>Mapping entities to the index structure</title>
<para>All the metadata information needed to index entities is described
@@ -679,11 +678,11 @@
<section>
<title>Analyzer definitions</title>
- <para>Analyzers can become quite complex to deal with for which reason
- Hibernate Search introduces the notion of analyzer definitions. An
+ <para>Analyzers can become quite complex to deal with. For this reason
+ Hibernate Search introduces the notion of analyzer definitions. An
analyzer definition can be reused by many
- <classname>@Analyzer</classname> declarations. An analyzer definition
- is composed of:
+ <classname>@Analyzer</classname> declarations and is composed
+ of:</para>
<itemizedlist>
<listitem>
@@ -713,23 +712,39 @@
<para>This separation of tasks - a list of char filters, and a
tokenizer followed by a list of filters - allows for easy reuse of
each individual component and lets you build your customized analyzer
- in a very flexible way (just like Lego). Generally speaking the
- <classname>char filters</classname> do some pre-processing in the
- character input, then the <classname>Tokenizer</classname> starts the
- tokenizing process by turning the character input into tokens which
- are then further processed by the <classname>TokenFilter</classname>s.
- Hibernate Search supports this infrastructure by utilizing the Solr
- analyzer framework. Make sure to add<filename> solr-core.jar and
- </filename><filename>solr-solrj.jar</filename> to your classpath to
- use analyzer definitions. In case you also want to use the snowball
- stemmer also include the <filename>lucene-snowball.jar.</filename>
- Other Solr analyzers might depend on more libraries. For example, the
- <classname>PhoneticFilterFactory</classname> depends on <ulink
- url="http://commons.apache.org/codec">commons-codec</ulink>. Your
- distribution of Hibernate Search provides these dependencies in its
- <filename>lib</filename> directory.</para>
+ in a very flexible way (just like Lego). Generally speaking, the char
+ filters do some pre-processing on the character input, then the
+ <classname>Tokenizer</classname> starts the tokenizing process by
+ turning the character input into tokens which are then further
+ processed by the <classname>TokenFilter</classname>s. Hibernate Search
+ supports this infrastructure by utilizing the Solr analyzer framework.</para>
- <example>
+ <tip>
+ <para>Some of the analyzers and filters will require additional
+ dependencies. For example, to use the snowball stemmer you have to
+ also include the <literal>lucene-snowball</literal> jar and for the
+ <classname>PhoneticFilterFactory</classname> you need the <ulink
+ url="http://commons.apache.org/codec">commons-codec</ulink> jar.
+ Your distribution of Hibernate Search provides these dependencies in
+ its <filename>lib/optional</filename> directory. Have a look at
+ <xref linkend="table-available-tokenizers" /> and <xref
+ linkend="table-available-filters" /> to see which analyzers and
+ filters have additional dependencies.</para>
+ </tip>
+
+ <para>Let's have a look at a concrete example now - <xref
+ linkend="example-analyzer-def" />. First, a char filter is defined by
+ its factory. In our example, a mapping char filter is used, and will
+ replace characters in the input based on the rules specified in the
+ mapping file. Next, a tokenizer is defined. This example uses the
+ standard tokenizer. Last but not least, a list of filters is defined
+ by their factories. In our example, the
+ <classname>StopFilter</classname> filter is built reading the
+ dedicated words property file. The filter is also expected to ignore
+ case.</para>
+
+ <example id="example-analyzer-def">
<title><classname>@AnalyzerDef</classname> and the Solr framework</title>
@@ -753,29 +768,19 @@
}</programlisting>
</example>
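The body of this example falls outside the diff hunk above. For orientation, an `@AnalyzerDef` combining the three described components looks roughly like the sketch below; the definition name and the two resource file paths are illustrative assumptions, not the exact values from the manual.

```java
import javax.persistence.Entity;

import org.apache.solr.analysis.MappingCharFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.StopFilterFactory;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.CharFilterDef;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@Entity
@Indexed
// "customanalyzer" and both property file paths are assumed for illustration
@AnalyzerDef(name = "customanalyzer",
    // char filter: replaces characters according to the rules in the mapping file
    charFilters = {
        @CharFilterDef(factory = MappingCharFilterFactory.class,
            params = @Parameter(name = "mapping", value = "mapping-chars.properties"))
    },
    // tokenizer: the standard tokenizer
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    // filters: a stop-word filter reading the dedicated words file, ignoring case
    filters = {
        @TokenFilterDef(factory = StopFilterFactory.class,
            params = {
                @Parameter(name = "words", value = "stoplist.properties"),
                @Parameter(name = "ignoreCase", value = "true")
            })
    })
public class Team {
    // ... indexed fields ...
}
```

The factories referenced here are the Solr factory classes listed in the tables further down; their parameters are passed as `@Parameter` name/value pairs.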
- <para>A char filter is defined by its factory which is responsible for
- building the char filter and using the optional list of parameters. In
- our example, a mapping char filter is used, and will replace
- characters in the input based on the rules specified in the mapping
- file. A tokenizer is also defined by its factory. This example use the
- standard tokenizer. A filter is defined by its factory which is
- responsible for creating the filter instance using the optional
- parameters. In our example, the StopFilter filter is built reading the
- dedicated words property file and is expected to ignore case. The list
- of parameters is dependent on the tokenizer or filter factory.</para>
-
- <warning>
+ <tip>
<para>Filters and char filters are applied in the order they are
- defined in the <classname>@AnalyzerDef</classname> annotation. Make
- sure to think twice about this order.</para>
- </warning>
+ defined in the <classname>@AnalyzerDef</classname> annotation. Order
+ matters!</para>
+ </tip>
<para>Once defined, an analyzer definition can be reused by an
- <classname>@Analyzer</classname> declaration using the definition name
- rather than declaring an implementation class.</para>
+ <classname>@Analyzer</classname> declaration as seen in <xref
+ linkend="example-referencing-analyzer-def" />.</para>
- <example>
+ <example id="example-referencing-analyzer-def">
<title>Referencing an analyzer by name</title>
<programlisting>@Entity
@Indexed
@@ -798,17 +803,17 @@
</example>
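The listing of this referencing example is likewise truncated in the hunk; a minimal sketch of what such a declaration looks like follows, with the entity and field names assumed for illustration.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed
public class Team {          // entity and field names assumed for illustration

    @Id
    private Long id;

    // reference the analyzer definition by name
    // instead of declaring an implementation class
    @Field
    @Analyzer(definition = "customanalyzer")
    private String description;
}
```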
<para>Analyzer instances declared by
- <classname>@AnalyzerDef</classname> are available by their name in the
- <classname>SearchFactory</classname>.</para>
+ <classname>@AnalyzerDef</classname> are also available by their name
+ in the <classname>SearchFactory</classname>, which is quite useful
+ when building queries.</para>
<programlisting>Analyzer analyzer =
fullTextSession.getSearchFactory().getAnalyzer("customanalyzer");</programlisting>
- <para>This is quite useful wen building queries. Fields in queries
- should be analyzed with the same analyzer used to index the field so
- that they speak a common "language": the same tokens are reused
- between the query and the indexing process. This rule has some
- exceptions but is true most of the time. Respect it unless you know
- what you are doing.</para>
+ <para>Fields in queries should be analyzed with the same analyzer used
+ to index the field so that they speak a common "language": the same
+ tokens are reused between the query and the indexing process. This
+ rule has some exceptions but is true most of the time. Respect it
+ unless you know what you are doing.</para>
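In practice, the retrieved analyzer is typically handed to a Lucene QueryParser so that query terms are tokenized the same way as the indexed fields. A sketch under stated assumptions (the field name, query string, and Lucene version constant are illustrative, and `fullTextSession` is assumed to be an open FullTextSession):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// obtain the analyzer registered under the @AnalyzerDef name
Analyzer analyzer =
    fullTextSession.getSearchFactory().getAnalyzer("customanalyzer");

// parse the user input with the same analyzer used at indexing time,
// so query and index share the same tokens
QueryParser parser = new QueryParser(Version.LUCENE_29, "description", analyzer);
Query luceneQuery = parser.parse("storm");
```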
</section>
<section>
@@ -820,23 +825,25 @@
url="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters&quo...;.
Let's check a few of them.</para>
- <table>
- <title>Some of the available char filters</title>
+ <table id="table-available-char-filters">
+ <title>Examples of available char filters</title>
- <tgroup cols="3">
+ <tgroup cols="4">
<thead>
<row>
<entry align="center">Factory</entry>
<entry align="center">Description</entry>
- <entry align="center">parameters</entry>
+ <entry align="center">Parameters</entry>
+
+ <entry align="center">Additional dependencies</entry>
</row>
</thead>
<tbody>
<row>
- <entry>MappingCharFilterFactory</entry>
+ <entry><classname>MappingCharFilterFactory</classname></entry>
<entry>Replaces one or more characters with one or more
characters, based on mappings specified in the resource
@@ -848,100 +855,131 @@
"ñ" => "n"
"ø" => "o"
</literallayout> </para></entry>
+
+ <entry>none</entry>
</row>
<row>
- <entry>HTMLStripCharFilterFactory</entry>
+ <entry><classname>HTMLStripCharFilterFactory</classname></entry>
<entry>Removes standard HTML tags, keeping the text</entry>
<entry>none</entry>
+
+ <entry>none</entry>
</row>
</tbody>
</tgroup>
</table>
- <table>
- <title>Some of the available tokenizers</title>
+ <table id="table-available-tokenizers">
+ <title>Examples of available tokenizers</title>
- <tgroup cols="3">
+ <tgroup cols="4">
<thead>
<row>
<entry align="center">Factory</entry>
<entry align="center">Description</entry>
- <entry align="center">parameters</entry>
+ <entry align="center">Parameters</entry>
+
+ <entry align="center">Additional dependencies</entry>
</row>
</thead>
<tbody>
<row>
- <entry>StandardTokenizerFactory</entry>
+ <entry><classname>StandardTokenizerFactory</classname></entry>
<entry>Use the Lucene StandardTokenizer</entry>
<entry>none</entry>
+
+ <entry>none</entry>
</row>
<row>
- <entry>HTMLStripStandardTokenizerFactory</entry>
+ <entry><classname>HTMLStripStandardTokenizerFactory</classname></entry>
<entry>Remove HTML tags, keep the text and pass it to a
StandardTokenizer. @Deprecated, use the
HTMLStripCharFilterFactory instead</entry>
<entry>none</entry>
+
+ <entry>none</entry>
</row>
+
+ <row>
+ <entry><classname>PatternTokenizerFactory</classname></entry>
+
+ <entry>Breaks text at the specified regular expression
+ pattern.</entry>
+
+ <entry><para><literal>pattern</literal>: the regular
+ expression to use for tokenizing</para><para><literal>group</literal>:
+ says which pattern group to extract into tokens</para></entry>
+
+ <entry><literal>commons-io</literal></entry>
+ </row>
</tbody>
</tgroup>
</table>
- <table>
- <title>Some of the available filters</title>
+ <table id="table-available-filters">
+ <title>Examples of available filters</title>
- <tgroup cols="3">
+ <tgroup cols="4">
<thead>
<row>
<entry align="center">Factory</entry>
<entry align="center">Description</entry>
- <entry align="center">parameters</entry>
+ <entry align="center">Parameters</entry>
+
+ <entry align="center">Additional dependencies</entry>
</row>
</thead>
<tbody>
<row>
- <entry>StandardFilterFactory</entry>
+ <entry><classname>StandardFilterFactory</classname></entry>
<entry>Remove dots from acronyms and 's from words</entry>
<entry>none</entry>
+
+ <entry>none</entry>
</row>
<row>
- <entry>LowerCaseFilterFactory</entry>
+ <entry><classname>LowerCaseFilterFactory</classname></entry>
- <entry>Lowercase words</entry>
+ <entry>Lowercases all words</entry>
<entry>none</entry>
+
+ <entry>none</entry>
</row>
<row>
- <entry>StopFilterFactory</entry>
+ <entry><classname>StopFilterFactory</classname></entry>
- <entry>remove words (tokens) matching a list of stop
+ <entry>Remove words (tokens) matching a list of stop
words</entry>
<entry><para><literal>words</literal>: points to a resource
file containing the stop words</para><para><literal>ignoreCase</literal>: true if
<literal>case</literal> should be ignored when comparing stop
words, <literal>false</literal> otherwise</para></entry>
+
+ <entry>none</entry>
</row>
<row>
- <entry>SnowballPorterFilterFactory</entry>
+ <entry><classname>SnowballPorterFilterFactory</classname></entry>
<entry>Reduces a word to its root in a given language (e.g.
protect, protects, protection share the same root). Using such
@@ -950,15 +988,56 @@
<entry><literal>language</literal>: Danish, Dutch, English,
Finnish, French, German, Italian, Norwegian, Portuguese,
Russian, Spanish, Swedish and a few more</entry>
+
+ <entry><literal>lucene-snowball</literal></entry>
</row>
<row>
- <entry>ISOLatin1AccentFilterFactory</entry>
+ <entry><classname>ISOLatin1AccentFilterFactory</classname></entry>
- <entry>remove accents for languages like French</entry>
+ <entry>Remove accents for languages like French</entry>
<entry>none</entry>
+
+ <entry>none</entry>
</row>
+
+ <row>
+ <entry><classname>PhoneticFilterFactory</classname></entry>
+
+ <entry>Inserts phonetically similar tokens into the token
+ stream</entry>
+
+ <entry><para><literal>encoder</literal>: One of
+ DoubleMetaphone, Metaphone, Soundex or
+ RefinedSoundex</para><para><literal>inject</literal>:
+ <constant>true</constant> will add tokens to the stream,
+ <constant>false</constant> will replace the existing
+ token</para><para><literal>maxCodeLength</literal>: sets the
+ maximum length of the code to be generated. Supported only for
+ Metaphone and DoubleMetaphone encodings</para></entry>
+
+ <entry><literal>commons-codec</literal></entry>
+ </row>
+
+ <row>
+ <entry><classname>CollationKeyFilterFactory</classname></entry>
+
+ <entry>Converts each token into its
+ <classname>java.text.CollationKey</classname>, and then
+ encodes the <classname>CollationKey</classname> with
+ <classname>IndexableBinaryStringTools</classname>, to allow it
+ to be stored as an index term.</entry>
+
+ <entry><literal>custom</literal>, <literal>language</literal>,
+ <literal>country</literal>, <literal>variant</literal>,
+ <literal>strength</literal>, <literal>decomposition</literal>:
+ see Lucene's <classname>CollationKeyFilter</classname>
+ javadocs for more info</entry>
+
+ <entry><literal>lucene-collation</literal>,
+ <literal>commons-io</literal></entry>
+ </row>
</tbody>
</tgroup>
</table>