Author: sergiykarpenko
Date: 2010-09-07 11:11:10 -0400 (Tue, 07 Sep 2010)
New Revision: 3068
Added:
jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core/tika-document-reader-service.xml
Modified:
jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core.xml
Log:
EXOJCR-749: TikaDocumentReaderService documentation added
Added:
jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core/tika-document-reader-service.xml
===================================================================
---
jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core/tika-document-reader-service.xml
(rev 0)
+++
jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core/tika-document-reader-service.xml 2010-09-07
15:11:10 UTC (rev 3068)
@@ -0,0 +1,379 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
+"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
+<chapter>
+ <?dbhtml filename="ch-core-document-reader-service.html"?>
+
+ <title>Tika Document Reader Service</title>
+
+ <section>
+ <title>Intro</title>
+
+ <para>DocumentReaderService provides an API to retrieve a DocumentReader by
+ mimetype.</para>
+
+ <para>A DocumentReader lets the user fetch the content of a document as a String or,
+ in the case of TikaDocumentReader, as a Reader.</para>
+ </section>
+
+ <section>
+ <title>Architecture</title>
+
+ <para>Basically, DocumentReaderService is a container for all registered
+ DocumentReaders. You can register a DocumentReader (method
+ addDocumentReader(ComponentPlugin reader)) and fetch a DocumentReader by
+ mimetype (method getDocumentReader(String mimeType)).</para>
+
+ <para>TikaDocumentReaderServiceImpl extends DocumentReaderService with a
+ simple goal: read the Tika configuration and lazily register each Tika Parser
+ as a TikaDocumentReader.</para>
+
+ <para><note>
+ <para>By default, Tika Parsers are not registered in the readers
+ <mimetype, DocumentReader> map. When a user tries to fetch a
+ DocumentReader by a mimetype that is not yet registered,
+ TikaDocumentReaderService checks the Tika configuration and registers a
+ new mimetype-DocumentReader pair for the Parser configured there.</para>
+ </note></para>
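+
+ <para>The lazy registration described above can be sketched roughly as
+ follows. This is a simplified, self-contained model and not the actual
+ TikaDocumentReaderServiceImpl code: SketchReader, LazyReaderRegistry, the
+ map standing in for the parsed tika-config.xml, and the exception thrown
+ for an unconfigured mimetype are all assumptions made for
+ illustration.</para>

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a DocumentReader; it only carries a label. */
final class SketchReader {
    final String label;

    SketchReader(String label) {
        this.label = label;
    }
}

/** Simplified model of lazy registration (assumption, not the real service). */
class LazyReaderRegistry {
    // mimetype -> reader pairs registered so far
    private final Map<String, SketchReader> readers = new HashMap<>();
    // mimetype -> parser class name, standing in for the parsed tika-config.xml
    private final Map<String, String> tikaConfig;

    LazyReaderRegistry(Map<String, String> tikaConfig) {
        this.tikaConfig = tikaConfig;
    }

    int registeredCount() {
        return readers.size();
    }

    SketchReader getDocumentReader(String mimeType) {
        SketchReader reader = readers.get(mimeType);
        if (reader == null) {
            // Nothing registered yet: consult the Tika configuration.
            String parserClass = tikaConfig.get(mimeType);
            if (parserClass == null) {
                // Assumption: the sketch simply fails for unconfigured mimetypes.
                throw new IllegalArgumentException("No reader for " + mimeType);
            }
            // Wrap the configured parser and remember the new pair.
            reader = new SketchReader("tika:" + parserClass);
            readers.put(mimeType, reader);
        }
        return reader;
    }
}

class LazyReaderRegistryDemo {
    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("application/rtf", "org.apache.tika.parser.rtf.RTFParser");

        LazyReaderRegistry registry = new LazyReaderRegistry(config);
        System.out.println(registry.registeredCount());

        SketchReader reader = registry.getDocumentReader("application/rtf");
        System.out.println(reader.label + " " + registry.registeredCount());
    }
}
```

+ <para>In this sketch the registry starts empty, and the first lookup for
+ "application/rtf" both creates and stores the reader, so a second lookup
+ returns the same instance.</para>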
+ </section>
+
+ <section>
+ <title>Configuration</title>
+
+ <para>An example TikaDocumentReaderServiceImpl
+ configuration:<programlisting><component>
+
<key>org.exoplatform.services.document.DocumentReaderService</key>
+
<type>org.exoplatform.services.document.impl.tika.TikaDocumentReaderServiceImpl</type>
+
+ <!-- Old-style document readers -->
+ <component-plugins>
+ <component-plugin>
+ <name>pdf.document.reader</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.PDFDocumentReader</type>
+ <description>to read the pdf
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerMSWord</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.MSWordDocumentReader</type>
+ <description>to read the ms word
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerMSXWord</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.MSXWordDocumentReader</type>
+ <description>to read the ms word
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerMSExcel</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.MSExcelDocumentReader</type>
+ <description>to read the ms excel
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerMSXExcel</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.MSXExcelDocumentReader</type>
+ <description>to read the ms excel
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerMSOutlook</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.MSOutlookDocumentReader</type>
+ <description>to read the ms outlook
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>PPTdocument.reader</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.PPTDocumentReader</type>
+ <description>to read the ms ppt
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>MSXPPTdocument.reader</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.MSXPPTDocumentReader</type>
+ <description>to read the ms pptx
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerHTML</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.HTMLDocumentReader</type>
+ <description>to read the html
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerXML</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.XMLDocumentReader</type>
+ <description>to read the xml
inputstream</description>
+ </component-plugin>
+
+ <component-plugin>
+ <name>TPdocument.reader</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.TextPlainDocumentReader</type>
+ <description>to read the plain text
inputstream</description>
+ <init-params>
+ <!--
+ values-param>
<name>defaultEncoding</name>
<description>description</description>
<value>UTF-8</value>
+ </values-param
+ -->
+ </init-params>
+ </component-plugin>
+
+ <component-plugin>
+ <name>document.readerOO</name>
+ <set-method>addDocumentReader</set-method>
+
<type>org.exoplatform.services.document.impl.OpenOfficeDocumentReader</type>
+ <description>to read the OO
inputstream</description>
+ </component-plugin>
+
+ </component-plugins>
+
+ <init-params>
+ <value-param>
+ <name>tika-configuration</name>
+ <value>jar:/conf/portal/tika-config.xml</value>
+ </value-param>
+ </init-params>
+
+ </component>
+</programlisting></para>
+
+ <para>tika-config.xml example:<programlisting><properties>
+
+ <mimeTypeRepository magic="false"/>
+ <parsers>
+
+ <parser name="parse-dcxml"
class="org.apache.tika.parser.xml.DcXMLParser">
+ <mime>application/xml</mime>
+ <mime>image/svg+xml</mime>
+ <mime>text/xml</mime>
+ <mime>application/x-google-gadget</mime>
+ </parser>
+
+ <parser name="parse-office"
class="org.apache.tika.parser.microsoft.OfficeParser">
+ <mime>application/excel</mime>
+ <mime>application/xls</mime>
+ <mime>application/msworddoc</mime>
+ <mime>application/msworddot</mime>
+ <mime>application/powerpoint</mime>
+ <mime>application/ppt</mime>
+
+ <mime>application/x-tika-msoffice</mime>
+ <mime>application/msword</mime>
+ <mime>application/vnd.ms-excel</mime>
+
<mime>application/vnd.ms-excel.sheet.binary.macroenabled.12</mime>
+ <mime>application/vnd.ms-powerpoint</mime>
+ <mime>application/vnd.visio</mime>
+ <mime>application/vnd.ms-outlook</mime>
+ </parser>
+
+ <parser name="parse-ooxml"
class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
+ <mime>application/x-tika-ooxml</mime>
+
<mime>application/vnd.openxmlformats-package.core-properties+xml</mime>
+
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
+
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.template</mime>
+
<mime>application/vnd.ms-excel.sheet.macroenabled.12</mime>
+
<mime>application/vnd.ms-excel.template.macroenabled.12</mime>
+
<mime>application/vnd.ms-excel.addin.macroenabled.12</mime>
+
<mime>application/vnd.openxmlformats-officedocument.presentationml.presentation</mime>
+
<mime>application/vnd.openxmlformats-officedocument.presentationml.template</mime>
+
<mime>application/vnd.openxmlformats-officedocument.presentationml.slideshow</mime>
+
<mime>application/vnd.ms-powerpoint.presentation.macroenabled.12</mime>
+
<mime>application/vnd.ms-powerpoint.slideshow.macroenabled.12</mime>
+
<mime>application/vnd.ms-powerpoint.addin.macroenabled.12</mime>
+
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
+
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.template</mime>
+
<mime>application/vnd.ms-word.document.macroenabled.12</mime>
+
<mime>application/vnd.ms-word.template.macroenabled.12</mime>
+ </parser>
+
+ <parser name="parse-html"
class="org.apache.tika.parser.html.HtmlParser">
+ <mime>text/html</mime>
+ </parser>
+
+ <parser name="parse-rtf"
class="org.apache.tika.parser.rtf.RTFParser">
+ <mime>application/rtf</mime>
+ </parser>
+
+ <parser name="parse-pdf"
class="org.apache.tika.parser.pdf.PDFParser">
+ <mime>application/pdf</mime>
+ </parser>
+
+ <parser name="parse-txt"
class="org.apache.tika.parser.txt.TXTParser">
+ <mime>text/plain</mime>
+ <mime>script/groovy</mime>
+ <mime>application/x-groovy</mime>
+ <mime>application/x-javascript</mime>
+ <mime>application/javascript</mime>
+ <mime>text/javascript</mime>
+ </parser>
+
+ <parser name="parse-openoffice"
class="org.apache.tika.parser.opendocument.OpenOfficeParser">
+
+
<mime>application/vnd.oasis.opendocument.database</mime>
+
+ <mime>application/vnd.sun.xml.writer</mime>
+ <mime>application/vnd.oasis.opendocument.text</mime>
+
<mime>application/vnd.oasis.opendocument.graphics</mime>
+
<mime>application/vnd.oasis.opendocument.presentation</mime>
+
<mime>application/vnd.oasis.opendocument.spreadsheet</mime>
+ <mime>application/vnd.oasis.opendocument.chart</mime>
+ <mime>application/vnd.oasis.opendocument.image</mime>
+
<mime>application/vnd.oasis.opendocument.formula</mime>
+
<mime>application/vnd.oasis.opendocument.text-master</mime>
+
<mime>application/vnd.oasis.opendocument.text-web</mime>
+
<mime>application/vnd.oasis.opendocument.text-template</mime>
+
<mime>application/vnd.oasis.opendocument.graphics-template</mime>
+
<mime>application/vnd.oasis.opendocument.presentation-template</mime>
+
<mime>application/vnd.oasis.opendocument.spreadsheet-template</mime>
+
<mime>application/vnd.oasis.opendocument.chart-template</mime>
+
<mime>application/vnd.oasis.opendocument.image-template</mime>
+
<mime>application/vnd.oasis.opendocument.formula-template</mime>
+ <mime>application/x-vnd.oasis.opendocument.text</mime>
+
<mime>application/x-vnd.oasis.opendocument.graphics</mime>
+
<mime>application/x-vnd.oasis.opendocument.presentation</mime>
+
<mime>application/x-vnd.oasis.opendocument.spreadsheet</mime>
+
<mime>application/x-vnd.oasis.opendocument.chart</mime>
+
<mime>application/x-vnd.oasis.opendocument.image</mime>
+
<mime>application/x-vnd.oasis.opendocument.formula</mime>
+
<mime>application/x-vnd.oasis.opendocument.text-master</mime>
+
<mime>application/x-vnd.oasis.opendocument.text-web</mime>
+
<mime>application/x-vnd.oasis.opendocument.text-template</mime>
+
<mime>application/x-vnd.oasis.opendocument.graphics-template</mime>
+
<mime>application/x-vnd.oasis.opendocument.presentation-template</mime>
+
<mime>application/x-vnd.oasis.opendocument.spreadsheet-template</mime>
+
<mime>application/x-vnd.oasis.opendocument.chart-template</mime>
+
<mime>application/x-vnd.oasis.opendocument.image-template</mime>
+
<mime>application/x-vnd.oasis.opendocument.formula-template</mime>
+ </parser>
+
+ <parser name="parse-image"
class="org.apache.tika.parser.image.ImageParser">
+ <mime>image/bmp</mime>
+ <mime>image/gif</mime>
+ <mime>image/jpeg</mime>
+ <mime>image/png</mime>
+ <mime>image/tiff</mime>
+ <mime>image/vnd.wap.wbmp</mime>
+ <mime>image/x-icon</mime>
+ <mime>image/x-psd</mime>
+ <mime>image/x-xcf</mime>
+ </parser>
+
+ <parser name="parse-class"
class="org.apache.tika.parser.asm.ClassParser">
+ <mime>application/x-tika-java-class</mime>
+ </parser>
+
+ <parser name="parse-mp3"
class="org.apache.tika.parser.mp3.Mp3Parser">
+ <mime>audio/mpeg</mime>
+ </parser>
+
+ <parser name="parse-midi"
class="org.apache.tika.parser.audio.MidiParser">
+ <mime>application/x-midi</mime>
+ <mime>audio/midi</mime>
+ </parser>
+
+ <parser name="parse-audio"
class="org.apache.tika.parser.audio.AudioParser">
+ <mime>audio/basic</mime>
+ <mime>audio/x-wav</mime>
+ <mime>audio/x-aiff</mime>
+ </parser>
+
+ </parsers>
+
+</properties></programlisting></para>
+ </section>
+
+ <section>
+ <title>Old-style DocumentReaders and Tika Parsers</title>
+
+ <para>As the configuration above shows, both old-style
+ DocumentReaders and new Tika parsers are registered.</para>
+
+ <para><emphasis>But MSWordDocumentReader and
+ org.apache.tika.parser.microsoft.OfficeParser both refer to the same
+ "application/msword"</emphasis> mimetype, an attentive reader may object.
+ That is correct, but only one DocumentReader will be fetched.</para>
+
+ <para>Old-style DocumentReaders declared in the configuration are
+ registered in DocumentReaderService. The mimetypes supported by those
+ DocumentReaders therefore already have a registered pair, and the user
+ will always get these DocumentReaders from the getDocumentReader(..)
+ method. The Tika configuration is checked for Parsers only if no
+ DocumentReader is registered for the requested mimetype.</para>
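+
+ <para>The precedence rule above can be illustrated with a minimal sketch.
+ It models the two sources as plain maps and is not the real service code:
+ the class ReaderPrecedenceSketch and its resolve method are hypothetical,
+ while the mimetypes and class names are taken from the configuration
+ shown earlier.</para>

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that mimics the lookup order described in the text. */
class ReaderPrecedenceSketch {

    /**
     * Returns the name of the reader or parser that would serve the mimetype:
     * an explicitly registered reader first, a Tika parser only as a fallback,
     * or null when neither source knows the mimetype.
     */
    static String resolve(Map<String, String> registeredReaders,
                          Map<String, String> tikaParsers,
                          String mimeType) {
        String reader = registeredReaders.get(mimeType);
        if (reader != null) {
            return reader;
        }
        return tikaParsers.get(mimeType);
    }

    public static void main(String[] args) {
        // Old-style reader declared as a component-plugin.
        Map<String, String> registered = new HashMap<>();
        registered.put("application/msword", "MSWordDocumentReader");

        // Parsers declared in tika-config.xml.
        Map<String, String> tika = new HashMap<>();
        tika.put("application/msword", "OfficeParser");
        tika.put("application/vnd.visio", "OfficeParser");

        // Both sources know application/msword; the old-style reader wins.
        System.out.println(resolve(registered, tika, "application/msword"));
        // Only Tika knows application/vnd.visio; the parser is used.
        System.out.println(resolve(registered, tika, "application/vnd.visio"));
    }
}
```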
+
+ <section>
+ <title>How to make and register your own DocumentReader</title>
+
+ <para>You can make your own DocumentReader in two ways.</para>
+
+ <para><emphasis role="bold">Old-Style Document
+ Reader</emphasis>:<itemizedlist>
+ <listitem>
+ <para>extend BaseDocumentReader <programlisting>public class
MyDocumentReader extends BaseDocumentReader
+{
+ public String[] getMimeTypes()
+ {
+ return new String[]{"mymimetype"};
+ }
+ ...
+}</programlisting></para>
+ </listitem>
+
+ <listitem>
+ <para>register it as
component-plugin<programlisting><component-plugin>
+ <name>my.DocumentReader</name>
+ <set-method>addDocumentReader</set-method>
+ <type>com.mycompany.document.MyDocumentReader</type>
+ <description>to read my own file format</description>
+</component-plugin></programlisting></para>
+ </listitem>
+ </itemizedlist></para>
+
+ <para><emphasis role="bold">Tika
Parser</emphasis>:<itemizedlist>
+ <listitem>
+ <para>implement Parser<programlisting>public class MyParser
implements Parser
+{
+ ...
+}</programlisting></para>
+ </listitem>
+
+ <listitem>
+ <para>register it in tika-config.xml<programlisting>
<parser name="parse-mydocument"
class="com.mycompany.document.MyParser">
+ <mime>mymimetype</mime>
+ </parser></programlisting></para>
+ </listitem>
+ </itemizedlist></para>
+ </section>
+ </section>
+
+ <section>
+ <title>TikaDocumentReader features and notes</title>
+
+ <para>TikaDocumentReader features and notes:<itemizedlist>
+ <listitem>
+ <para>TikaDocumentReader can return document content as a Reader
+ object. Old-style DocumentReaders cannot;</para>
+ </listitem>
+
+ <listitem>
+ <para>TikaDocumentReader does not detect the document mimetype. You will
+ get exactly the parser configured in tika-config.xml;</para>
+ </listitem>
+
+ <listitem>
+ <para>All reader methods close the InputStream when they finish.</para>
+ </listitem>
+ </itemizedlist></para>
+ </section>
+</chapter>
Modified: jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core.xml
===================================================================
--- jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core.xml 2010-09-07
14:27:50 UTC (rev 3067)
+++ jcr/trunk/docs/reference/en/src/main/docbook/en-US/modules/core.xml 2010-09-07
15:11:10 UTC (rev 3068)
@@ -8,8 +8,11 @@
<xi:include href="core/core.xml"
xmlns:xi="http://www.w3.org/2001/XInclude" />
-
+
<xi:include href="core/db-creator-service.xml"
xmlns:xi="http://www.w3.org/2001/XInclude" />
+
+ <xi:include href="core/tika-document-reader-service.xml"
+
xmlns:xi="http://www.w3.org/2001/XInclude" />
</part>