[ http://jira.jboss.com/jira/browse/DNA-114?page=comments#action_12416362 ]
Michael Trezzi commented on DNA-114:
------------------------------------
Hello,
a preliminary sequencer is done. It still lacks some tests and code polish, but the logic is
there. Please let me know whether you think what it extracts is enough:
From any MS Office document, all metadata is extracted, e.g. Title,
Subject, Keywords, Description, number of pages, ...
From PowerPoint, every slide is sequenced, and the slide title, slide text,
and slide thumbnail are extracted.
From Excel, the full text and the names of the sheets are extracted.
Future upgrades: as Apache POI evolves, I plan to sequence all Excel chart titles and,
if I find a reasonable way to do it, the table of contents from Word documents.
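
To make the list above concrete, here is a minimal Java sketch of how the extracted values might be laid out as flat output properties. It is illustrative only: the class name, the flatten method, and the "document@...", "slide[n]@..." and "sheet[n]@..." key format are assumptions made for this example, and none of them are taken from the actual sequencer code or from Apache POI.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; these names do not come from JBoss DNA or Apache POI. */
final class OfficeOutputSketch {

    /** Flattens extracted values into "node@property" keys (assumed format). */
    static Map<String, String> flatten(Map<String, String> metadata,
                                       List<String> slideTitles,
                                       List<String> sheetNames) {
        Map<String, String> out = new LinkedHashMap<>();
        // Document-wide metadata (Title, Subject, Keywords, ...)
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.put("document@" + e.getKey(), e.getValue());
        }
        // One entry per PowerPoint slide, numbered from 1
        for (int i = 0; i < slideTitles.size(); i++) {
            out.put("slide[" + (i + 1) + "]@title", slideTitles.get(i));
        }
        // One entry per Excel sheet, numbered from 1
        for (int i = 0; i < sheetNames.size(); i++) {
            out.put("sheet[" + (i + 1) + "]@name", sheetNames.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Quarterly report");
        metadata.put("keywords", "sales, forecast");
        Map<String, String> out = flatten(metadata,
                Arrays.asList("Overview", "Results"),
                Arrays.asList("Sheet1"));
        // Print every flattened key/value pair in insertion order
        out.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A real sequencer would fill these inputs from the parsed file rather than from literals; slide text and thumbnails are left out here to keep the sketch short.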
Create MS Office file sequencer
-------------------------------
Key: DNA-114
URL: http://jira.jboss.com/jira/browse/DNA-114
Project: DNA
Issue Type: Task
Components: Sequencers
Reporter: Randall Hauch
Fix For: 0.2
Create a single sequencer that is capable of sequencing MS Office files, including MS
Word, MS Excel, and MS PowerPoint. All of the files' standard metadata (author,
title, word count, page count, etc.) should be extracted, as should metadata specific to
the different kinds of files. For example, the sequencer should extract all of the
content of Excel spreadsheets, while from PowerPoint presentations it should extract at
least the slide titles (and ideally thumbnails).
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira