Hardy Ferentschik edited a comment on Bug HSEARCH-1171

Hi Florent,

thank you for taking the time to review this behavior.

No worries. We are in fact looking for some good Tika integration code. That's something we are very interested in, and any help is welcome.

To answer your question, I need to save the binary in the database, that's part of a requirement.

Fair enough then.

What is really puzzling me is that the same document can be converted in a few seconds in a unit test (ByteArrayBridgeTest), which is the expected behavior... but can either throw an OutOfMemoryError or take minutes within a Hibernate Search context.

Well, in ByteArrayBridgeTest you don't really do much at all. You just read the input stream, pipe it through Tika and create a string. There is no indexing involved and, more importantly, no database access. When I run the tests and step through them to see where most of the time is spent, it is em.flush() that is the main bottleneck. hsqldb is probably not well suited to this type of test. Have you experimented with other databases? Also, you might consider working with java.sql.Blob. This way you might not have to load the whole data into memory. Have a look at the org.hibernate.LobHelper class.
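
To illustrate the java.sql.Blob suggestion, here is a minimal sketch of reading a Blob through its stream with a fixed-size buffer instead of materializing it as one byte[]. The class name BlobStreaming and the helper countBytes are hypothetical, chosen only for this example. SerialBlob is an in-memory stand-in so the sketch runs without a database; in a Hibernate application the Blob would more likely come from session.getLobHelper().createBlob(stream, length), and whether the data is really streamed then depends on the JDBC driver.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

// Hypothetical example class, not part of Hibernate Search or the test case.
class BlobStreaming {

    // Reads the Blob in 8 KB chunks and returns the number of bytes seen.
    // Only one buffer's worth of data is held by this method at any time.
    static long countBytes(Blob blob) throws SQLException, IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        try (InputStream in = blob.getBinaryStream()) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read; // a real consumer would hand the chunk to a parser here
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // SerialBlob keeps its content in memory; it only stands in for a
        // driver-backed Blob so that the example is self-contained.
        Blob blob = new SerialBlob(new byte[100_000]);
        System.out.println(countBytes(blob));
    }
}
```

The same loop could feed the chunks to Tika rather than counting them, which would avoid holding the complete document in a byte array on the application side.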
