I experimented with your example project - https://github.com/hferentschik/hibernate-search-tika/tree/tika-blob-based - and switched to a stream-based approach throughout. This way you don't have to materialize the byte arrays. The Book entity now uses java.sql.Blob, and I am using the LobHelper to create the Blob (which means I have to fall back to Hibernate-specific APIs, though).
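To illustrate what "stream-based" means here, the following is a minimal sketch that uses only the JDK. It reads a java.sql.Blob chunk by chunk through getBinaryStream() instead of pulling the whole content into one byte array. The class name BlobStreamSketch, the helper names and the 8 KB buffer size are assumptions made for illustration, and SerialBlob is only an in-memory stand-in for the Blob that the LobHelper would create in the example project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

// Hypothetical helper class, not part of the example project.
class BlobStreamSketch {

    // Reads the Blob in fixed-size chunks and returns the number of bytes seen.
    // Only one buffer's worth of content is held in memory at a time.
    static long countBytes(Blob blob) {
        byte[] buffer = new byte[8192]; // assumed chunk size
        long total = 0;
        try (InputStream in = blob.getBinaryStream()) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read; // a parser would consume the chunk here
            }
        } catch (SQLException | IOException e) {
            throw new IllegalStateException("Could not read Blob content", e);
        }
        return total;
    }

    // In-memory stand-in for a database-backed Blob, used only for the demo.
    static Blob inMemoryBlob(byte[] data) {
        try {
            return new SerialBlob(data);
        } catch (SQLException e) {
            throw new IllegalStateException("Could not create Blob", e);
        }
    }

    public static void main(String[] args) {
        Blob blob = inMemoryBlob(new byte[20000]);
        System.out.println("Bytes streamed: " + countBytes(blob));
    }
}
```

In the real setup the loop body would hand the stream to the parser rather than count bytes; the point is only that the content never has to exist as a single array on the Java side.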
Another side effect of using Blobs is that I currently cannot use the mass indexer, but have to use either automatic indexing or the indexing API of FullTextSession.
I also tested this approach against PostgreSQL and MySQL, and in both cases the tests run much faster (6 to 8 seconds for me).
What do you think about this approach?
Another idea regarding Tika integration - we could add a TikaBridge to the Search code base. When used, it would dynamically try to discover/load the Tika classes (e.g. it could look for AutoDetectParser). The bridge could handle multiple types (Blob, byte[], and whatever else we could come up with). WDYT? Is this a good approach to integrate Tika into Search? Any better ideas or suggestions?
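A rough sketch of the two mechanics such a bridge would need, using only the JDK: a reflective check for the Tika parser class, and a dispatch that turns the supported property types into an InputStream. This is not an actual Hibernate Search API; the class name TikaBridgeSketch and the methods isClassPresent and openStream are hypothetical, and a real bridge would additionally implement the Search bridge contract and pass the stream to Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;

// Hypothetical sketch, not an existing Hibernate Search class.
class TikaBridgeSketch {

    // Fully qualified name of Tika's auto-detecting parser.
    static final String TIKA_PARSER_CLASS = "org.apache.tika.parser.AutoDetectParser";

    // Returns true if the named class can be loaded, without initializing it.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, TikaBridgeSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Maps a supported property value to a stream the parser could read from.
    static InputStream openStream(Object value) {
        if (value instanceof byte[]) {
            return new ByteArrayInputStream((byte[]) value);
        }
        if (value instanceof Blob) {
            try {
                return ((Blob) value).getBinaryStream();
            } catch (SQLException e) {
                throw new IllegalStateException("Could not open Blob stream", e);
            }
        }
        String type = (value == null) ? "null" : value.getClass().getName();
        throw new IllegalArgumentException("Unsupported type for Tika bridge: " + type);
    }

    public static void main(String[] args) {
        // The result depends on whether Tika is on the classpath.
        System.out.println("Tika available: " + isClassPresent(TIKA_PARSER_CLASS));
    }
}
```

Checking for the class by name keeps Tika an optional dependency: Search would compile and run without it, and the bridge could fail with a clear message only when it is actually used without Tika present.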