[hibernate-issues] [Hibernate-JIRA] Updated: (HSEARCH-473) Fields for _hibernate_class and the document ID are hard-coded to be analyzed and have "norms" enabled (Dobes Vandermeer)

Sanne Grinovero (JIRA) noreply at atlassian.com
Wed Mar 17 17:30:31 EDT 2010


     [ http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sanne Grinovero updated HSEARCH-473:
------------------------------------

    Summary: Fields for _hibernate_class and the document ID are hard-coded to be analyzed and have "norms" enabled (Dobes Vandermeer)  (was: Fields for _hibernate_class and the document ID are hard-coded to be analyzed and have "norms" enabled)

> Fields for _hibernate_class and the document ID are hard-coded to be analyzed and have "norms" enabled (Dobes Vandermeer)
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HSEARCH-473
>                 URL: http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-473
>             Project: Hibernate Search
>          Issue Type: Bug
>          Components: engine
>    Affects Versions: 3.1.1.GA
>            Reporter: Dobes Vandermeer
>            Assignee: Sanne Grinovero
>             Fix For: 3.2.0.Beta2
>
>
> A little-known problem with Lucene is that it allocates one or more bytes of memory per document in the index when "norms" are enabled for one or more fields involved in a query.  These norms are used for scoring/ranking, but not necessarily for search itself.
> Currently Hibernate Search uses special fields for "id" and "_hibernate_class" with norms *enabled*. Because it uses those fields to locate the related document, the norms arrays are created whenever Hibernate Search deletes something from the index. This results in unnecessary memory use (many megabytes for an index with millions of records in it).
> Although this can be worked around by simply allocating a larger heap to the JVM, it can become a significant issue if you plan to support hundreds of simultaneous users on a DB with millions of records: any user action that triggers an entity deletion may cause the norms arrays to be created, so you may have to allocate hundreds of megabytes of heap just to accommodate these unnecessary arrays.  This may be mitigated by using asynchronous search index updates, so that a fixed number of threads processes the deletions, but I haven't confirmed whether that is the case.
> The definition of these fields is hard-coded and ignores any @Field or @FieldBridge annotation on the id of the entity:
> {code:java|title=org.hibernate.search.engine.DocumentBuilderIndexedEntity.getDocument(T, Serializable, Map<String, String>)}
> Field classField =
> 	new Field(
> 		CLASS_FIELDNAME,
> 		entityType.getName(),
> 		Field.Store.YES,
> 		Field.Index.NOT_ANALYZED,
> 		Field.TermVector.NO
> 	);
> doc.add( classField );
> // now add the entity id to the document
> LuceneOptions luceneOptions = new LuceneOptionsImpl(
> 	Field.Store.YES,
> 	Field.Index.NOT_ANALYZED, Field.TermVector.NO, idBoost
> );
> 	idBridge.set( idKeywordName, id, doc, luceneOptions );
> {code}
> The fix is relatively straightforward: use Field.Index.NOT_ANALYZED_NO_NORMS instead of Field.Index.NOT_ANALYZED for both of these fields.
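> For reference, the corrected snippet would look something like the following. This is only a sketch of the proposed change, assuming the Field.Index constants from Lucene 2.9; everything else is unchanged from the code above:
> {code:java|title=Sketch of the proposed change in getDocument(...)}
> // NOT_ANALYZED_NO_NORMS indexes the exact term but skips the per-document norms byte
> Field classField =
> 	new Field(
> 		CLASS_FIELDNAME,
> 		entityType.getName(),
> 		Field.Store.YES,
> 		Field.Index.NOT_ANALYZED_NO_NORMS,
> 		Field.TermVector.NO
> 	);
> doc.add( classField );
> // the id field gets the same treatment
> LuceneOptions luceneOptions = new LuceneOptionsImpl(
> 	Field.Store.YES,
> 	Field.Index.NOT_ANALYZED_NO_NORMS, Field.TermVector.NO, idBoost
> );
> idBridge.set( idKeywordName, id, doc, luceneOptions );
> {code}
> Since both fields hold exact, untokenized terms used only for lookup, dropping the norms loses nothing relevant to scoring.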
> Related to http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-469

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://opensource.atlassian.com/projects/hibernate/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
