New subject: [Hibernate-JIRA] Commented: (HSEARCH-473) Fields for _hibernate_class and the document ID are hard-coded to be analyzed and have "norms" enabled

Wednesday, 17 March 2010

Fields for _hibernate_class and the document ID are hard-coded to be analyzed and have
"norms" enabled
------------------------------------------------------------------------------------------------------

                 Key: HSEARCH-473
                 URL:
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-473
             Project: Hibernate Search
          Issue Type: Bug
          Components: engine
    Affects Versions: 3.1.1.GA
            Reporter: Dobes Vandermeer

A little-known problem with Lucene is that it allocates one or more bytes of memory per
document in the index when "norms" are enabled for one or more fields involved
in a query.  These norms are used for scoring/ranking results but are not needed just to
find matching documents.

Currently Hibernate Search uses special fields for "id" and
"_hibernate_class" with norms *enabled*, which means these norms arrays are
created whenever Hibernate Search deletes something from the index, as it uses those fields
to find the related document.  This results in unnecessary memory use (many megabytes for
an index with millions of records in it).

Although this can be worked around by simply allocating a larger heap to the JVM, it can
become quite a significant issue if you plan to support hundreds of simultaneous users on
a DB with millions of records: any user action that triggers an entity deletion may cause
the norms array to be created, so you may have to allocate hundreds of megabytes of heap
just to allow for the creation of these unnecessary norms arrays.  This may still be
mitigated by using asynchronous search index updates, so that there's a fixed number of
threads processing the deletions, but I haven't confirmed whether that is the case.
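
The heap figures above can be sanity-checked with a back-of-the-envelope sketch. It assumes
one byte of norm data per document for each field with norms enabled, and it takes the
pessimistic view that every simultaneous deletion holds its own copy of the arrays, which
is unconfirmed. The class name, method names and sample numbers are hypothetical and are
not part of Lucene or Hibernate Search.

```java
// Rough estimate only; nothing here is Lucene or Hibernate Search API.
final class NormsHeapEstimate {

    // Assumption: one byte of norm data per document, per field with norms enabled.
    static final long BYTES_PER_NORM = 1L;

    /** Bytes needed to hold the norms arrays for one load of the given fields. */
    static long normsBytes(long documentCount, int fieldsWithNorms) {
        return documentCount * fieldsWithNorms * BYTES_PER_NORM;
    }

    /** Pessimistic total if every concurrent deletion holds its own copy of the arrays. */
    static long worstCaseBytes(long documentCount, int fieldsWithNorms, int concurrentDeletions) {
        return normsBytes(documentCount, fieldsWithNorms) * concurrentDeletions;
    }

    public static void main(String[] args) {
        long docs = 5_000_000L;   // illustrative index size
        int fields = 2;           // the id field and _hibernate_class
        int deletions = 100;      // illustrative number of simultaneous deletions

        System.out.printf("per load: %d bytes%n", normsBytes(docs, fields));
        System.out.printf("worst case: %d bytes%n", worstCaseBytes(docs, fields, deletions));
    }
}
```

If the deletions are instead funnelled through a fixed pool of worker threads, the
multiplier would be the pool size rather than the number of users.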

The definition of these fields is hard-coded and they do not pay any attention to any
@Field or @FieldBridge annotation on the id of the entity:

{code:java|title=org.hibernate.search.engine.DocumentBuilderIndexedEntity.getDocument(T, Serializable, Map<String, String>)}

Field classField =
	new Field(
		CLASS_FIELDNAME,
		entityType.getName(),
		Field.Store.YES,
		Field.Index.NOT_ANALYZED,
		Field.TermVector.NO
	);
doc.add( classField );

// now add the entity id to the document
LuceneOptions luceneOptions = new LuceneOptionsImpl(
	Field.Store.YES,
	Field.Index.NOT_ANALYZED, Field.TermVector.NO, idBoost
);
idBridge.set( idKeywordName, id, doc, luceneOptions );

{code}

The fix is relatively straightforward: use Field.Index.NOT_ANALYZED_NO_NORMS instead of
Field.Index.NOT_ANALYZED for both these fields.

Related to http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-469

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://opensource.atlassian.com/projects/hibernate/secure/Administrators....
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
