[infinispan-issues] [JBoss JIRA] (ISPN-5103) Inefficient index updates cause high cost merges and increase overall latency

Gustavo Fernandes (JIRA) issues at jboss.org
Thu Jan 8 05:42:29 EST 2015


    [ https://issues.jboss.org/browse/ISPN-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030796#comment-13030796 ] 

Gustavo Fernandes commented on ISPN-5103:
-----------------------------------------

Even if it is stored, Lucene will store it only once, and the same goes for the derived terms. The overhead shows up in very large indexes: since this entity is present in all documents, there will be a big posting list associated with it. Even though the postings are stored and traversed efficiently (skip lists + VInt), we could certainly get rid of it if it is not needed.
I'm just wondering whether it could be useful when querying directly via Lucene.
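
To give a rough idea of the posting list size, here is a minimal sketch of delta + VInt encoding for a term that appears in every document. It is a simplified illustration and not Lucene's actual postings format; the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;

final class VIntPostingsSketch {

    // Variable-length int: 7 payload bits per byte, high bit set while more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes a sorted list of doc ids as VInt-encoded gaps between consecutive ids.
    static byte[] encodePostings(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A term present in every document: doc ids 0, 1, 2, ...
        int docCount = 1_000_000;
        int[] everyDoc = new int[docCount];
        for (int i = 0; i < docCount; i++) {
            everyDoc[i] = i;
        }
        byte[] encoded = encodePostings(everyDoc);
        System.out.println(encoded.length + " bytes for " + docCount + " postings");
    }
}
```

Every gap is 0 or 1 here, so each posting costs a single byte in this sketch: compact, but the list still grows linearly with the number of documents.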

> Inefficient index updates cause high cost merges and increase overall latency
> -----------------------------------------------------------------------------
>
>                 Key: ISPN-5103
>                 URL: https://issues.jboss.org/browse/ISPN-5103
>             Project: Infinispan
>          Issue Type: Enhancement
>          Components: Embedded Querying
>    Affects Versions: 7.0.2.Final, 7.1.0.Alpha1
>            Reporter: Gustavo Fernandes
>            Assignee: Gustavo Fernandes
>
> Currently, every change to the index is performed at the Lucene level by combining two operations:
> * Delete by query, using a boolean query on the id plus the entity class
> * Add
>  
> Under high load, especially during merges, those numerous deletes cause very long delays, resulting in high latency.
> We should instead use a simple Lucene Update to add/change documents, since internally it translates to a Delete by term plus an Add operation, and deletes by term are extremely efficient in Lucene.
> Some local tests showed the average latency of updating the index with this strategy dropping by a factor of 4, for both the SYNC and ASYNC backends.
> With regard to sharing the index between entities, which was the original motivation for the Delete by query plus Add strategy, we have two scenarios:
> * Same cache with multiple entity types: that's a non-issue, since obviously there's no id collision in this case
> * Different caches with the same index: this scenario happens when different caches share the same index, for example:
> {code}
> @Indexed(index = "common")
> public class Country { ... }
> @Indexed(index = "common")
> public class Currency { ... }
> // Two different caches, same key, same shared index
> cm.getCache("currencies").put(1, new Currency(...));
> cm.getCache("countries").put(1, new Country(...));
> {code}
> This would require a delete by query in order to persist both a Country and a Currency with id=1.
> It would also require setting "default.exclusive_index_use" to "false", with the associated cost of having to reopen the IndexWriter on every operation.
> Given that the performance gain of doing a simple Update is considerable, we should either support this corner case through extra configuration or, alternatively, generate a unique @ProvidedId that includes the entity class or the cache name and works for all the cases described above.
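
The id collision described above, and the composite id idea, can be sketched with a toy in-memory stand-in for the shared index. This is only an illustration under stated assumptions: ToyIndex, compositeId and the "cacheName:key" format are hypothetical and are not Infinispan or Lucene APIs.

```java
import java.util.HashMap;
import java.util.Map;

final class CompositeIdSketch {

    // Toy stand-in for a shared index: at most one document per id term.
    static final class ToyIndex {
        private final Map<String, String> docsByIdTerm = new HashMap<>();

        // Mimics update-by-term: any document carrying this id term is replaced.
        void update(String idTerm, String document) {
            docsByIdTerm.put(idTerm, document);
        }

        int size() {
            return docsByIdTerm.size();
        }
    }

    // Hypothetical composite id: cache name, a separator, then the key.
    static String compositeId(String cacheName, Object key) {
        return cacheName + ":" + key;
    }

    public static void main(String[] args) {
        // Plain ids: the Country with id=1 replaces the Currency with id=1.
        ToyIndex plain = new ToyIndex();
        plain.update("1", "Currency");
        plain.update("1", "Country");

        // Composite ids: both documents survive in the shared index.
        ToyIndex composite = new ToyIndex();
        composite.update(compositeId("currencies", 1), "Currency");
        composite.update(compositeId("countries", 1), "Country");

        System.out.println("plain ids: " + plain.size() + " document(s)");
        System.out.println("composite ids: " + composite.size() + " document(s)");
    }
}
```

With plain ids the second update silently replaces the first document, which is the case that delete by query was protecting against; with the composite id a single-term update is enough.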


