[infinispan-issues] [JBoss JIRA] (ISPN-8980) High concurrency : Infinispan Directory Provider: Lucene : Error loading metadata for index file

Gustavo Fernandes (JIRA) issues at jboss.org
Mon Apr 16 09:58:00 EDT 2018


    [ https://issues.jboss.org/browse/ISPN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561552#comment-13561552 ] 

Gustavo Fernandes edited comment on ISPN-8980 at 4/16/18 9:57 AM:
------------------------------------------------------------------

I created a unit test (against the 8.2.x branch) with two nodes using persistent caches, similar to your setup: https://github.com/gustavonalle/infinispan/commit/6c7eb8b8a0bc24bf5179510e6f06fc360ee98fe5

The test basically starts a node, indexes a number of entities using the Hibernate Search directory provider, then starts a second node that joins the cluster and checks that all data is transferred correctly via state transfer. The test also stops the first node and queries the second node to guarantee that all data is still queryable and available.

One thing you can observe when running the test is that the size on disk *is not necessarily the same for both nodes*: the SingleFileStore can perform some compaction, so the size on disk can differ depending on how the data is written.

So it seems the issue you are observing is not caused by state transfer between nodes. Maybe you could try to change the unit test to replicate your problem? So far it seems to me that the lack of proper locking is what is causing the index corruption.
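
In outline, the test does the following (a rough sketch only; {{Node}} and the helper methods below are illustrative names, not the actual code from the commit):

{code:java}
// Rough outline of the test flow. Illustrative only: Node, startNodeWithPersistentIndexCaches,
// indexEntities, waitForStateTransfer and queryAll are hypothetical helpers, not the real test code.
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class IndexStateTransferOutlineTest {

   @Test
   public void indexSurvivesStateTransferAndNodeShutdown() throws Exception {
      // 1. Start a single node backed by persistent caches and index some entities
      //    through the Hibernate Search / Infinispan directory provider.
      Node node1 = startNodeWithPersistentIndexCaches();
      indexEntities(node1, 1_000);

      // 2. Start a second node: the index caches are pushed to it via state transfer.
      Node node2 = startNodeWithPersistentIndexCaches();
      waitForStateTransfer(node1, node2);

      // 3. All indexed data must be queryable from the joining node.
      assertEquals(1_000, queryAll(node2).size());

      // 4. Stop the original node: the remaining node must still serve the full index.
      node1.stop();
      assertEquals(1_000, queryAll(node2).size());
   }
}
{code}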


was (Author: gustavonalle):
I created a unit test (against the 8.2.x branch) with two nodes using persistent caches, similar to your setup: https://github.com/gustavonalle/infinispan/commit/e50d6325ccd1ee3fed7b8600c9826123df50f5db

The test basically starts a node, indexes a number of entities using the Hibernate Search directory provider, then starts a second node that joins the cluster and checks that all data is transferred correctly via state transfer. The test also stops the first node and queries the second node to guarantee that all data is still queryable and available.

One thing you can observe when running the test is that the size on disk *is not necessarily the same for both nodes*: the SingleFileStore can perform some compaction, so the size on disk can differ depending on how the data is written.

So it seems the issue you are observing is not caused by state transfer between nodes. Maybe you could try to change the unit test to replicate your problem? So far it seems to me that the lack of proper locking is what is causing the index corruption.

> High concurrency : Infinispan Directory Provider: Lucene : Error loading metadata for index file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ISPN-8980
>                 URL: https://issues.jboss.org/browse/ISPN-8980
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Lucene Directory
>    Affects Versions: 8.2.5.Final
>            Reporter: Debashish Bharali
>            Assignee: Gustavo Fernandes
>            Priority: Critical
>         Attachments: JoiningNode_N2.zip, OriginalNode_N1.zip, SysOutLogs.txt, neutrino-hibernate-search-worker-jgroups.xml, neutrino-hibernatesearch-infinispan.xml
>
>
> Under high concurrency, we are getting *{color:red}'Error loading metadata for index file'{color}* even in a *{color:red}Non-Clustered{color}* env.
> *Hibernate Search Indexes (Lucene Indexes) - 5.7.0.Final*
> *Infinispan - 8.2.5.Final*
> *infinispan-directory-provider-8.2.5.Final*
> *jgroups-3.6.7.Final*
> *Worker Backend : JGroups*
> *Worker Execution: Sync*
> *write_metadata_async: false (implicitly)*
> *Note:* Currently we are on a non-clustered env. We are moving to a clustered env within a few days.
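> For reference, this setup corresponds roughly to the following Hibernate Search 5.x properties (illustrative sketch only; the actual values are in the attached configuration files, and the 'default' index scope is assumed):
> {code:java}
> // Sketch of the reported setup as Hibernate Search 5.x configuration properties.
> // Illustrative only: the key names assume the "default" index scope; the real
> // configuration is in the attached neutrino-hibernatesearch-infinispan.xml and
> // neutrino-hibernate-search-worker-jgroups.xml files.
> import java.util.Properties;
>
> public class ReportedSearchSettings {
>    public static Properties asProperties() {
>       Properties p = new Properties();
>       p.setProperty("hibernate.search.default.directory_provider", "infinispan");   // Infinispan Directory Provider
>       p.setProperty("hibernate.search.default.worker.backend", "jgroups");          // Worker Backend: JGroups
>       p.setProperty("hibernate.search.default.worker.execution", "sync");           // Worker Execution: Sync
>       p.setProperty("hibernate.search.default.write_metadata_async", "false");      // implicitly false
>       return p;
>    }
> }
> {code}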
> On analyzing the code and putting some additional SYSOUT loggers into the FileListOperations and DirectoryImplementor classes, we have established the following points:
> # This is happening during high concurrency on a non-clustered env.
> # One thread *'T1'* is deleting a segment and its segment name *'SEG1'* from the *'FileListCacheKey'* list *stored in the MetadataCache*.
> # Concurrently, another thread *'T2'* is looping through a 'copy list' of the FileList [obtained for the FileListCacheKey from the MetadataCache via the toArray method of *FileListOperations*], while *'T1'* keeps modifying the corresponding original list.
> # *'T2'* is calling the openInput method on each segment name, getting the corresponding metadata segment from the *MetadataCache*.
> # However, for *'T2'*, the *'copy list'* still contains the name of segment *'SEG1'*.
> # So while looping through the list, *'T2'* tries to get the segment from the MetadataCache for segment name *'SEG1'*.
> # But at this instant, the *segment* corresponding to segment name *'SEG1'* has already been removed from the *MetadataCache* by *'T1'*.
> # This results in *'java.io.FileNotFoundException: Error loading metadata for index file'* for segment name *'SEG1'*.
> # As mentioned earlier, this happens more often during high concurrency (a minimal sketch of this interleaving follows below).
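> The sketch below reproduces the interleaving in isolation, using plain JDK collections as stand-ins for the MetadataCache and the FileListCacheKey list (illustrative code only, not Infinispan internals):
> {code:java}
> // Minimal sketch of the race: if T1 removes a segment after T2 has taken its
> // snapshot of the file list but before T2 looks up the metadata, T2 still sees
> // 'SEG1' in the snapshot but finds no metadata for it.
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class MetadataRaceSketch {
>    static final Map<String, Object> metadataCache = new ConcurrentHashMap<>();
>    static final Set<String> fileList = ConcurrentHashMap.newKeySet();
>
>    public static void main(String[] args) throws Exception {
>       metadataCache.put("SEG1", new Object());
>       fileList.add("SEG1");
>
>       // T2: takes a snapshot of the file list (like FileListOperations.toArray)
>       // and looks up each segment's metadata (like DirectoryImplementor.openInput).
>       Thread t2 = new Thread(() -> {
>          String[] snapshot = fileList.toArray(new String[0]);
>          for (String name : snapshot) {
>             if (metadataCache.get(name) == null) {
>                System.out.println("Error loading metadata for index file: " + name);
>             }
>          }
>       });
>
>       // T1: deletes the segment, removing it from both the list and the metadata cache.
>       Thread t1 = new Thread(() -> {
>          fileList.remove("SEG1");
>          metadataCache.remove("SEG1");
>       });
>
>       t2.start();
>       t1.start();
>       t1.join();
>       t2.join();
>    }
> }
> {code}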
> *{color:red}On a standalone server (non-clustered), we are getting the below error intermittently:{color}*
> Full Stack trace:
> 2018-03-19 17:29:11,938  ERROR [Hibernate Search sync consumer thread for index com.nucleus.integration.ws.server.globalcustomer.entity.GlobalCustomer] o.h.s.e.i.LogErrorHandler [LogErrorHandler.java:69]
> 				*{color:red}HSEARCH000058: Exception occurred java.io.FileNotFoundException: Error loading metadata for index file{color}*: M|segments_w6|com.nucleus.integration.ws.server.globalcustomer.entity.GlobalCustomer|-1
> Primary Failure:
> 	Entity com.nucleus.integration.ws.server.globalcustomer.entity.GlobalCustomer  Id 1649990024999813056  Work Type  org.hibernate.search.backend.AddLuceneWork
> java.io.FileNotFoundException: Error loading metadata for index file: M|segments_w6|com.nucleus.integration.ws.server.globalcustomer.entity.GlobalCustomer|-1
> 	at org.infinispan.lucene.impl.DirectoryImplementor.openInput(DirectoryImplementor.java:138) ~[infinispan-lucene-directory-8.2.5.Final.jar:8.2.5.Final]
> 	at org.infinispan.lucene.impl.DirectoryLucene.openInput(DirectoryLucene.java:102) ~[infinispan-lucene-directory-8.2.5.Final.jar:8.2.5.Final]
> 	at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:109) ~[lucene-core-5.5.4.jar:5.5.4 31012120ebbd93744753eb37f1dbc5e654628291 - jpountz - 2017-02-08 19:08:03]
> 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:294) ~[lucene-core-5.5.4.jar:5.5.4 31012120ebbd93744753eb37f1dbc5e654628291 - jpountz - 2017-02-08 19:08:03]
> 	at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:171) ~[lucene-core-5.5.4.jar:5.5.4 31012120ebbd93744753eb37f1dbc5e654628291 - jpountz - 2017-02-08 19:08:03]
> 	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:949) ~[lucene-core-5.5.4.jar:5.5.4 31012120ebbd93744753eb37f1dbc5e654628291 - jpountz - 2017-02-08 19:08:03]
> 	at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.createNewIndexWriter(IndexWriterHolder.java:126) ~[hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.getIndexWriter(IndexWriterHolder.java:92) ~[hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.AbstractCommitPolicy.getIndexWriter(AbstractCommitPolicy.java:33) ~[hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.SharedIndexCommitPolicy.getIndexWriter(SharedIndexCommitPolicy.java:77) ~[hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.SharedIndexWorkspaceImpl.getIndexWriter(SharedIndexWorkspaceImpl.java:36) ~[hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriterDelegate(AbstractWorkspaceImpl.java:203) [hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.applyUpdates(LuceneBackendQueueTask.java:81) [hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.run(LuceneBackendQueueTask.java:46) [hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.SyncWorkProcessor$Consumer.applyChangesets(SyncWorkProcessor.java:165) [hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at org.hibernate.search.backend.impl.lucene.SyncWorkProcessor$Consumer.run(SyncWorkProcessor.java:151) [hibernate-search-engine-5.7.0.Final.jar:5.7.0.Final]
> 	at java.lang.Thread.run(Thread.java:785) [na:1.8.0-internal]
> *As per our understanding, this issue should not occur in a {color:red}'non-clustered'{color} env. It should also not arise when the worker execution is {color:red}'sync'{color}.*
> *We have debugged the code and confirmed that the value of {color:red}'write_metadata_async'{color} is indeed 'false' (as expected).*



--
This message was sent by Atlassian JIRA
(v7.5.0#75005)

