[infinispan-issues] [JBoss JIRA] Commented: (ISPN-939) Index corruption when remote node dies during commit

Mon Feb 21 05:46:13 EST 2011

    [ https://issues.jboss.org/browse/ISPN-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583538#comment-12583538 ] 

Sanne Grinovero commented on ISPN-939:
--------------------------------------

Ah, right. What about using the SerialMergeScheduler? I know it's not a great solution; I'm just looking for a temporary workaround for you, and it would also be great to know whether it helps.

And you didn't answer about the version you're using :) It makes quite a difference whether you have ISPN-930 included or not: since ISPN-930 was committed (4.2.1.CR2, which I suppose you're not using because of the xsd issue), the metadata of existing segments is enabled only at segment close.

About your question: the batches are effectively canceled if you're running a transaction manager and use org.infinispan.lucene.locking.TransactionalLockFactory. For this one to apply index updates and have them visible/committed to the other nodes, you'll have to close the IndexWriter frequently (IndexWriter close == commit batched changes). With this approach the index is always guaranteed to be in a consistent state through the transaction manager's capabilities, but you'll have to wrap your changes in blocks of work: (open IW - apply changes - commit & close IW) (repeat). Also note the javadoc of TransactionalLockFactory: you'll need SerialMergeScheduler when using this locking.
Still, an unfinished batch shouldn't have affected the other nodes. You might try cancelling the batch, but I'm not sure of the rollback capabilities of a dummy transaction manager (which is the implementation behind batching).
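
To make the "blocks of work" idea concrete, here is a rough sketch of the loop shape only. It deliberately does not use the Lucene or Infinispan APIs: BlockWriter, WriterFactory and InMemoryIndex are hypothetical stand-ins invented for illustration, so the example runs without any library on the classpath. In a real setup the writer would presumably be an IndexWriter opened on the Infinispan directory with TransactionalLockFactory and SerialMergeScheduler configured.

```java
import java.util.ArrayList;
import java.util.List;

class UnitOfWorkSketch {

    /** Hypothetical stand-in for the writer operations this pattern relies on. */
    interface BlockWriter {
        void add(String document);
        void commitAndClose();   // models "IndexWriter close == commit batched changes"
        void rollback();         // discards everything added since the writer was opened
    }

    /** Hypothetical factory that opens one writer per block of work. */
    interface WriterFactory {
        BlockWriter open();
    }

    /**
     * Writes the documents in blocks: open a writer, apply up to blockSize
     * changes, commit and close, then repeat. Returns the number of blocks
     * that were committed.
     */
    static int writeInBlocks(WriterFactory factory, List<String> documents, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be >= 1");
        }
        int blocks = 0;
        for (int start = 0; start < documents.size(); start += blockSize) {
            int end = Math.min(start + blockSize, documents.size());
            BlockWriter writer = factory.open();
            try {
                for (String doc : documents.subList(start, end)) {
                    writer.add(doc);
                }
            } catch (RuntimeException e) {
                // A failed block is dropped as a whole; earlier blocks stay committed.
                writer.rollback();
                throw e;
            }
            writer.commitAndClose();
            blocks++;
        }
        return blocks;
    }

    /** In-memory toy index, present only so the sketch can run without Lucene. */
    static class InMemoryIndex implements WriterFactory {
        final List<String> committed = new ArrayList<>();

        @Override
        public BlockWriter open() {
            final List<String> pending = new ArrayList<>();
            return new BlockWriter() {
                @Override public void add(String document) { pending.add(document); }
                @Override public void commitAndClose() { committed.addAll(pending); pending.clear(); }
                @Override public void rollback() { pending.clear(); }
            };
        }
    }
}
```

The point of the shape is that other nodes only ever see whole blocks: a smaller block size makes changes visible sooner at the cost of more open/close cycles.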

The better solution is to hide the suspect exception, or understand if we can remove it. I'm going to figure out a unit test.

> Index corruption when remote node dies during commit
> ----------------------------------------------------
>
>                 Key: ISPN-939
>                 URL: https://issues.jboss.org/browse/ISPN-939
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Lucene Directory
>    Affects Versions: 4.2.1.CR2
>            Reporter: Tristan Tarrant
>            Assignee: Sanne Grinovero
>         Attachments: read_past_eof.log, suspect_exception_node1.log
>
>
> Using a scenario similar to the one described in ISPN-909:
> Infinispan: 3 caches: lockCache (replicated, volatile, no eviction), metadataCache (replicated, persisted, no eviction), dataCache (distributed, persisted, eviction, hash numOwners=2)
> Node 1: coordinator, IndexWriter open constantly and writing a stream of documents, committing after each one
> Node 2: opens a read-only IndexReader to perform queries, using reopen to keep in sync with the updates coming from node 1
> If we "kill -9" node 2 (to simulate a crash), we get a SuspectException in node 1 during the pre-commit phase (within IndexWriter.commit()). Catching the Throwable we then close() the writer but from then on we get "Read past EOF" errors when trying to access the index (both with readers and writers).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira