[hibernate-dev] Re: improving Search

Tue Jun 10 16:48:55 EDT 2008

On  Jun 10, 2008, at 07:17, Sanne Grinovero wrote:

> Hello Emmanuel,
> as you asked how to use the SnapshotDeletionPolicy:
> when you create the IndexWriter do:
>
>   IndexDeletionPolicy policy = new KeepOnlyLastCommitDeletionPolicy();
>   SnapshotDeletionPolicy snapshotter = new SnapshotDeletionPolicy(policy);
>   IndexWriter writer = new IndexWriter(dir, autoCommit, analyzer, snapshotter);
>
> then when you want to make a copy you "freeze" the current index
> segments, asking the IndexWriter not to remove or otherwise change
> the files you are going to copy;
> it also tells you which files you should copy to get a clone of the
> index as it was at snapshot time:
>
> try {
>   IndexCommitPoint commit = snapshotter.snapshot();
>   Collection fileNames = commit.getFileNames();
>   // iterate over fileNames and copy each file
> } finally {
>   snapshotter.release();
> }
> (credits to Lucene in Action, 2nd edition MEAP)
> Should I add this improvement to JIRA? I'll add the code examples;
> you also already opened HSEARCH-152,
> is that meant for this same purpose?

Yes that's the goal for HSEARCH-152
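
The copy step left open in the snapshot example quoted above could be sketched roughly as follows. This is only an illustration using plain java.nio, not Lucene or Hibernate Search code: the class SnapshotCopy and the method copyFiles are hypothetical names, and it assumes the index lives in a filesystem directory and that the names returned by commit.getFileNames() are plain file names relative to that directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collection;

// Hypothetical helper, not part of Lucene or Hibernate Search.
final class SnapshotCopy {

    private SnapshotCopy() {}

    /**
     * Copies the named files from indexDir into backupDir and returns how
     * many files were copied. fileNames is assumed to hold the names
     * reported by the snapshot (commit.getFileNames() in the quoted example).
     */
    static int copyFiles(Path indexDir, Path backupDir, Collection<String> fileNames)
            throws IOException {
        Files.createDirectories(backupDir);
        int copied = 0;
        for (String name : fileNames) {
            // Only files listed by the snapshot are copied; anything else in
            // the directory may belong to a newer, unfinished commit.
            Files.copy(indexDir.resolve(name), backupDir.resolve(name),
                    StandardCopyOption.REPLACE_EXISTING);
            copied++;
        }
        return copied;
    }
}
```

The call would sit inside the try block of the quoted example, between snapshot() and release(), so that the files cannot be deleted while they are being copied.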

>
>
> >> Does it somehow involve not having cluster changes (ie intra VM  
> policy rather than inter VM?)
> I don't really understand what you're asking; I hope the previous
> example contains an answer; it just means you don't need
> to lock the index to make a "hotcopy" to anywhere.

My question is: does it somehow involve interacting with the
IndexWriter so that it does not do things it would otherwise do?
I.e. if I run indexing on VM1 and VM2, will copying files "from VM1" be
affected by IndexWriter operations from VM2?

>
>
> >>The explanation is correct but a bit cataclysmic. We open the  
> reader *only* if there is an actual deletion for the given Directory  
> provider. So in some systems we might very well almost never open a  
> rw reader.
> Well yes, it doesn't always happen, but it's not only for deletions:
> all updates are also split into delete+insert work AFAIK.
> If you combine this with the open issue that the index is updated on
> every dirty entity (without checking whether the indexed fields are
> the ones that changed), this translates to lots of reopening, twice
> for every transaction updating some entity... (still correct?)

Yes correct.
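
To make the delete+insert splitting concrete, here is a minimal sketch of how a queue of work could be expanded and regrouped so that all deletions run in one phase and all additions in another. All names (WorkQueueSketch, Work, Kind, splitAndReorder) are hypothetical and this is not the actual Hibernate Search code; it also ignores cases where reordering would change the outcome, such as an add followed by a delete of the same id.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hibernate Search code.
final class WorkQueueSketch {

    enum Kind { ADD, DELETE, UPDATE }

    static final class Work {
        final Kind kind;
        final String id;

        Work(Kind kind, String id) {
            this.kind = kind;
            this.id = id;
        }
    }

    private WorkQueueSketch() {}

    /**
     * Expands every UPDATE into a DELETE plus an ADD, then returns all
     * deletions first (the phase needing the read-write reader) followed
     * by all additions (the phase needing the writer).
     */
    static List<Work> splitAndReorder(List<Work> queue) {
        List<Work> deletions = new ArrayList<>();
        List<Work> additions = new ArrayList<>();
        for (Work w : queue) {
            switch (w.kind) {
                case DELETE:
                    deletions.add(w);
                    break;
                case ADD:
                    additions.add(w);
                    break;
                case UPDATE:
                    deletions.add(new Work(Kind.DELETE, w.id));
                    additions.add(new Work(Kind.ADD, w.id));
                    break;
            }
        }
        List<Work> ordered = new ArrayList<>(deletions);
        ordered.addAll(additions);
        return ordered;
    }
}
```

With this grouping, a transaction containing any update needs both phases, which is where the two reopenings per transaction come from.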

>
>
> >> We need to check how efficiently the implementation has been done.
> >> If it is a mechanism similar to IndexModifier, it's not worth it.
> Agreed, I'll try to find that out.
>
> >> This is what happens by default already, one index directory per
> >> entity type. We could compute a flag at init time to know that (it's
> >> already in place actually) and use it to run a Term query
> >> rather than the full query if only one entity is present in the
> >> index. Let's open a JIRA issue
> Well, my purpose is to avoid the need to switch from reader to
> writer; this also simplifies the reordering code where you split all
> work into two sequences (it would be unneeded).
> For deletion and eviction we should check whether the workspace
> has an IndexWriter or IndexReader available and use whichever we
> have; but this would complicate the code and I don't like it, so I
> really hope we can avoid the use of IndexReader for modification
> operations altogether.
> As we already found good solutions for mass eviction and for the case
> when the flag says it's ok to delete by id, there's just one case
> left we should think about.
> Should I open the JIRA for the partial solution as far as we got
> currently?

Yep

>
>
> >> Not if it's updated in a cluster, right?
> >> Plus, seeing the lock contention we have experienced (on
> >> IndexReader) in the recent test case, I want to be sure it's
> >> actually faster than opening every time.
> Why not? Don't you update by using delete+insert in a cluster too?

In a cluster, you open the IW, do what you have to do protected by the
global lock, then close the IW (and release the global lock), which
means the IW on another machine of the cluster can then have it.
Does it work the same if you keep the IW open? Does it release the
global lock? Does it cope with other IWs updating the file system?
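
The open / work / close cycle described here can be pictured with a lock file on the shared directory. The sketch below is only an illustration of that pattern and not Lucene's lock implementation: GlobalLockSketch and its methods are hypothetical names, and the file name write.lock is borrowed purely for familiarity. It relies on Files.createFile being atomic, so only one caller can create the file.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of a lock-file protected work cycle.
final class GlobalLockSketch {

    private final Path lockFile;

    GlobalLockSketch(Path indexDir) {
        this.lockFile = indexDir.resolve("write.lock");
    }

    /** Takes the lock by atomically creating the lock file; false if it is already held. */
    boolean tryAcquire() throws IOException {
        try {
            Files.createFile(lockFile);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Releases the lock so that another node can take it. */
    void release() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    /**
     * Runs the work only while holding the lock (the "open IW, work,
     * close IW" cycle). Returns false without running it if another
     * holder currently has the lock.
     */
    boolean runExclusively(Runnable work) throws IOException {
        if (!tryAcquire()) {
            return false;
        }
        try {
            work.run();
            return true;
        } finally {
            release();
        }
    }
}
```

In this picture, a writer that stays open corresponds to never calling release(), so a second node calling runExclusively would keep getting false; that is the concern behind the questions above.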

>
> I'll write performance tests just to be sure, but from past
> experience I really expect it to be much faster.
> Also note that all document analysis is done inside the IndexWriter,
> so not using it concurrently is a bottleneck when the analysis is
> expensive (think about PDFs in Blobs..);
> during analysis the files are locked to a single thread/transaction,
> but the time spent actually changing files is just a fraction, and
> Lucene has its own proper locks at those points.
>
> cheers,
> Sanne