[hibernate-dev] Re: improving Search

Sanne Grinovero sanne.grinovero at gmail.com
Tue Jun 10 07:17:57 EDT 2008


Hello Emmanuel,
as you asked how to use the SnapshotDeletionPolicy:
when you create the IndexWriter do:

  IndexDeletionPolicy policy = new KeepOnlyLastCommitDeletionPolicy();
  SnapshotDeletionPolicy snapshotter = new SnapshotDeletionPolicy(policy);
  IndexWriter writer = new IndexWriter(dir, autoCommit, analyzer, snapshotter);

then, when you want to make a copy, you "freeze" the current set of index
segments, asking the IndexWriter not to remove or otherwise change the
files you are going to copy;
it also tells you which files you should copy to get a clone of the index as
it was at snapshot time:

try {
  IndexCommitPoint commit = snapshotter.snapshot();
  Collection fileNames = commit.getFileNames();
  for (Iterator it = fileNames.iterator(); it.hasNext();) {
    String fileName = (String) it.next();
    // copy each snapshotted file from the index directory to your backup
    // destination (copyFile, indexDir and backupDir are up to your environment)
    copyFile(new File(indexDir, fileName), new File(backupDir, fileName));
  }
} finally {
  snapshotter.release();
}
(credits to Lucene in Action, 2nd edition MEAP)
Should I add this improvement to JIRA? I'll add the code examples. Also, you
already opened HSEARCH-152:
is that meant for this same purpose?

>> Does it somehow involve not having cluster changes (ie intra VM policy
>> rather than inter VM?)
I don't really understand what you're asking; I hope the previous example
contains an answer. It just means you don't need
to lock the index to make a "hot copy" to anywhere.

>> The explanation is correct but a bit cataclysmic. We open the reader
>> *only* if there is an actual deletion for the given Directory provider. So
>> in some systems we might very well almost never open a rw reader.
Well, yes, it doesn't happen always, but not only for deletions: all updates
are also split into delete+insert work AFAIK.
If you combine this with the open issue that the index is updated on every
dirty entity (without checking whether only
the indexed fields changed), this translates to lots of re-opening, twice
for every transaction updating some entity... (still correct?)
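To make the delete+insert point concrete, here is a minimal self-contained
sketch in plain Java (ToyWriter is a stand-in I made up, not the real Lucene
IndexWriter, whose deleteDocuments/addDocument pair plays the same roles): an
entity update always turns into two index operations.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an append-only index writer (illustrative, not Lucene):
// documents cannot be changed in place, only deleted and re-added.
class ToyWriter {
    final List<String> ops = new ArrayList<String>();

    void delete(String id) { ops.add("delete:" + id); }
    void add(String id)    { ops.add("add:" + id); }

    // An entity "update" is split into delete + insert work:
    void update(String id) {
        delete(id); // drop the stale document by id
        add(id);    // re-index the fresh version
    }
}

public class UpdateAsDeleteInsert {
    public static void main(String[] args) {
        ToyWriter w = new ToyWriter();
        w.update("42");
        // one dirty entity -> two index operations
        System.out.println(w.ops); // prints [delete:42, add:42]
    }
}
```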

>> We need to check how efficient the implementation has been done. If it's a
>> mechanism similar to IndexModifier, it's not worth it.
Agreed, I'll try to find that out.

>> This is what happens by default already, one index directory by entity
>> type. We could compute a flag at init time to know that (it's already in
>> place actually) and use it to issue a Term query
>> rather than the full query if only one entity is present in the index.
>> Let's open a JIRA issue
Well, my purpose is to avoid the need to switch from reader to writer; this
also simplifies the reordering code where you split all work into two
sequences (it would become unneeded).
For deletion and eviction we should check whether the workspace has an
IndexWriter or IndexReader available and use whichever we have; but this
would complicate the
code and I don't like it, so I really hope we can avoid the use of
IndexReader for modification operations altogether.
As we already found a good solution for mass eviction, and for the case when
the flag says it's OK to delete by id, there's just one case left we should
think about.
Should I open the JIRA for the partial solution as far as we've got currently?
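For the delete-by-id discussion above, a tiny self-contained sketch (the
names and the string encoding are stand-ins I invented, not the actual
Hibernate Search workspace code) of what the init-time flag would buy us:
with a single entity type per Directory, a deletion needs only an id term;
otherwise the class discriminator must be part of the delete.

```java
// Stand-in deletion logic: if the directory provider is known at init
// time to index exactly one entity type, a delete can target the id
// field alone (a simple Term delete, no IndexReader round-trip);
// otherwise the class discriminator must disambiguate the id.
public class DeleteStrategy {
    static String deleteTerm(boolean singleEntityInIndex, String className, String id) {
        if (singleEntityInIndex) {
            return "id:" + id;                             // simple Term delete
        }
        return "class:" + className + " AND id:" + id;     // full query needed
    }

    public static void main(String[] args) {
        System.out.println(deleteTerm(true, "Book", "7"));  // prints id:7
        System.out.println(deleteTerm(false, "Book", "7")); // prints class:Book AND id:7
    }
}
```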

>> Not if it's updated in a cluster, right?
>> Plus, seeing the lock contention we have experienced (on IndexReader) in
>> the recent test case, I want to be sure it's actually faster than opening
>> every time.
Why not? Don't you update by using delete+insert in a cluster too?
I'll write performance tests just to be sure, but from past experience I
really expect it to be much faster.
Also note that all document analysis is done inside the IndexWriter, so not
using it concurrently is a bottleneck when the analysis is expensive (think
about PDFs in Blobs...);
during analysis the files are locked to a single thread/transaction, but the
time spent actually changing files is just a fraction, and Lucene has its
own proper locks at those points.
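The analysis-inside-the-lock argument can be sketched with plain Java
locking (a stand-in ReentrantLock, not Lucene's actual locking, and analyze()
is a dummy for expensive extraction): if the costly analysis runs while the
exclusive lock is held, every other transaction queues behind it; doing the
analysis first keeps the critical section down to the short file-changing
phase.

```java
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for the exclusive index lock (illustrative, not Lucene code).
public class AnalysisBottleneck {
    static final ReentrantLock indexLock = new ReentrantLock();

    // Dummy for expensive analysis (think extracting text from a PDF Blob).
    static String analyze(String raw) {
        return raw.toLowerCase();
    }

    // Bottleneck: analysis happens inside the critical section,
    // so every concurrent transaction waits through the slow part.
    static void indexHoldingLock(String raw) {
        indexLock.lock();
        try {
            String doc = analyze(raw);
            write(doc);
        } finally {
            indexLock.unlock();
        }
    }

    // Better: analyze first, then lock only for the short write phase.
    static void indexAnalyzeFirst(String raw) {
        String doc = analyze(raw);
        indexLock.lock();
        try {
            write(doc);
        } finally {
            indexLock.unlock();
        }
    }

    static void write(String doc) { /* append to index files */ }
}
```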

cheers,
Sanne