improving Search

About save data into table with...

Hibernate Search: massive batch...

Sanne Grinovero

Saturday, 7 June 2008 Sat, 7 Jun '08

10:32 a.m.

Hi all, I've been thinking on some more design changes for Hibernate Search, I hope to start a creative discussion. A)FSMasterDirectoryProvider and Index Snapshots. is currently making copies of it's changing index; Using Lucene's SnapshotDeletionPolicy you don't need to switch from an index to another, it is possible to take snapshots of current index even while it is in use, and copy it to another destination. The main problem I see with this is that the SnapshotDeletion policy must be given to the IndexWriter constructor, so the DirectoryProvider is going to need interaction with the Workspace; Also the DirectoryPRovider would need to know and potentially block the opening of a new IndexWriter; this could currently need many changes but would be easy when solved together to the next proposal: B) stop using IndexReader in write mode, doing all work in a single IndexWriter. There are currently many locks in Search to coordinate the 3 different operations done on indexes: 1- searching 2- deleting 3- inserting The last two are currently a bottleneck, as each transaction needs to lock the index on commit, open an indexreader (expensive), use it (could be expensive), close it (flushing files), open an indexwriter (expensive), close it (flushing again). Only at the end of this sequence the index is unlocked and the next transaction commit can be processed. (Emmanuel, did I get this short explanation right ? Too catastrophic?) We discussed sharing of these elements across different committing transactions, but it isn't easy to manage and benefits are not clear. If we could avoid the need to use an IndexReader to alter the index, the same IndexWriter could be reused, we could just flush changes when needed. The IndexReader appears to be used in two situations: the first if for "purgeAll(class)"; this can easily be replaced as the same functionality is available in IndexWriter. The other use is removal of an entity from the index: The current remove has many //TODO and a good explanations about why an IndexReader is needed; I hope we could resolve this issue, I have two ideas; The problem is Lucene permits removal only by using a single term, but we uniquely identify an entity by using the couple entityName + dbID Solutions: 1) next release of Lucene will provide removal by Query in IndexWriter, we make a BooleanQuery containing both entityName and database ID and it should be solved. 2) We add an additional field containing a unique identifier for entityname+id, use this for removal. This is probably the fastest way for frequently changing indexes but it will waste a lot of space. 3) To always use sharding when we detect multiple entities use the same index. We could define separated indexes for each different type of entities and transparently bridge them together when the query needs multiple types in return, using the usual MultiReaders. This may result in faster queries for single return types but slower queries for polymorphic queries. When using this sort of sharding deletion becomes easy by single term: the database identifier. If we could enforce the use of one of these strategies we could remove most locking; also the IndexWriter may be used concurrently by each thread needing to commit; it would never need to be closed or reopened as it is always aware of the complete index state. As the Document analysis process is done in the IndexWriter it is very important to use it concurrently; enfin having a single indexWriter per index could unlock much more ideas for additional improvements. Integrating with the "massive batch reindexer" from my previous post would be much easier, and it could become the standard indexing process without having to relax any transaction requirement. kind regards, and sorry for being so verbose Sanne

Show replies by date

Emmanuel Bernard

Monday, 9 June Mon, 9 Jun

2 p.m.

On Jun 7, 2008, at 11:32, Sanne Grinovero wrote:

...

How does SnapshotDeletionPolicy work exactly, I am not terribly familiar with DeletionPolicy yet. Does it somehow involve not haveing cluster changes (ie intra VM policy rather than inter VM?)

...

The main problem I see with this is that the SnapshotDeletion policy must be given to the IndexWriter constructor, so the DirectoryProvider is going to need interaction with the Workspace; Also the DirectoryPRovider would need to know and potentially block the opening of a new IndexWriter; this could currently need many changes but would be easy when solved together to the next proposal:

The DirectoryProvider has access to the SearchFactoryImplementor, so it can go set some flag available to other parts of the system if needed.

...

B) stop using IndexReader in write mode, doing all work in a single IndexWriter. There are currently many locks in Search to coordinate the 3 different operations done on indexes: 1- searching 2- deleting 3- inserting The last two are currently a bottleneck, as each transaction needs to lock the index on commit, open an indexreader (expensive), use it (could be expensive), close it (flushing files), open an indexwriter (expensive), close it (flushing again). Only at the end of this sequence the index is unlocked and the next transaction commit can be processed. (Emmanuel, did I get this short explanation right ? Too catastrophic?) We discussed sharing of these elements across different committing transactions, but it isn't easy to manage and benefits are not clear.

The explanation is correct but a bit cataclysmic. We open the reader *only* if there is an actual deletion for the given Directory provider. So in some systems we might very well almost never open a rw reader.

...

If we could avoid the need to use an IndexReader to alter the index, the same IndexWriter could be reused, we could just flush changes when needed. The IndexReader appears to be used in two situations: the first if for "purgeAll(class)"; this can easily be replaced as the same functionality is available in IndexWriter. The other use is removal of an entity from the index: The current remove has many //TODO and a good explanations about why an IndexReader is needed; I hope we could resolve this issue, I have two ideas; The problem is Lucene permits removal only by using a single term, but we uniquely identify an entity by using the couple entityName + dbID Solutions: 1) next release of Lucene will provide removal by Query in IndexWriter, we make a BooleanQuery containing both entityName and database ID and it should be solved.

We need to check how efficient the implementation has been done. If it a mechanism similar to IndexModifier, it's not worth it.

...

2) We add an additional field containing a unique identifier for entityname+id, use this for removal. This is probably the fastest way for frequently changing indexes but it will waste a lot of space.

I don't like this much as the index structure is very natural so far, there is no field that is specific to HSearch. Even the class type makes sense for third party apps.

...

3) To always use sharding when we detect multiple entities use the same index. We could define separated indexes for each different type of entities and transparently bridge them together when the query needs multiple types in return, using the usual MultiReaders. This may result in faster queries for single return types but slower queries for polymorphic queries. When using this sort of sharding deletion becomes easy by single term: the database identifier.

This is what happens by default already, one index directory by entity type. We could compute a flag at init time to know that (it's alrady in place actually) and use it to use a Term query rather than the full query if only one entity is present in the index. Let's open a JIRA issue

...

If we could enforce the use of one of these strategies we could remove most locking; also the IndexWriter may be used concurrently by each thread needing to commit; it would never need to be closed or reopened as it is always aware of the complete index state.

Not if it's updated in a cluster, right? Plus seeing the contention lock we have experienced (on IndexReader) in the recent test case, I want to be sure it's actually faster than opening every time.

...

As the Document analysis process is done in the IndexWriter it is very important to use it concurrently; enfin having a single indexWriter per index could unlock much more ideas for additional improvements. Integrating with the "massive batch reindexer" from my previous post would be much easier, and it could become the standard indexing process without having to relax any transaction requirement. kind regards, and sorry for being so verbose Sanne

Sanne Grinovero

Tuesday, 10 June Tue, 10 Jun

6:17 a.m.

Hello Emmanuel, as you asked how to use the SnapshotDeletionPolicy: when you create the IndexWriter do: IndexDeletionPolicy policy = new KeepOnlyLastCommitDeletionPolicy(); SnapshotDeletionPolicy snapshotter = new SnapshotDeletionPolicy(policy); IndexWriter writer = new IndexWriter(dir, autoCommit, analyzer, snapshotter); then when you want to make a copy you "freeze" the existence of current index segments, kindly asking the IndexWriter to avoid r emoving or otherwise change the files your are going to copy; it also tells you which files you should copy to get a clone of the index as at snapshot time: try { IndexCommitPoint commit = snapshotter.snapshot(); Collection fileNames = commit.getFileNames(); <iterate over & copy files from fileNames> } finally { snapshotter.release(); } (credits to Lucene in Action 2°edition MEAP) Should I add this improvement to JIRA? I'll add the code examples; also you opened already HSEARCH-152 is that meant for this same purpose?

...

> Does it somehow involve not having cluster changes (ie intra VM policy

rather than inter VM?) I don't really understand what you're asking; I hope previous example contains an answer; it just means you don't need to lock the index to make a "hotcopy" to anywhere.

...

>The explanation is correct but a bit cataclysmic. We open the reader

*only* if there is an actual deletion for the given Directory provider. So in some systems we might very well almost never open a rw reader. Well yes it doesn't happen always, but not only for deletions: also all updates are split to to delete+insert work AFAIK. if you combine this with the open issue that the index is updated on every dirty entity (not checking if only the indexed fields are changed) this translates to lots of re-opening, twice for every transaction updating some entity... (still correct?)

...

> We need to check how efficient the implementation has been done. If it a

mechanism similar to IndexModifier, it's not worth it. Agree, I'll try to discover that.

...

>This is what happens by default already, one index directory by entity

type. We could compute a flag at init time to know that (it's alrady in place actually) and use it to use a Term query

...

> rather than the full query if only one entity is present in the index.

Let's open a JIRA issue Well my purpose is to avoid the need to switch from reader to writer, this also simplifies the reordering code were you split all work in two sequences (would be unneded); For deletion and eviction we should check to see if the workspace has an indexwriter or indexreader available and use whatever we have available; but this would complicate code and don't like it, so I really hope we could really avoid the use of indexreader for modification operations; as we found already a good solutions for mass eviction and when the flag says it's ok to delete by id, there's just one case left we should think about. Should I open the JIRA for the partial solution as far as we got currently?

...

> Not if it's updated in a cluster, right? > Plus seeing the contention lock we have experienced (on IndexReader) in

the recent test case, I want to be sure it's actually faster than opening every time. Why not? don't you update by using delete+insert in cluster too? I'll write performance tests just to be sure, but for past experience I really expect it to be much faster. Also note that all document analysis is done inside the IndexWriter, so not using it concurrently is a bottleneck when the analysis is expensive (think about PDFs in Blobs..); during analysis the files are locked to a single thread/transaction but the time spent in really changing files is just a fraction, and Lucene has it's own proper locks in those points. cheers, Sanne

Emmanuel Bernard

3:48 p.m.

On Jun 10, 2008, at 07:17, Sanne Grinovero wrote:

...

Yes that's the goal for HSEARCH-152

...

My question is does it somehow involve interacting with the IndexWriter so that it does not do stuffs it would do otherwise. ie if I run indexing on VM1 and VM2, will copying files "from VM1" be affected by indexWriter operations from VM2

...

>>The explanation is correct but a bit cataclysmic. We open the reader *only* if there is an actual deletion for the given Directory provider. So in some systems we might very well almost never open a rw reader. Well yes it doesn't happen always, but not only for deletions: also all updates are split to to delete+insert work AFAIK. if you combine this with the open issue that the index is updated on every dirty entity (not checking if only the indexed fields are changed) this translates to lots of re- opening, twice for every transaction updating some entity... (still correct?)

Yes correct.

...

>> We need to check how efficient the implementation has been done. If it a mechanism similar to IndexModifier, it's not worth it. Agree, I'll try to discover that. >>This is what happens by default already, one index directory by entity type. We could compute a flag at init time to know that (it's alrady in place actually) and use it to use a Term query >> rather than the full query if only one entity is present in the index. Let's open a JIRA issue Well my purpose is to avoid the need to switch from reader to writer, this also simplifies the reordering code were you split all work in two sequences (would be unneded); For deletion and eviction we should check to see if the workspace has an indexwriter or indexreader available and use whatever we have available; but this would complicate code and don't like it, so I really hope we could really avoid the use of indexreader for modification operations; as we found already a good solutions for mass eviction and when the flag says it's ok to delete by id, there's just one case left we should think about. Should I open the JIRA for the partial solution as far as we got currently?

Yep

...

>> Not if it's updated in a cluster, right? >> Plus seeing the contention lock we have experienced (on IndexReader) in the recent test case, I want to be sure it's actually faster than opening every time. Why not? don't you update by using delete+insert in cluster too?

In a cluster, you open the IW do what you have to do protected by the global lock, then close the IW (and release the global lock) which means the IW on a other machine of the cluster can then have it. Does it work the same if you keep the IW opened? Does it release the global lock? Does it cope with other IWs updating the file system?

...

I'll write performance tests just to be sure, but for past experience I really expect it to be much faster. Also note that all document analysis is done inside the IndexWriter, so not using it concurrently is a bottleneck when the analysis is expensive (think about PDFs in Blobs..); during analysis the files are locked to a single thread/transaction but the time spent in really changing files is just a fraction, and Lucene has it's own proper locks in those points. cheers, Sanne

Sanne Grinovero

Thursday, 12 June Thu, 12 Jun

5 a.m.

...

> >> Does it somehow involve not having cluster changes (ie intra VM policy > rather than inter VM?) > I don't really understand what you're asking; I hope previous example > contains an answer; it just means you don't need > to lock the index to make a "hotcopy" to anywhere. > My question is does it somehow involve interacting with the IndexWriter so that it does not do stuffs it would do otherwise. ie if I run indexing on VM1 and VM2, will copying files "from VM1" be affected by indexWriter operations from VM2

My proposal to use Index snapshots is to avoid the DP lock when needing to copy somewhere, so indexing can go on without having to wait for file transfers: currently the FSMasterDirectoryProvider could take the lock for a long time. This would be useful in the Master-Slave configuration, where the Lock I want to release eagerly is VM-local as the master is the sole VM writing to the index; so I don't see how I could introduce a multiple-VM problem. Is it possible to combine cluster configurations described in reference 2.2.1 and 2.2.2? I mean are we meant to support a configuration having 2+ VMs acting as Master for the same JMS topic and using the same filesystem index? But I see your point, I'm going to verify if an IndexWriter holding a snapshot will keep it's file-level locks or not, even when being closed in the workspace.

...

> > >> Not if it's updated in a cluster, right? > >> Plus seeing the contention lock we have experienced (on IndexReader) in > the recent test case, I want to be sure it's actually faster than opening > every time. > Why not? don't you update by using delete+insert in cluster too? > In a cluster, you open the IW do what you have to do protected by the global lock, then close the IW (and release the global lock) which means the IW on a other machine of the cluster can then have it. Does it work the same if you keep the IW opened? Does it release the global lock? Does it cope with other IWs updating the file system?

So is Map<DirectoryProvider, ReentrantLock> getLockableDirectoryProviders() in SearchFactoryImpl returning distributed locks? Or are you telling me about Lucene's locks on the index? thanks for all help, Sanne

Emmanuel Bernard

6:54 a.m.

On Jun 12, 2008, at 06:00, Sanne Grinovero wrote:

...

>> Does it somehow involve not having cluster changes (ie intra VM policy rather than inter VM?) I don't really understand what you're asking; I hope previous example contains an answer; it just means you don't need to lock the index to make a "hotcopy" to anywhere. My question is does it somehow involve interacting with the IndexWriter so that it does not do stuffs it would do otherwise. ie if I run indexing on VM1 and VM2, will copying files "from VM1" be affected by indexWriter operations from VM2 My proposal to use Index snapshots is to avoid the DP lock when needing to copy somewhere, so indexing can go on without having to wait for file transfers: currently the FSMasterDirectoryProvider could take the lock for a long time. This would be useful in the Master-Slave configuration, where the Lock I want to release eagerly is VM-local as the master is the sole VM writing to the index; so I don't see how I could introduce a multiple-VM problem. Is it possible to combine cluster configurations described in reference 2.2.1 and 2.2.2? I mean are we meant to support a configuration having 2+ VMs acting as Master for the same JMS topic and using the same filesystem index?

Some people use it that way for some reasons I don't fully understand.

...

But I see your point, I'm going to verify if an IndexWriter holding a snapshot will keep it's file-level locks or not, even when being closed in the workspace. >> Not if it's updated in a cluster, right? >> Plus seeing the contention lock we have experienced (on IndexReader) in the recent test case, I want to be sure it's actually faster than opening every time. Why not? don't you update by using delete+insert in cluster too? In a cluster, you open the IW do what you have to do protected by the global lock, then close the IW (and release the global lock) which means the IW on a other machine of the cluster can then have it. Does it work the same if you keep the IW opened? Does it release the global lock? Does it cope with other IWs updating the file system? So is Map<DirectoryProvider, ReentrantLock> getLockableDirectoryProviders() in SearchFactoryImpl returning distributed locks? Or are you telling me about Lucene's locks on the index?

I am talking about the Lucene lock, not the extra protection Hibernate search provides.

...

thanks for all help, Sanne

6602

days inactive

6607

days old

hibernate-dev@lists.jboss.org

Manage subscription

5 comments

2 participants

tags (0)

participants (2)

Emmanuel Bernard
Sanne Grinovero

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

improving Search