Yoann Rodière (
https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%...
) *created* an issue
Hibernate Search (
https://hibernate.atlassian.net/browse/HSEARCH?atlOrigin=eyJpIjoiMmJjZGFm...
) / Improvement (
https://hibernate.atlassian.net/browse/HSEARCH-3878?atlOrigin=eyJpIjoiMmJ...
) HSEARCH-3878 (
https://hibernate.atlassian.net/browse/HSEARCH-3878?atlOrigin=eyJpIjoiMmJ...
) Maximize utilization of database connections during mass indexing (
https://hibernate.atlassian.net/browse/HSEARCH-3878?atlOrigin=eyJpIjoiMmJ...
)
Issue Type: Improvement
Assignee: Unassigned
Attachments: queueCount10_bulkSize250.svg
Components: mapper-orm
Created: 01/Apr/2020 08:19 AM
Fix Versions: 6.0.0.Beta-backlog-low-priority
Priority: Major
Reporter: Yoann Rodière (
https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%...
)
The mass indexer threads responsible for loading entities from the database currently have
a loop that looks like this:
* load X entities from the database
* submit indexing requests for each entity
* wait for indexing to finish for these entities
* repeat
The "wait" step means that the thread will not load anything from the database while the
backend is busy indexing. In effect, the thread holds on to a database connection without
using it during that time.
This also means that indexing doesn't execute in parallel with database loading, so the
execution time of mass indexing is probably close to the *sum* of the time spent loading
entities and the time spent indexing. Ideally, we'd want those two operations to happen in
parallel, so that the execution time of mass indexing is close to the *maximum* of the two
instead of the sum.
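For illustration, the current loop can be sketched roughly as follows. This is a minimal sketch, not the actual mass indexer code: `SequentialLoader`, `loadBatch` and `indexBatch` are hypothetical names, and `indexBatch` is assumed to submit one indexing request per entity and return a single future covering the whole batch.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.function.IntFunction;

// Hypothetical sketch of the current behavior: load, submit, wait, repeat.
final class SequentialLoader {
    static <E> void run(int batchCount,
            IntFunction<List<E>> loadBatch,                       // assumed: blocking database load
            Function<List<E>, CompletableFuture<?>> indexBatch) { // assumed: async indexing of a batch
        for (int i = 0; i < batchCount; i++) {
            List<E> entities = loadBatch.apply(i);
            CompletableFuture<?> indexing = indexBatch.apply(entities);
            // The loading thread idles here, so its database connection is unused
            // for the whole duration of indexing.
            indexing.join();
        }
    }
}
```

With this shape, at most one batch is ever pending, and loading and indexing never overlap.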
You can see the effect quite clearly in queueCount10_bulkSize250.svg (
https://hibernate.atlassian.net/secure/attachment/49646/49646_queueCount1...
), which is a Gantt chart of the indexing tasks in an Elasticsearch backend during mass
indexing. From time to time, almost all executors are idle, because entities have not been
submitted yet. If entity loading happened in parallel, the indexing executors would be
less likely to stay idle.
One solution to this problem would be to move to a loop like this:
* load X entities from the database (load #1)
* submit indexing requests for each entity
* load X entities from the database (load #2)
* submit indexing requests for each entity
* wait for indexing to finish for load #1
* load X entities from the database (load #3)
* submit indexing requests for each entity
* wait for indexing to finish for load #2
* load X entities from the database (load #4)
* submit indexing requests for each entity
* wait for indexing to finish for load #3
* ...
This should greatly reduce the amount of waiting in the loading thread, since we will be
loading the next batch of entities while the previous batch is being indexed. And this
preserves the characteristics of the previous algorithm: if database loading is much
faster than indexing (who knows...), we won't flood indexing queues with indexing
requests, and will only ever have at most two batches of entities pending at any given
time.
This would probably be rather easy to implement, since the completion of indexing is
modelled by CompletableFutures: we just have to store the future of the previous batch
somewhere, and wait for it to finish once the next batch has been submitted.
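A minimal sketch of what the proposed loop could look like, under the same kind of assumptions: `PipelinedLoader`, `loadBatch` and `indexBatch` are hypothetical names rather than existing Hibernate Search APIs, and `indexBatch` is assumed to return one future that completes when every entity of the batch has been indexed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.function.IntFunction;

// Hypothetical sketch of the proposed loop: wait for batch N-1 only after batch N is submitted.
final class PipelinedLoader {
    static <E> void run(int batchCount,
            IntFunction<List<E>> loadBatch,                       // assumed: blocking database load
            Function<List<E>, CompletableFuture<?>> indexBatch) { // assumed: async indexing of a batch
        // Nothing is pending before the first batch, so start from a completed future.
        CompletableFuture<?> previous = CompletableFuture.completedFuture(null);
        for (int i = 0; i < batchCount; i++) {
            List<E> entities = loadBatch.apply(i);                     // load batch i
            CompletableFuture<?> current = indexBatch.apply(entities); // submit it, do not wait yet
            previous.join();    // wait for batch i-1, which was indexing while batch i was loading
            previous = current; // batch i is now the only pending batch
        }
        previous.join(); // wait for the final batch before returning
    }
}
```

Because the stored future is joined right after each submission, no more than two batches are pending at any time, which keeps the back-pressure of the current algorithm.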