[hibernate-issues] [JIRA] (HSEARCH-3323) Search 6 groundwork - Restore support for scrolling

Wed May 6 03:58:00 EDT 2020

Yoann Rodière ( https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%3A58fa1ced-171a-4c00-97e8-5d70d442cc4b ) *updated* an issue

Hibernate Search ( https://hibernate.atlassian.net/browse/HSEARCH?atlOrigin=eyJpIjoiZWJiZDk3NWU3OTIxNDYzOGI5ZGY5MmY3YjY3OTNmY2YiLCJwIjoiaiJ9 ) / Task HSEARCH-3323 ( https://hibernate.atlassian.net/browse/HSEARCH-3323?atlOrigin=eyJpIjoiZWJiZDk3NWU3OTIxNDYzOGI5ZGY5MmY3YjY3OTNmY2YiLCJwIjoiaiJ9 ) Search 6 groundwork - Restore support for scrolling

Change By: Yoann Rodière ( https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%3A58fa1ced-171a-4c00-97e8-5d70d442cc4b )

h3. Goal

Restore the scroll feature exposed in Search 5 through {{org.hibernate.search.query.hibernate.impl.FullTextQueryImpl#scroll()}}.

h3. API

All located in the {{org.hibernate.search.engine.search.query}} package.

{code}
public interface SearchFetchable<H> {

// ... there's already some code here...

// Add this (+ javadoc):
// Throws IllegalArgumentException if passed 0 or less (see the class Contracts).
SearchScroll<H> scroll(Integer pageSize);

// Add this (+ javadoc):
// Throws IllegalArgumentException if passed 0 or less for pageSize (see the class Contracts).
// Throws IllegalArgumentException if passed less than 0 for offset (see the class Contracts).
// TODO maybe it's not possible to implement this efficiently for Elasticsearch (not sure it accepts an offset when scrolling is enabled). In that case, remove this method.
SearchScroll<H> scroll(Integer offset, Integer pageSize);

}
{code}
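The argument checks described in the comments above could look roughly like the following. This is only a sketch: {{ScrollArgumentChecks}} and its two methods are hypothetical names standing in for whatever the {{Contracts}} class actually provides, and rejecting {{null}} is an assumption, since the description does not say how a {{null}} {{Integer}} should be handled.

```java
// Hypothetical stand-in for the checks delegated to the Contracts class.
final class ScrollArgumentChecks {

    private ScrollArgumentChecks() {
    }

    // For pageSize: rejects zero and negative values (and null, by assumption).
    static int requireStrictlyPositive(Integer value, String name) {
        if (value == null || value <= 0) {
            throw new IllegalArgumentException(
                    "'" + name + "' must be strictly positive, got: " + value);
        }
        return value;
    }

    // For offset: rejects negative values (and null, by assumption); zero is allowed.
    static int requirePositiveOrZero(Integer value, String name) {
        if (value == null || value < 0) {
            throw new IllegalArgumentException(
                    "'" + name + "' must be positive or zero, got: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(requireStrictlyPositive(20, "pageSize")); // prints 20
        System.out.println(requirePositiveOrZero(0, "offset")); // prints 0
        try {
            requireStrictlyPositive(0, "pageSize");
        }
        catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

An implementation of {{scroll(offset, pageSize)}} would call both checks before doing any work, so that invalid arguments fail fast rather than on the first page fetch.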

{code}
// This will be used like this:
// try (SearchScroll<H> scroll = query.scroll(20)) {
//   for (SearchScrollResult<H> page = scroll.next(); page.hasHits(); page = scroll.next()) {
//     List<H> hits = page.getHits();
//     // ... do something with the page ...
//   }
// }
public interface SearchScroll<H> extends AutoCloseable {

@Override
void close();

// TODO: javadoc
// Returns the next page, with at most "pageSize" hits ("pageSize" defined in the call to query.scroll()).
// May return a result with less than "pageSize" elements if only that many hits are left.
// This should *not* rely on pre-fetching. Fetching should happen when this method is called, not before.
// This is necessary if we want to make it easy for users to clear the ORM session between two pages.
// Note there is no "hasNext" method precisely because we do not do pre-fetching.
SearchScrollResult<H> next();

}
{code}

{code}
public interface SearchScrollResult<H> {

// TODO: javadoc
// This returns true if there are still hits, false otherwise.
// Note hasHits() == true && getHits().isEmpty() *is possible*, in particular if matching entities could not be found in the database.
// This method is mainly useful as a stop condition in loops.
boolean hasHits();

// TODO: javadoc
List<H> getHits();

// TO BE CHECKED: these may not be implementable efficiently.
// First, let's check if Elasticsearch returns the total hit count/aggregations to the first search API call when scrolling is enabled.
// If it does, let's check the performance impact... Getting this information might require executing the search query twice, in which case I'd rather not expose this information here and instead require users to execute the search query twice, explicitly.
// Note that *if* we end up implementing these methods, they will return the same data for every single page.
long getTotalHitCount();
<A> A getAggregation(AggregationKey<A> key);

// TO BE DISCUSSED: if we add this, it will probably be better to wrap this information into a SearchExecutionMetadata object, and implement getLastExecutionMetadata() here.
// As a first step, I would not implement this and would just create a ticket about it.
Duration getTook();
boolean isTimedOut();

}
{code}
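To make the contract of {{next()}} and {{hasHits()}} more concrete, here is a minimal in-memory sketch. It is not the proposed implementation: the nested interfaces are simplified copies of the ones above (no hit count, aggregations or timing), and {{InMemoryScrollSketch}}, {{scrollOver}} and {{pageSizes}} are hypothetical names. Each page is computed only when {{next()}} is called, so nothing is pre-fetched.

```java
import java.util.ArrayList;
import java.util.List;

final class InMemoryScrollSketch {

    // Simplified copies of the proposed interfaces.
    interface SearchScrollResult<H> {
        boolean hasHits();
        List<H> getHits();
    }

    interface SearchScroll<H> extends AutoCloseable {
        @Override
        void close();
        SearchScrollResult<H> next();
    }

    // Toy "backend": scrolls over an in-memory list, one page per call to next().
    static <H> SearchScroll<H> scrollOver(List<H> allHits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be strictly positive");
        }
        return new SearchScroll<H>() {
            private int position = 0;
            private boolean closed = false;

            @Override
            public SearchScrollResult<H> next() {
                if (closed) {
                    throw new IllegalStateException("scroll is closed");
                }
                // The page is built here, on demand, never ahead of time.
                int end = Math.min(position + pageSize, allHits.size());
                List<H> page = new ArrayList<>(allHits.subList(position, end));
                position = end;
                // In this toy, "has hits" simply means "page is not empty".
                // A real backend may report hits even if entity loading found none.
                boolean hasHits = !page.isEmpty();
                return new SearchScrollResult<H>() {
                    @Override
                    public boolean hasHits() {
                        return hasHits;
                    }

                    @Override
                    public List<H> getHits() {
                        return page;
                    }
                };
            }

            @Override
            public void close() {
                closed = true;
            }
        };
    }

    // Consumes a scroll with the loop from the description; returns the size of each page.
    static <H> List<Integer> pageSizes(List<H> allHits, int pageSize) {
        List<Integer> sizes = new ArrayList<>();
        try (SearchScroll<H> scroll = scrollOver(allHits, pageSize)) {
            for (SearchScrollResult<H> page = scroll.next(); page.hasHits(); page = scroll.next()) {
                sizes.add(page.getHits().size());
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(pageSizes(List.of("a", "b", "c", "d", "e"), 2)); // prints [2, 2, 1]
    }
}
```

The last page is smaller than {{pageSize}}, and the loop stops on the first result for which {{hasHits()}} is false, which is why no {{hasNext()}} method is needed.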

h3. To-do list

In order:

# Add APIs, with stub implementations ({{throw new UnsupportedOperationException( "Not yet implemented" );}}).
## Ignore getTotalHitCount/getAggregation/getTook/isTimedOut for now.
# Copy-paste {{org.hibernate.search.integrationtest.backend.tck.search.query.SearchQueryFetchIT}} to {{SearchQueryScrollIT}} and adapt it to test scrolling.
## Don't forget to test edge cases: not fetching any result (should work fine), fetching some results but not all of them (should work fine), trying to fetch more than the total hit count (should throw an exception).
## Don't forget to check that {{hasHits()}} returns the correct information.
# Add tests for timeouts (failAfter/truncateAfter) when scrolling.
# Implement scrolling for the stub backend.
# Add tests to the ORM mapper. Will probably need to copy/paste {{org.hibernate.search.integrationtest.mapper.orm.search.loading.SearchQueryEntityLoadingBaseIT}} and adapt it to test loading when calling {{scroll()}} instead of just loading when calling {{fetch()}}.
# Implement scrolling for Elasticsearch.
## This should be easy enough: the first call to fetch*() will execute a search work with the {{scroll}} parameter set, the next calls will execute a scroll work (already implemented, see {{org.hibernate.search.elasticsearch.work.impl.factory.ElasticsearchWorkFactory#scroll}}).
## On close, we will execute a clearScroll work (already implemented, see {{org.hibernate.search.elasticsearch.work.impl.factory.ElasticsearchWorkFactory#clearScroll}}).
# Implement scrolling for Lucene.
## Search 5 code will not be very useful in that regard, as it addresses a lot of problems that are no longer relevant in Search 6.
## Basically, in the SearchScroll implementation we will need to keep around some of the context that we currently store as local variables in {{LuceneSearcherImpl#search}}: the {{IndexSearcher}} and the {{LuceneCollectors}} instance in particular.
## When calling {{next()}}:
### First we will need to update the topDocs if necessary, i.e. if they do not yet include the next page.
#### See {{org.hibernate.search.query.engine.impl.QueryHits#scoreDoc}} for how to decide how many topDocs to retrieve
#### See phase 1 in {{org.hibernate.search.backend.lucene.search.extraction.impl.LuceneCollectors#collect}}, but *only phase 1*
### Then we will need to collect information for the next page; see the call to {{extractTopDocs}} and phase 2 in {{org.hibernate.search.backend.lucene.search.extraction.impl.LuceneCollectors#collect}}.
## This may prove difficult; maybe let's organize a pair-programming session for that?
# Add Lucene-specific extensions to Scrolling
## This is mainly necessary for Infinispan
## Expose a way to force Lucene to extract TopDocs up to a specific index and retrieve them: {{LuceneSearchScroll#preloadTopDocsUpTo(), returns TopDocs}}
## Expose a way to load a specific document specified by its index: {{LuceneSearchScroll#loadHitByIndex(), returns H}}
## Maybe we can improve on that later; ideally Infinispan should load multiple hits in one call ({{LuceneSearchScroll#loadHitsByIndex(int ...), returns List<H>}}) otherwise the cost of creating collectors for each retrieved hit will be a bit too much.
# Implement {{scroll()}} and {{scroll(ScrollMode)}} in {{HibernateOrmSearchQueryAdapter}}, relying on {{scrollAll()}} under the hood.
## Only {{ScrollMode.FORWARD_ONLY}} will be supported.
## We will need to decide on a page size. Let's use the same size as the loading fetch size, which should be accessible from {{org.hibernate.search.mapper.orm.search.loading.impl.MutableEntityLoadingOptions#getFetchSize}}.
## Some internal windowing will probably be necessary. Just copy/paste the {{org.hibernate.search.elasticsearch.util.impl.Window}} class from Search 5 and adapt it. Do not forget to also copy the unit test, {{org.hibernate.search.elasticsearch.test.WindowTest}}.
## See {{org.hibernate.search.query.hibernate.impl.ScrollableResultsImpl}} for an example of how it was done in Search 5 (may or may not be helpful).
# Add tests for {{scroll()}} and {{scroll(ScrollMode)}} in {{org.hibernate.search.integrationtest.mapper.orm.hibernateormapis.ToHibernateOrmIT}}:
## Nominal case (create scroll, fetch some hits until all hits have been consumed, close).
## Edge cases: not fetching any result (should work fine), fetching some results but not all of them (should work fine), trying to fetch more than the total hit count (should throw an exception).
## Error cases: trying to scroll back, trying to call the {{get*(int)}} methods...
## Check that using any scroll mode other than ScrollMode.FORWARD_ONLY fails.
## Test {{query.stream()}} too (it's based on {{scroll()}}).
# Add tests for {{getResultStream()}} in {{org.hibernate.search.integrationtest.mapper.orm.hibernateormapis.ToJpaIT}}.
# Allow backends to extend the SearchScroll interfaces, like they currently do with {{SearchQuery}} ({{ElasticsearchSearchQuery}}, {{LuceneSearchQuery}}):
## Add a generic parameter {{S extends SearchScroll<H>}} to {{ExtendedSearchFetchable}} and override its {{scroll}} methods to return that type.
## Adapt the interfaces that extend {{ExtendedSearchFetchable}} as necessary.
## Create a new {{ExtendedSearchScroll<H>}} interface using the same principle.
## Create specific interfaces for Elasticsearch and Lucene: {{ElasticsearchSearchScroll}} and {{LuceneSearchScroll}}.
## Implement these interfaces where appropriate.
## Test extensions for Lucene and Elasticsearch. Mainly, check that the scroll has the correct type. See how it's done for SearchResult in {{org.hibernate.search.integrationtest.backend.elasticsearch.ElasticsearchExtensionIT#query}}.
# Add getTotalHitCount/getAggregation to APIs if relevant and implement them.
# Add getTook/isTimedOut to APIs if relevant and implement them.
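
For the forward-only {{scroll()}} adapter mentioned in the list above, the internal windowing could be as simple as keeping the current page and an index into it. The following is only a sketch under several assumptions: {{ForwardOnlyCursor}}, {{pagesOf}} and {{drain}} are hypothetical names, the page source is modelled as a plain {{Supplier}} rather than a real {{SearchScroll}}, and an empty page is taken to mean "no more hits" (the real {{hasHits()}} contract is looser than that). It is not a port of the Search 5 {{Window}} class.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

final class ForwardOnlyCursorSketch {

    // Hypothetical hit-by-hit cursor backed by a page-by-page source.
    static final class ForwardOnlyCursor<H> {
        private final Supplier<List<H>> nextPage;
        private List<H> currentPage = List.of();
        private int indexInPage = -1;
        private boolean exhausted = false;

        ForwardOnlyCursor(Supplier<List<H>> nextPage) {
            this.nextPage = nextPage;
        }

        // Moves to the next hit, fetching a new page only when the current one is used up.
        boolean next() {
            if (exhausted) {
                return false;
            }
            indexInPage++;
            if (indexInPage >= currentPage.size()) {
                currentPage = nextPage.get();
                indexInPage = 0;
                if (currentPage.isEmpty()) {
                    // Assumption of this sketch: an empty page means the end.
                    exhausted = true;
                    return false;
                }
            }
            return true;
        }

        // Returns the hit the cursor is currently positioned on.
        H get() {
            if (exhausted || indexInPage < 0) {
                throw new IllegalStateException("cursor is not positioned on a hit");
            }
            return currentPage.get(indexInPage);
        }

        // Scrolling back is not supported in forward-only mode.
        boolean previous() {
            throw new UnsupportedOperationException("only forward scrolling is supported");
        }
    }

    // Demo page source: serves a fixed list in pages of at most pageSize elements.
    static <H> Supplier<List<H>> pagesOf(List<H> all, int pageSize) {
        Iterator<H> iterator = all.iterator();
        return () -> {
            List<H> page = new ArrayList<>();
            while (page.size() < pageSize && iterator.hasNext()) {
                page.add(iterator.next());
            }
            return page;
        };
    }

    // Reads every remaining hit from the cursor, in order.
    static <H> List<H> drain(ForwardOnlyCursor<H> cursor) {
        List<H> hits = new ArrayList<>();
        while (cursor.next()) {
            hits.add(cursor.get());
        }
        return hits;
    }

    public static void main(String[] args) {
        ForwardOnlyCursor<String> cursor =
                new ForwardOnlyCursor<>(pagesOf(List.of("a", "b", "c", "d", "e"), 2));
        System.out.println(drain(cursor)); // prints [a, b, c, d, e]
    }
}
```

Keeping only one page in memory matches the no-pre-fetching rule of {{SearchScroll#next()}}; supporting anything other than forward scrolling would require retaining earlier pages, which is one reason to reject the other scroll modes.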
