
Yoann Rodière updated an issue

Change By:	Yoann Rodière

https://github.com/mincong-h/gsoc-hsearch/issues/146

Remaining issues:

# ~~ Shouldn't we expect the HQL/Criteria to produce an ordered list of IDs instead of what's done currently? In particular, it would allow partitioning, and it doesn't seem much harder to use. ~~ => Actually no, that's a bad idea, since we can't be sure that *all* IDs within the resulting bounds would be relevant. On the other hand, if we passed an offset and a limit to partitions instead of a first and last ID, we would solve many of the practical issues we're having (partitioning for HQL/Criteria, support for embedded IDs, ...). But maybe there was a particular reason to do it that way?
# Can this work with checkpoints? We should fix HSEARCH-2616, add checkpoints for the HQL and Criteria cases, and test them.
# The "maxResults" parameter is questionable:
* why would we only use it when HQL/Criteria is used? Couldn't we simply have something similar to {{org.hibernate.search.MassIndexer.limitIndexedObjectsTo(long)}}?
* why is the limit arbitrarily set to 1 million by default? This could come as a surprise to users.
# The following limitations should be documented:
* There's no query validation before the job starts. If the query is invalid, the job will fail to process the partition plan in the second step, produceLuceneDoc.
* Partitioning is disabled in the HQL approach. Why? Because parallel execution requires a selection ordered by entity ID (through Criteria or HQL), so that PartitionMapper can scroll through this selection, projected on the ID, and split it into multiple sub-selections, each delimited by a lower bound and an upper bound. In the HQL approach, however, the query is provided by the user, so there's no guarantee of any ordering.
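
To make the offset/limit idea from the first item more concrete, here is a minimal sketch of how a partition plan could be computed from a row count instead of from ID bounds. The names {{OffsetLimitPartitioning}}, {{PartitionRange}} and {{plan}} are hypothetical and do not exist in the project. The sketch also assumes that the query returns rows in a stable order across executions, which, as noted above, is not guaranteed for user-supplied HQL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the project.
class OffsetLimitPartitioning {

    /** One partition, described by an offset and a limit rather than first/last IDs. */
    static final class PartitionRange {
        final long offset;
        final long limit;

        PartitionRange(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }

        @Override
        public String toString() {
            return "offset=" + offset + ", limit=" + limit;
        }
    }

    /**
     * Splits totalRows into consecutive (offset, limit) ranges of at most
     * rowsPerPartition rows each. Assumes the underlying query has a stable order.
     */
    static List<PartitionRange> plan(long totalRows, long rowsPerPartition) {
        if (totalRows < 0 || rowsPerPartition <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and rowsPerPartition > 0");
        }
        List<PartitionRange> ranges = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += rowsPerPartition) {
            // The last partition may hold fewer rows than the others.
            long limit = Math.min(rowsPerPartition, totalRows - offset);
            ranges.add(new PartitionRange(offset, limit));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 rows with at most 4 rows per partition.
        for (PartitionRange range : plan(10, 4)) {
            System.out.println(range);
        }
    }
}
```

Each range could then be applied to the query through its first-result and max-results settings, so no assumption about the ID type (simple or embedded) would be needed. Whether this fits the existing PartitionMapper is an open question.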


This message was sent by Atlassian JIRA