Now what that number needs to be for "optimal" is something that we can't guess and needs to be a tunable which can be controlled by end users.
I totally agree: we don't know anything about the execution environment, so we can't optimize this value. Which is why I would not try to find a value that will perform "best", but rather try to find a value that is the least likely to perform very badly.
Attempting to set "infinite" or other very high figures would not help performance at all, the reason to constrain such pipelines is to make optimal use of all resources.
Sure, but on the other hand setting it to some low value (such as 2) by default may hurt performance. Anyway... The default mandated by the spec (and implemented by both JBatch and JBeret) is to spawn as many threads as there are partitions, and this seems a bit... aggressive in our case, since we don't constrain the number of partitions (only their size). Maybe we could change the way we configure the job: instead of asking users for the number of rows per partition plus the number of threads, just ask for the number of partitions and (if different) the number of threads. The number of rows per partition offers very little anyway: it doesn't affect how frequently we write to the index (only the checkpoint interval does), and as long as it's higher than the database fetch size it shouldn't affect read performance much.
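For reference, the JSR-352 job XML already decouples these two numbers in the partition plan: `partitions` sets how many partitions exist, while the optional `threads` attribute caps how many run concurrently, and it defaults to the partition count, which is exactly the aggressive behavior described above. A sketch of what such a step could look like (the step id, artifact refs, and counts here are made-up placeholders, not our actual configuration):

```xml
<step id="massIndexingStep">
  <chunk item-count="200"> <!-- item-count is the checkpoint interval -->
    <reader ref="entityReader"/>   <!-- hypothetical reader artifact -->
    <writer ref="indexWriter"/>    <!-- hypothetical writer artifact -->
  </chunk>
  <partition>
    <!-- 16 partitions, but at most 4 executed concurrently;
         omitting threads would spawn 16 threads (the spec default) -->
    <plan partitions="16" threads="4"/>
  </partition>
</step>
```

Exposing those two attributes more or less directly (number of partitions, and optionally a lower thread count) would map cleanly onto what the spec already provides.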