| Thanks for your discussion about this topic! So we will trust the batch runtime to set a smart threshold value for the number of threads. On our side, we use unlimited threads by default for this job. On the user side, they can still configure their own value if needed. As for the other subject, rows-per-partition vs number-of-partitions, my first concern is to avoid the thread starvation: I want to keep all the threads busy as much as possible, which is achieved by cutting our job in many small partitions—each of which would take a small amount to finish. Once a thread finishes a task (partition), another task (partition) will be assigned by the batch runtime. But maybe Yoann Rodière's proposition, defining the number of partitions & number of threads, would be more efficient. I propose we wait for some user feedback, and discuss it later. |