If you run 10 requests against a single node, do you see the same differences?
You need to check where the time is spent or lost, i.e. whether each node can actually handle 10 requests in parallel.
Issues like this are difficult to track down because there are many possible causes.
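
As a starting point, you could time the requests from the client side. The sketch below is only an illustration: `fake_request` is a placeholder I made up (it just sleeps), and you would swap in a real call to one of your nodes. It fires 10 requests at once and reports the total wall time next to the per-request times.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request and return how long it took, in seconds."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start


def measure_parallel(send_request, n=10):
    """Fire n requests at once; return (wall_time, per_request_times)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        durations = list(pool.map(lambda _: timed_call(send_request), range(n)))
    wall = time.perf_counter() - start
    return wall, durations


def fake_request():
    # Placeholder (assumption): replace with a real call to one node.
    time.sleep(0.05)


if __name__ == "__main__":
    wall, durations = measure_parallel(fake_request, n=10)
    print(f"wall time for 10 parallel requests: {wall:.3f}s")
    print(f"slowest single request: {max(durations):.3f}s")
    print(f"fastest single request: {min(durations):.3f}s")
```

Compare the wall time with what one request takes on its own. If 10 parallel requests finish in roughly the time of one, the node is handling them in parallel. If they take closer to 10 times as long, the requests are probably being processed one after another somewhere, and that is where to look next.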