On a related note, there is an interesting optimization going on in HCANN:
We should open a dedicated issue for this. However, I would also love to see a JMH benchmark that shows the actual effects.