The better accuracy is for specific cases (90 degree rotations around a
main axis: the matrix element for that axis is now 1 instead of
0.99999994). The speedup comes from doing fewer additions (multiply
seems to be faster than add for fp, at least in this situation).
After messing with SIMD stuff for a little, I think I now understand why
the industry went with xyzw instead of the mathematical wxyz. Anyway, this
will make for less pain in the future (assuming I got everything).
I got the idea from blender when I discovered by accident that quat * vect
produces the same result as quat * qvect * quat* and looked up the code to
check what was going on. While matrix/vector multiplication still beats the
pants off quaternion/vector multiplication, QuatMultVec is a slight
optimization over quat * qvect * quat* (17+,24* vs 24+,32*, plus no need to
to generate quat*).