This seems to be pretty close to as fast as it gets (might be able to do
better with some shuffles of the negation constants instead of loading
separate constants).
The calculation fails (produces NaN) if the vectors are anti-parallel,
but works for all other combinations. I came up with this implementation
when I discovered Unity's Quaternion.FromToRotation could did not work
with very small angles. This implementation will produce a usable
quaternion below 0.00255 degrees (though it will be slightly larger than
unit). Unity's failed such that I could see KSP's skybox snap while it
rotated around my test vessel.
The better accuracy is for specific cases (90 degree rotations around a
main axis: the matrix element for that axis is now 1 instead of
0.99999994). The speedup comes from doing fewer additions (multiply
seems to be faster than add for fp, at least in this situation).
After messing with SIMD stuff for a little, I think I now understand why
the industry went with xyzw instead of the mathematical wxyz. Anyway, this
will make for less pain in the future (assuming I got everything).
I got the idea from blender when I discovered by accident that quat * vect
produces the same result as quat * qvect * quat* and looked up the code to
check what was going on. While matrix/vector multiplication still beats the
pants off quaternion/vector multiplication, QuatMultVec is a slight
optimization over quat * qvect * quat* (17+,24* vs 24+,32*, plus no need to
to generate quat*).