After messing with SIMD stuff for a little, I think I now understand why
the industry went with xyzw instead of the mathematical wxyz. Anyway, this
will make for less pain in the future (assuming I got everything).
gl, sw and sw32 use blend palettes, so share the code. This also abandons
the optimization for transforming verts in sw (had all sorts of problems
anyway). sw still doesn't work, though.