Just head and tail are atomic, but it seems to work nicely (at least on
intel). I actually had more trouble with gcc (due to accidentally
testing lock-free with the wrong ring buffer... oops, but yup, gcc will
happily optimize your loop to spin really really fast). Also served as a
nice test for C11 threading.