a static `stack') to avoid the function call overhead. This cuts about 40%
of the execution time from this function.
Of everything I tried, __builtin_expect gave the best results, so make
sure non-gcc compilers do the right thing when they encounter it.
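
One common way to keep non-gcc compilers happy is to hide __builtin_expect
behind a pair of macros that collapse to the plain condition when the builtin
is not available. The following is only a sketch under that assumption: the
macro names likely/unlikely and the function count_nonzero are hypothetical
and are not taken from this code.

```c
#include <stddef.h>

/*
 * Assumption: gcc and clang define __GNUC__ and provide __builtin_expect.
 * Any other compiler gets the bare condition, so the code still builds
 * and behaves the same, just without the branch hint.
 */
#if defined(__GNUC__) || defined(__clang__)
# define likely(x)   __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
#else
# define likely(x)   (!!(x))
# define unlikely(x) (!!(x))
#endif

/*
 * Hypothetical hot loop used only to illustrate the macros: a NULL input
 * is marked as the rare case, a non-zero entry as the common case.
 */
static size_t count_nonzero(const int *values, size_t n)
{
    size_t i, count = 0;

    if (unlikely(values == NULL))
        return 0;

    for (i = 0; i < n; i++) {
        if (likely(values[i] != 0))
            count++;
    }
    return count;
}
```

The hint only affects how the compiler lays out the branches; the result of
the function is identical with or without it, which is what makes the
fallback definition safe.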