It's a bit flaky for particles, especially at higher frame rates, but
that's due to supporting only 64 overlapping pixels. A reasonable
solution is probably switching to a priority heap for the "sort" and
upping the limit.
Mostly, this gets the stage flags in with the barrier, but also adds a
couple more barrier templates. It should make for slightly less verbose
code, and one less opportunity for error (mismatched barrier/stages).