The Speex VAD sort of sucks, honestly, or I'm not using it right. Now
trying this algorithm instead, run on the denoised audio:
http://lists.xiph.org/pipermail/speex-dev/2006-March/004269.html
And I'll play around to find the threshold for considering a set of frames
to be "voice" from there.
Also worth noting: we measure the power of the set of frames as a whole, not
frame by frame, so you need to sustain power for 0.25 seconds at a time, or
it's not "voice."
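
A rough sketch of that whole-window check, for illustration only. It is not
the algorithm from the linked post and not part of the Speex API. The function
names, the 16-bit PCM input, and the use of mean squared amplitude as "power"
are all assumptions; the threshold is whatever the playing around ends up
finding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mean power (average of squared samples) over one
 * whole window of 16-bit PCM. The window is meant to cover the full set of
 * frames, e.g. 0.25 seconds = 2000 samples at an assumed 8 kHz rate. */
static double window_power(const int16_t *samples, size_t count)
{
    double sum = 0.0;
    size_t i;

    assert(samples != NULL || count == 0);
    if (count == 0)
        return 0.0;

    for (i = 0; i < count; i++) {
        double s = (double)samples[i];
        sum += s * s;
    }
    return sum / (double)count;
}

/* Hypothetical helper: returns 1 if the window as a whole reaches the
 * threshold, 0 otherwise. The threshold is a tuning value, not a known
 * constant. */
static int window_is_voice(const int16_t *samples, size_t count,
                           double threshold)
{
    return window_power(samples, count) >= threshold;
}
```

Because the decision is made once per window rather than once per frame, a
single loud frame gets averaged against the quieter frames around it.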