the problem is that stb_image can and will allocate much more memory than it needs to
e.g. for a 2048x2048 BGR image:
it allocates an unnecessary intermediate 12 MB buffer (3 bytes per pixel) to decode the image
instead of decoding it directly into the final 16 MB RGBA buffer (4 bytes per pixel)
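
a minimal sketch of the numbers above and of what "decoding directly" could look like
the names image_bytes and bgr_row_to_rgba are made up for illustration; they are not stb_image or CNQ3 functions
it assumes tightly packed rows and an opaque alpha for 3-channel sources

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a tightly packed image (hypothetical helper). */
static size_t image_bytes(size_t width, size_t height, size_t channels)
{
    return width * height * channels;
}

/*
 * Expand one row of BGR source pixels straight into the final RGBA row.
 * No intermediate 3-channel image is allocated: each source pixel is
 * swizzled and written to its destination as soon as it is read.
 * (hypothetical helper; assumes src holds width*3 bytes, dst width*4)
 */
static void bgr_row_to_rgba(const uint8_t* src, uint8_t* dst, size_t width)
{
    assert(src != NULL && dst != NULL);
    for (size_t x = 0; x < width; ++x) {
        dst[4 * x + 0] = src[3 * x + 2]; /* R */
        dst[4 * x + 1] = src[3 * x + 1]; /* G */
        dst[4 * x + 2] = src[3 * x + 0]; /* B */
        dst[4 * x + 3] = 255;            /* A: assumed opaque */
    }
}
```

image_bytes(2048, 2048, 3) gives the 12 MB intermediate size and image_bytes(2048, 2048, 4) the 16 MB final size
calling bgr_row_to_rgba once per row means only the final buffer has to exist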
the old CNQ3 code didn't decode greyscale properly because of a missing macro call
it also didn't range-check memory accesses at all
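
a rough sketch of what range-checked reads from the file data could look like
byte_reader, reader_get_u8 and reader_get_bytes are hypothetical names, not the actual CNQ3 code
the idea is only that every read is validated against the buffer size before touching memory

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor over the raw file data. Invariant: pos <= size. */
typedef struct {
    const uint8_t* data;
    size_t size;
    size_t pos;
} byte_reader;

/* Read one byte; returns 1 on success, 0 if the read would go out of range. */
static int reader_get_u8(byte_reader* r, uint8_t* out)
{
    assert(r->pos <= r->size);
    if (r->pos >= r->size)
        return 0;
    *out = r->data[r->pos++];
    return 1;
}

/* Read count bytes; on failure nothing is copied and pos is left unchanged. */
static int reader_get_bytes(byte_reader* r, uint8_t* out, size_t count)
{
    assert(r->pos <= r->size);
    /* size - pos cannot underflow thanks to the invariant above */
    if (count > r->size - r->pos)
        return 0;
    memcpy(out, r->data + r->pos, count);
    r->pos += count;
    return 1;
}
```

comparing count against the remaining size (instead of pos + count against size) avoids integer overflow on huge counts
a truncated or malicious file then makes the decoder fail cleanly instead of reading past the end of the buffer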