As of a recent nvidia driver update, it became necessary to enable the
feature. I guess older drivers (or vulkan validation layers?) were a bit
slack in their checking (or perhaps I didn't actually get those lighting
changes working at the time despite having committed them).
This did involve changing some field names and a little bit of cleanup,
but I've got a better handle on what's going on (I think I was in one of
those coding trances where I quickly forget how things work).
Just head and tail are atomic, but it seems to work nicely (at least on
intel). I actually had more trouble with gcc (due to accidentally
testing lock-free with the wrong ring buffer... oops, but yup, gcc will
happily optimize your loop to spin really really fast). Also served as a
nice test for C11 threading.
This makes bsp traversal even more re-entrant (needed for shadows).
Everything needed for a "pass" is taken from bsp_pass_t (or indirectly
through bspctx_t (though that too might need some revising)).
Ambient lights are represented by a point at infinity and a zero
direction vector (spherical lights have a non-zero direction vector but
the cone angle is 360 degrees). This fixes what appeared to be mangled
light renderers (it was actually just an ambient light being treated as
a directional light (point at infinity, but non-zero direction vector).
There are some issues with the light renderers getting mangled, and only
the first light is even touched (just begin and end of render pass), but
this gets a lot of the framework into place.
Sounds odd, but it's part of the problem with calling two different
things with essentially the same name. The "high level" render pass in
question may be a compute pass, or a complex series of (Vulkan) render
passes and so won't create a Vulkan render pass for the "high level"
render pass (I do need to come up with a better name for it).
I really don't remember why I made it separate, though it may have been
to do with r_ent_queue. However, putting it together with the rest is
needed for the "render pass" rework.
It now lives in vulkan_renderpass.c and takes most of its parameters
from plist configs (just the name (which is used to find the config),
output spec, and draw function from C). Even the debug colors and names
are taken from the config.
QFV_CreateRenderPass is no longer used, and QFV_CreateFramebuffer hasn't
been used for a long time. The C file is still there for now but is
basically empty.
The real reason for the delay in implementing support for pNext is I
didn't know how to approach it at the time, but with the experience I've
gained using and modifying vkparse, the solution turned out to be fairly
simple. This allows for the use of various extensions (eg, multiview,
which was used for testing, though none of the hookup is in this
commit). No checking is done on the struct type being valid other than
it must be of a chainable type (ie, have its own pNext).
The software renderer uses Bresenham's line slice algorithm as presented
by Michael Abrash in his Graphics Programming Black Book Special Edition
with the serial numbers filed off (as such, more just so *I* can read
the code easily), along with the Chen-Sutherland line clipping
algorithm. The other renderers were more or less trivial in comparison.
Enabled by 'developer lighting'. It was good for confirming that the
lights in ad_e1m1 (Doom Hangar 16) were actually being output (over 600
of them sometimes, ouch). Turned out to be the color scale ambiguity.
The pitch cvars are taken from quakespasm because I ran into a button I
couldn't shoot with the 80 degree limit, but I figured I'd add roll
limits while I was at it.
Surfaces marked with SURF_DRAWALPHA but not SURF_DRAWTURB are put in a
separate queue for the water shader and run with a turb scale of 0.
Also, entities with colormod alpha < 1 are marked to go in the same
queue as SURF_DRAWALPHA surfaces (ie, no SURF_DRAWTURB unless the
model's texture indicated such).
Textures whose names start with a { are meant to be rendered with
transparency. Surfaces using those textures are marked with
SURF_DRAWALPHA.
Unfortunately, the mip levels of ad_tears' transparent textures use the
wrong color so only the highest LOD works properly, but those textures
are meant to be loaded from external files anyway, it seems.
A listener is used instead of (really, as well as) ie_app_window events
because systems that need to know about windows sizes may not have
anything to do with input and the event system.
This breaks console scaling for now (con_width and con_height are gone),
but is a major step towards window resize support as console stuff
should never have been in viddef_t in the first place.
The client screen init code now sets up a screen view (actually the
renderer's scr_view) that is passed to the client console so it can know
the size of the screen. The same view is used by the status bar code.
Also, the ram/cache/paused icon drawing is moved into the client screen
update code. A bit of duplication, but I do plan on merging that
eventually.
Things seem to be at least close to the right place now.
Input line handling has been made more object-oriented in that the
collection of objects required for a single input line (command, say,
say_team) are bundled into one object with just one set of handlers for
resize and draw. Much tidier.
view_new sets the geometry, but any setgeometry that need a valid data
pointer would get null. It might be better to always have the data
pointer, but I didn't feel like doing such a change at this stage as
there are quite a lot of calls to view_new. Thus view_new_data which
sets the data pointer before calling setgeometry.
More tuning is needed on the actual splits as it falls over when the
lower rect gets too low for the subrects being allocated. However, the
scrap allocator itself will prefer exact width/height fits with larger
cutoff over inexact cuts with smaller cutoff. Many thanks to tdb for the
suggestions.
Fixes the fps dropping from ~3700fps down to ~450fps (cumulative due to
loss of POT rounding and very poor splitting layout), with a bonus boost
to about 4900fps (all speeds at 800x450). The 2d sprites were mostly ok,
but the lightmaps forming a capital gamma shape in a 4k texture really
hurt. Now the lightmaps are a nice dense bar at the top of the texture,
and 2d sprites are pretty good (slight improvement coming next).
This replaces old_console_t with con_buffer_t for managing scrollback,
and draw_charbuffer_t for actual character drawing, reducing the number
of calls into the renderer. There are numerous issues with placement and
sizing, but the basics are working nicely.
I really don't know why I tried to do ring-buffers without gaps, the
code complication is just not worth the tiny savings in memory. In fact,
just the switch from pointers to 32-bit offsets saves more memory than
not having gaps (on 64-bit systems, no change on 32-bit).
I've decided that appending to a full single-line buffer should simply
scroll through the existing text. Unsurprisingly, the existing code
doesn't handle the situation all that well. While I've already got a fix
for it, I think I've got a better idea that will handle full buffers
more gracefully.
This fixes the current line object getting corrupted by the tail line
update when the buffer is filled with a single line. There are probably
more tests to write and bugs to fix :)
I was looking through the code for Con_BufferAddText trying to figure
out what it was doing (answer: ring buffer for both text and lines) and
got suspicious about its handling of the line objects. I decided an
automated test was in order. It turns out I was right: filling the
buffer with a single long line causes the tail line to trample the
current line, setting its pointer and length to 0 when the final
character is put in the buffer.
It handles basic cursor motion respecting \r \n \f and \t (might be a
problem for id chars), wraps at the right edge, and automatically
scrolls when the cursor tries to pass the bottom of the screen.
Clearing the buffer resets its cursor to the upper left.
QFS_LoadFile closes its file argument (this is a design error resulting
from changing QFS_LoadFile to take a file instead of a path and not
completing the update), resulting in the call to Qfilesize accessing
freed memory.
This is intended for the built-in 8x8 bitmap characters and quake's
"conchars", but could potentially be used for any simple (non-composed
characters) mono-spaced font. Currently, the buffers can be created,
destroyed, cleared, scrolled vertically in either direction, and
rendered to the screen in a single blast.
One of the reasons for creating the buffer is to make it so scaling can
be supported in the sw renderer.
PR_Debug_ValueString prints the value at the given offset using the
provided type to format the string. The formatted string is appended to
the provided dstring.
PR_Debug_ValueString prints the value at the given offset using the
provided type to format the string. The formatted string is appended to
the provided dstring.
For whatever reason, building under MXE (for windows) causes FLAC to try
to use dll import references, but setting FLAC__NO_DLL before including
FLAC/export.h fixes the issue.
For whatever reason, building under MXE (for windows) causes FLAC to try
to use dll import references, but setting FLAC__NO_DLL before including
FLAC/export.h fixes the issue.
While this does pull the grovelling for the subpic out to the callers,
the real problem is the excessive use of qpic_t in the internal code:
qpic_t is really just the image format in wad files, and shouldn't be
used as a generic image handle.
Cleans up more of the icky code in the font drawing functions.
This makes working with quads, implied alpha quads, and lines much
cleaner (and gets rid of the bulk of the "eww" fixme), and will probably
make it easier to support multiple scraps and fonts, and potentially
more flexible ordering between pipelines.
-describe is sent to the object, and the returned string passed back.
There is a worry about the lifetime of the returned string as there's
currently no way of both ensuring it doesn't get freed prematurely and
ensuring it does eventually get freed.
If no handler has been registered, then the corresponding parameter is
printed as a pointer but with surrounding brackets (eg, [0xfc48]). This
will allow the ruamoko runtime to implement object printing.
-describe is sent to the object, and the returned string passed back.
There is a worry about the lifetime of the returned string as there's
currently no way of both ensuring it doesn't get freed prematurely and
ensuring it does eventually get freed.
If no handler has been registered, then the corresponding parameter is
printed as a pointer but with surrounding brackets (eg, [0xfc48]). This
will allow the ruamoko runtime to implement object printing.
This means that QF should support more exotic fonts without any issue
(once the rest of the text handling system is up to snuff) as HarfBuzz
does all the hard work of handling OpenType, Graphite, etc text shaping,
including kerning (when enabled).
Also, font loading now loads all the glyphs into the atlas (preload is
gone).
While the results are a little surprising (tends to alternate between
left side and top for allocations), there is much less wasted space in
the partially allocated regions, and the main free region seems to
always be quite big.
While VRect_Difference worked for subrect allocation, it wasn't ideal as
it tended to produce a lot of long, narrow strips that were difficult to
reuse and thus wasted a lot of the super-rectangle's area. This is
because it always does horizontal splits first. However, rewriting
VRect_Difference didn't seem to be the best option.
VRect_SubRect (the new function) takes only width and height, and splits
the given rectangle such that if there are two off-cuts, they will be
both the minimum and maximum possible area. This does seem to make for
much better utilization of the available area. It's also faster as it
does only the two splits, rather than four.
It is currently an ugly hack for dealing with the separate quad queue,
and the pipeline handling code needs a lot of cleanup, but it works
quite well, though I do plan on moving to HarfBuzz for text shaping. One
nice development is I got updating of descriptor sets working (just need
to ensure the set is no longer in use by the command queue, which the
multiple frames in flight makes easy).
It's implemented only in the Vulkan renderer, partly because there's a
lot of experimenting going on with it, but the glyphs do get transferred
to the GPU (checked in render doc). No rendering is done yet: still
thinking about whether to do a quick-and-dirty test, or to add HarfBuzz
immediately, and the design surrounding that.
R8G8B8A8 was hard-coded by accident when creating Vulkan_LoadTexArray
(or probably even the original Vulkan_LoadTex). This wasn't a problem
while everything was loaded in that format, but attempting to load an R8
texture didn't go so well. The same format as the image itself is used
now (correctly so).
I have recently learned that pre-multiplied alpha is the correct way to
do compositing, which is pretty much what the 2d pass does (actually,
all passes, but...). This required ensuring the color factor passed to
the fragment shader is pre-multiplied (a little silly for cshifts as
they used to be pre-multiplied but were un-pre-multiplied early in QF's
history and I don't feel like fixing that right now as it affects all
renderers), and also pre-multiplying alpha when converting from 8-bit
palette to rgba as the palette entry for transparent has that funky pink
(which is used in full-brights).
I will need to do more work to improve the 2d allocation, but rounding
up the requested sizes to the next power of two proved to be excessively
wasteful: I was able to allocate spots for only half of the sub-pics I
needed (though I did still need to double the number of pixels in the
end).
The software renderer uses Bresenham's line slice algorithm as presented
by Michael Abrash in his Graphics Programming Black Book Special Edition
with the serial numbers filed off (as such, more just so *I* can read
the code easily), along with the Chen-Sutherland line clipping
algorithm. The other renderers were more or less trivial in comparison.
Due to the mis-initialization of the union used to parse the color
vector, the intensity was incorrectly set to zero and thus the light
dropped, meaning that all lights in ad_tears were lost.
The extend instruction is for loading narrower data types into wider
data types, eg, single element into 2, 3, or 4 element types, with a
small set of extension schemes: 0, 1, -1, copy (for 1->any and 2 -> 4).
Possibly most importantly, it works with unaligned data.
Progress towards #30
Most were pretty easy and fairly logical, but gib's regex was a bit of a
pain until I figured out the real problem was the conditional
assignments.
However, libs/gamecode/test/test-conv4 fails when optimizing due to gcc
using vcvttps2dq (which is nice, actually) for vector forms, but not the
single equivalent other times. I haven't decided what to do with the
test (I might abandon it as it does seem to be UD).
This gets ambient sounds (in particular, water and sky) working again
for quakeworld after the recent sound changes, and again for nq after I
don't know how long.
Because the calculation didn't take the hunk header size (which is not
included in the hunk size) into account, the conversion to MB was one
short and thus the rounding up to the next 8 MB boundary was giving the
current total hunk size (ie, the already given size). Most confusing to
a user ("But I already asked for 128MB!").
It turns out that copying just "unknown" is a significant performance
hit when doing over 100M allocations. Making Hunk_RawAlloc the core and
initializing the name field with a single 0 shaved about a second off
`qfvis gmsp3v2.bsp` (from about 39s to about 38s).
My reason for using Hunk_HighAlloc for allocating cache blocks was to
lock them down so they were safe for the sound mixer to access when
running in a real-time thread. However, I had never tested under tight
memory constraints, which proved that the design (or maybe just
implementation) just wasn't robust. However, now that sounds are loaded
into a completely separate region, it's safe to put the cache back to
its original behaviour (still with 64-byte alignment and such, of
course). This will even allow the high hunk to be used again, though it
effectively was anyway with Hunk_TempAlloc.
I never liked "cache" as a name because it said where the sound was
stored rather than how it was loaded/played, but "stream" is ok, since
that's pretty much spot on. I'm not sure "block" is the best, but it at
least makes sense because the sounds are loaded as a single block (as
opposed to being streamed). An now, neither use the cache system.
Nuclear powered audio ;)
More seriously, use _Atomic on a few fields that very obviously need it.
That is, channel's buffer pointer (used to signal to the mixer that the
channel is ready for use) and "flow control" flags (stop, done and
pause), and head and tail in the buffer itself. Since QF has been
working without _Atomic (admittedly, thanks to luck and x86's strong
memory model), this should do until proven otherwise. I imagine getting
stream reading out of the RT thread will highlight any issues.
Turned out the channels simply weren't being freed by SND_ScanChannels
when they should have been (probably a good thing, too, as it wasn't
being told to wait for the mixer).
Care needs to be taken when freeing channels as doing so while an
asynchronous mixer is using them is unlikely to end well. However,
whether the mixer is asynchronous depends on the output driver. This
lets the driver inform the rest of the system that the output and mixer
are running asynchronously.
SYS_dev is a holdover from when we had only the one flag and is not
meant to be used for tests (I seem to remember mentioning an audit was
necessary, but obviously forgotten). One step at a time, I guess :)
This improves the locality of reference when mixing and removes the
proxy sfx for streamed sounds.
The buffer for streamed sounds is allocated when the stream is opened
(since streamed sounds can't share buffers), and freed when the stream
is closed.
For block sounds, the buffer is reference counted (with the sfx holding
one reference, so currently block buffers never get freed), with their
reference count getting incremented on open and decremented on close.
That the reference counts get to 1 has been confirmed, so all that
should be needed is proper destruction of the sfx instances.
Still need to sort out just why channels leak across level changes.
Getting the tag is possibly useful in general and definitely in
debugging. Setting, I'm not so sure as it should be done when allocated,
but that's not always possible.
Also, correct the return type of z_block_size, though it affected only
Z_Print. While an allocation larger than 4GB is... big for zone, the
blocks do support it, so printing should too.
They're currently treated as non-fatal, those sounds just won't ever
play. This allows ad_tears to at least load with only 32MB of locked
memory (it needs somewhere between 64 and 96).
Since Ruamoko got vector types, zone's 8-byte alignment was no longer
sufficient due to hardware-enforced alignment requirements of the
underlying vector operations.
Fixes#28.
And use it for Ruamoko object reference counts.
I need reference counts for dealing with block sound buffers since they
can be shared by many channels. I figured I take care of Ruamoko's
reference count location at the same time.
Fixes#27.
Sounds no longer use the cache, which is good for multi-threaded, but a
pain for memory management: the buffers are shared between channels that
play back the sounds, but when the sounds were cached, they were
automagically (thus problematically) freed when the space was needed.
That no longer happens, so they leak. I think the solution is to use
reference counting and retain/release in sfx->open() and sfx->close().
Streams are the easy one as they were never in the cache. As a side
effect, sfxstream_t is much smaller as it no longer has the buffer
embedded in the struct.
SND_AllocChannel is a little too aggressive in freeing channels that
have finished as the channel may be externally owned (eg, by cd_file).
Get bgm looping working again.
More shrinkage. It turned out the mixer uses the phase fields, so they
couldn't be removed, but even at 192kHz, +/- 127 samples produces
sufficient phase separation for a 21cm head (which is, actually, pretty
big: mine is about 15cm across), but that change can come later.
The ambient sound loading has been removed from snd_channels because 1)
it doesn't work for nq, 2) it should never have been there in the first
place (it belongs in the client, but that needs some more API).
This is part of a process to shrink channel_t so it doesn't waste locked
memory when it gets moved there. Eventually, only the fields the mixer
needs will be in channel_t itself: those needed for spacialization will
be moved into a separate array.
In the process, I found that channels leak across level changes, but
this appears to be due to the cached sounds being removed during loading
and the mixer never marking them as done (it sees the null sfx pointer
and assumes the channel was never in use). Having the mixer mark the
channel as done seems to fix the leak, but cause a free channel list
overflow. Rather than fight with that, I'll leave the leak for now and
fix it at its root cause: the management of the sound samples
themselves.
Sys_DoubleTime starts at 4Gs in order to keep its precision fixed for a
nice long time (about 120 years, iirc).
This fixes an instant watchdog trigger when first starting up in
testsound. I'm not sure why it didn't happen with nq, but I guess that
doesn't really matter
The scaling up of the volumes when setting a channel's volume bothered
me. The biggest issue being it hasn't been necessary for over a decade
since the conversion to a float-mixer. Now the volume and attenuation
scaling from protocol bytes is entirely in the client's hands.
This does mean that the gl and sw renderers can no longer call
S_ExtraUpdate, but really, they shouldn't be anyway. And I seem to
remember it not really helping (been way too long since quake ran that
slowly for me).
sfx_t is now private, and cd_file no longer accesses channel_t's
internals. This is necessary for hiding the code needed to make mixing
and channel management *properly* lock-free (I've been getting away with
murder thanks to x86's strong memory model and just plain luck with
gcc).
The tests fail as they exercise how the cache *SHOULD* work rather than
how it does now.
The tests do currently pass for the pending work I've done on the cache
system, but while working on it, I remembered why I reworked cache
allocation...
The essential problem is that sounds are loaded into the cache, which is
fine for synchronous output targets, but has proven to be a minefield
for asynchronous output targets (JACK, ALSA).
The reason for the minefield is the hunk takes priority over the cache,
and is free to move cache blocks around, and *even dispose of them
entirely* in order to satisfy memory allocations from either end of the
hunk. Doing this in an entirely single-threaded process (as DOS Quake
was) is perfectly safe, as the users of the cache just reload the
pointer each time, and bail if it's null (meaning the block has been
freed), or even cause the data to be reloaded if possible (I'm a little
fuzzy on the details for that as I didn't write that code). However, in
multi-threaded code, especially real-time (JACK, possibly ALSA), it's a
recipe for disaster. The 4cab5b90e6 commit was a (mostly) successful
attempt to mitigate the problem by allocating the cache blocks from the
high-hunk (thus minimizing any movement caused by low-hunk allocations),
it resulted in cache allocates and regular high-hunk allocations somehow
getting intertwined: while investigating just how much memory ad_tears
needs (somewhere between 192MB and 256MB), I got "trashed sentinel"
errors and upon investigation, I found what looks very suspiciously like
audio data written across a hunk control block.
I've decided that the cache allocation *algorithm* should be reverted to
how it was originally designed by Id (details will remain "modern"), but
while working on the tests, I remembered why I had done the changes in
the first place (above story). Thus the work on reverting the cache
allocation can't go in until I get sound memory management independent
of the cache. The tests are going in now so I have a constant reminder :)
And make Sys_MaskPrintf take the developer enum rather than just a raw
int.
It was actually getting some nasty hunk corruption errors when under
memory pressure that made it clear the sound system needs some work.
I had been trimming for the solid leaf, but not the empty leafs. I had
assumed the vis tool would trim the bits, but it seems to not be
reliable (though it could be a bug in qfvis, I think the map in question
is one of my test maps).
The texture animation data is compacted into a small struct for each
texture, resulting in much less data access when animating the texture.
More importantly, no looping over the list of frames. I plan on
migrating this to at least the other hardware renderers.
I found a test map with no texture data. Even after fixing the bsp
loader, vulkan didn't like it. Now vulkan is happy. The "Missing" text
is full-bright magenta on a dim grey background so it should be visible
in any lighting conditions.
Conflagrant Rodent has a sub-model with 0 faces (double bit error?)
causing simply counting faces to get out of sync with actual model
starts thus breaking *all* brush models that come after it (including
other maps). Thus be a little less lazy in figuring out model start
faces.
The models are broken up into N sub-(sub-)models, one for each texture,
but all faces using the same texture are drawn as an instance, making
for both reduced draw calls and reduced index buffer use (and thus,
hopefully, reduced bandwidth). While texture animations are broken, this
does mark a significant milestone towards implementing shadows as it
should now be possible to use multiple threads (with multiple index and
entid buffers) to render the depth buffers for all the lights.
This allows the use of an entity id to index into the entity data and
fetch the transform and colormod data in the vertex shader, thus making
instanced rendering possible. Non-world brush entities are still not
rendered, but the world entity is using both the entity data buffer and
entid buffer.
Sub-models and instance models need an instance data buffer, but this
gets the basics working (and the proof of concept). Using arrays like
this actually simplified a lot of the code, and will make it easy to get
transparency without turbulence (just another queue).
The gl water warp ones have been useless since very early on due to not
doing water warp in gl (vertex warping just didn't work well), and the
recent water warp implementation doesn't need those hacks. The rest of
the removed flags just aren't needed for anything. SURF_DRAWNOALPHA
might get renamed, but should be useful for translucent bsp surfaces
(eg, vines in ad_tears).
One more step towards BSP thread-safety. This one brought with it a very
noticeable speed boost (ie, not lost in the noise) thanks to the face
visframes being in tightly packed groups instead of 128 bytes apart,
though the sw render's boost is lost in the noise (but it's very
fill-rate limited).
This is next critical step to making BSP rendering thread-safe.
visframe was replaced with cluster (not used yet) in anticipation of BSP
cluster reconstruction (which will be necessary for dealing with large
maps like ad_tears).
The main goal was to get visframe out of mnode_t to make it thread-safe
(each thread can have its own visframe array), but moving the plane info
into mnode_t made for better data access patters when traversing the bsp
tree as the plane is right there with the child indices. Nicely, the
size of mnode_t is the same as before (64 bytes due to alignment), with
4 bytes wasted.
Performance-wise, there seems to be very little difference. Maybe
slightly slower.
The unfortunate thing about the change is the plane distance is negated,
possibly leading to some confusion, particularly since the box and
sphere culling functions were affected. However, this is so point-plane
distance calculations can be done with a single 4d dot product.
The map uses 41% of a 4k light map scrap, and 512 texture descriptors
wasn't enough for vulkan. Ouch. I do need to get cvars on these things,
but this will do for now (decades later...)
Sounds in Arcane Dimensions (at least those used by ad_tears) specify
start and end cue points. The code was using only the final point in the
list and thus breaking looped sounds. Now, the first cue point is used
as the loop start, and the second (if present), the sample length. Both
are bounds-checked against the wav's sample count. Fixes sound locking
up during the first seconds in ad_tears.
This one is ancient: the code was essentially unmodified since release
(just some formatting). Malformed vectors could sneak through due to map
bugs (eg, "angles -90" instead of "angle -90" as in ad_tears) and the
vector parsing code would continue past the end of the string and
writing into unowned memory, potentially messing up the libc allocation
records. Replacing with the obvious sscanf works nicely.
Sometimes, Quake code is brilliant. Other times, it's a real face-palm.
This fixes the annoying persistence of inputs when respawning and
changing levels. Axis input clearing is hooked up but does nothing as of
yet. Active device input clearing has always been hooked up, but also
does nothing in the evdev and x11 drivers.