This makes it possible for hierarchies to clean themselves up by deleting
their entities (though that will cause other problems later when the
hierarchy doesn't own the entities), thus plugging a memory leak when
parsing passage text.
The main goal of this change was to make it easier to tell when a
hierarchy has been deleted, but as a side benefit, it got rid of the use
of PR_RESMAP. Also, it's easy to track the number of hierarchies.
Unfortunately, it showed how brittle the component side of the ECS is
(scene and canvas registries assumed their components were the first (no
longer the case), thus the sweeping changes).
Centerprint doesn't work (but it hasn't for a while).
It's used for finding the entity that has the actual canvas component
attached. Useful for sharing a single canvas between multiple view
hierarchies; it also worked as a proof of concept for doing similar with
hierarchy references, and might work for properly destroying canvas items
(fills etc) when a view entity is deleted (if attached to every view).
Rather than creating and destroying one every call. Didn't make any real
difference to the memory leaks, but it should make the calls
fractionally faster and reduce fragmentation.
Every use of va_copy needs a corresponding call to va_end. I had somehow
missed that when getting _dvsprintf to work properly. This seems to plug
a memory leak (certainly doesn't make things worse).
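A minimal sketch of the required pairing (not the actual _dvsprintf body):

    #include <stdarg.h>
    #include <stdio.h>

    static int
    format_length (const char *fmt, va_list args)
    {
        va_list tmp;
        va_copy (tmp, args);
        int len = vsnprintf (0, 0, fmt, tmp);
        va_end (tmp);       // every va_copy needs a matching va_end
        return len;
    }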
This seems to be the best solution for interlinked entities/components,
the idea being that components with higher indices can "own" those with
lower (eg, imui_reference can "own" a view_href, but not the other way)
and makes it relatively easy to manage (components that can own others
get added to the registry later), and might even allow validation at a
later stage.
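A sketch of the rule (the function is illustrative, not actual registry
code; the component names are the ones mentioned above):

    #include <stdbool.h>
    #include <stdint.h>

    // components registered later get higher indices, so an imui_reference
    // (registered after view_href) may own a view_href, never the reverse
    static bool
    component_can_own (uint32_t owner_comp, uint32_t owned_comp)
    {
        return owner_comp > owned_comp;
    }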
I'm not entirely sure what's going on yet, but deleting the referenced
view via its entity rather than the view results in a corrupted href in
the component pool (with a null entity id in the dense array) and then
an href component leak (as well as some very weird numbers when dumping
canvas bounds). I suspect Hierref_DestroyComponent is missing a few
steps (though I do need to verify that it's getting called in this
particular case).
It was a little off-putting getting an incorrectly clipped console when
using non-unitary scale (especially since I was trying to show abbator
something).
This is for scroll boxes (the nesting of canvases is for the clipping
they provide). There are some issues with automatic layout, but this
gets things mostly working, in particular the management of the link
between hierarchies, as a canvas is always the root of its hierarchy.
With the scroll box work I'm doing, I realized 16 bits is a little
cramped. Although I doubt it would be that much of a problem, switching
to 32 bits turned out to be free because of alignment.
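A hypothetical layout showing why the widening is free: the 16-bit field
was followed by padding to keep the next 32-bit member aligned, so both
structs end up the same size:

    #include <stdint.h>

    struct before { uint16_t count; /* + 2 bytes padding */ uint32_t data; };
    struct after  { uint32_t count; uint32_t data; };
    // sizeof (struct before) == sizeof (struct after) == 8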
Seems to work well. The other renderers have stubs because I don't feel
like implementing clipping for them. The gl and glsl wouldn't be too
difficult (need to handle the draw queues), but sw needs a fair bit of
work and I'm not sure it's worth the effort.
Much of the state handling was highly redundant (in particular, handling
entity and old_entity). This should make it easier to get draggable items
for window resizing.
It now checks the next block to see if it is free with enough space and
carves off a chunk if so, or chops off the end of the current block if
smaller, otherwise it allocates *before* freeing.
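Roughly, the new resize path looks like this (block_t and the helpers are
hypothetical stand-ins for the allocator's internals):

    #include <stddef.h>
    #include <string.h>

    typedef struct block_s {
        struct block_s *next;
        size_t          size;
        int             free;
        void           *data;
    } block_t;

    void *grow_into_next (block_t *block, block_t *next, size_t new_size);
    void *shrink_block (block_t *block, size_t new_size);
    void *alloc_data (size_t size);
    void  free_block (block_t *block);

    void *
    resize_block (block_t *block, size_t new_size)
    {
        block_t *next = block->next;
        if (next && next->free && block->size + next->size >= new_size) {
            // next block is free with enough space: carve off a chunk
            // and grow in place
            return grow_into_next (block, next, new_size);
        }
        if (new_size < block->size) {
            // shrinking: chop off the end of the current block
            return shrink_block (block, new_size);
        }
        // otherwise allocate the new space *before* freeing the old block
        // so the contents can be copied across
        void *new_data = alloc_data (new_size);
        memcpy (new_data, block->data, block->size);
        free_block (block);
        return new_data;
    }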
This makes a possible improvement to e1m3, only barely affects ad_tears,
but makes about 30% difference to gmsp3v2 (21fps to 27, and from 3300
leafs to 2700).
The cascade_shadow and cube_shadow names are no longer relevant thanks
to the staging images, and the output field for render passes is
optional in general and irrelevant for shadow maps.
The rendering of the shadow maps now takes the culling information into
account resulting in a drastic reduction of work. There's still more
work to be done, but demo1 peaks at over 1000fps at 640x480, gmsp3v2 now
gets 14fps (1920x1080) near the front gate (used to be 3, then 6),
ad_tears is up to 3fps, while marcher is still unhappy, but it has
infinite radius lights, so needs more culling work (clipped light
volumes will help, I think). Also, culling lights for which nothing has
moved within their volumes will help somewhat (though not as much for
most id maps, I suspect).
Using the translucency pass made it easy to have depth-tested
translucent "solid" light volumes instead of always visible lines (which
are still an option as that's useful too). Most importantly, being able
to see the surfaces helped no end in figuring out that my hulls were
created with counter-clockwise windings instead of quake's usual
clockwise windings and thus my hulls were being rendered inside-out in
the occlusion pass.
The results of the occlusion queries give the lights that don't have a
visible hull, but unfortunately that includes any lights the camera is
inside; simple distance checks sort that out (with a fudge-factor of 1.583
(3(2+p)/(2+3p), where p is the golden ratio) for the icosahedron vertices).
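For reference, the arithmetic behind the quoted constant (a standalone
check, not engine code):

    #include <math.h>
    #include <stdio.h>

    int
    main (void)
    {
        double p = (1.0 + sqrt (5.0)) / 2.0;                // golden ratio, ~1.6180
        double fudge = 3.0 * (2.0 + p) / (2.0 + 3.0 * p);   // ~1.5836
        printf ("%.4f\n", fudge);
        return 0;
    }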
My efforts (especially the collect zone (what was I thinking)) got
tracy's knickers in a twist resulting in vanishing zones in the server.
It looks like there are some synchronisation issues between cpu and gpu,
but I'm not *too* worried about it at this stage.
The info isn't used yet, but this shows that vulkan's occlusion queries
are at least somewhat useful. However, the technique isn't perfect:
infinite radius lights (1/r and 1/r^2) are difficult to cull, and all
lights can poke through thin enough walls, and lights containing
the camera get culled incorrectly (will need a separate test). Still, it
looks like it will help once everything is tied together.
And make it callable directly (needed to be able to submit the command
buffer separately from the main commands (though this does mess with
tracy a little)).
They weren't rendering properly at all due to the matrix updates getting
overwritten by the light data (I'd forgotten to advance the packet data
pointer).
This doesn't make much of a difference on the GPU, but it drastically
cuts down CPU usage, especially for ad_tears: shadow map drawing is down
from 16.3ms to 3.7ms thanks to not having to run the alias model queues
as often.
Batching shadow map rendering needs to be able to reference matrices for
multiple lights in a single batch, but the only input is the view index,
so use that to look up the matrix index rather than using it to index
the matrices directly (modulo the base index that's still there).
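Conceptually (the lookup actually happens in the shaders; the names and
types here are hypothetical):

    #include <stdint.h>

    typedef struct { float m[16]; } mat4_t;

    // previously: matrices[base + view_index]; now the view index selects a
    // matrix *index*, so one batch can reference matrices for several lights
    static const mat4_t *
    shadow_view_matrix (const mat4_t *matrices, const uint32_t *view_to_matrix,
                        uint32_t base_view, uint32_t view_index)
    {
        return &matrices[view_to_matrix[base_view + view_index]];
    }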
Actually, only 29 are used because nvidia's drivers segfault when there
are more than 29 views (regardless of the exact bit pattern in the view
mask). This will allow rendering shadow maps in large batches, which
should make for better GPU utilization.
Even that's getting pretty big, but with the quanta at 128, that's a
maximum of 8 different image sizes (which is nice for my planned
"staging image" idea).
Interestingly, this caused a reduction in memory use for some maps (though
it did increase marcher's again, just not as much as the bogus rounding
did). The idea was to use sparse bindings to remap shadow map layers,
but it turns out sparse bindings are insanely slow (beyond unusable).
However, the reduction in the number of shadow map images seems to be
worth it.
Since switching to the 1.2 api as a requirement, might as well use the
relevant structs instead of the extension struct (for multiview). Came up
when double-checking the max views property due to running into what
appears to be an nvidia bug where > 29 views (any bit pattern) cause a
segfault when creating the pipeline.
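For reference, a sketch of checking the limit via the core properties
struct rather than the extension struct (not the engine's actual code):

    #include <vulkan/vulkan.h>

    static uint32_t
    max_multiview_views (VkPhysicalDevice physdev)
    {
        VkPhysicalDeviceVulkan11Properties v11 = {
            .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_1_PROPERTIES,
        };
        VkPhysicalDeviceProperties2 props = {
            .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
            .pNext = &v11,
        };
        vkGetPhysicalDeviceProperties2 (physdev, &props);
        return v11.maxMultiviewViewCount;
    }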
I had missed that upping max lights to 2048 meant that up to 12288
matrices (6 per light) are needed for all the possible lights. This made it so the
light type could not be encoded in id_data, but the shaders never used
it anyway. This leaves one bit free.
I'd added some developer output to see how the layers were distributed
between images and found the image widths to be... odd. It turns out I
was double-adding the shadow_quanta. Oops. Results in ~164MB less memory
used by marcher (for 32 pixel quanta).
This allows "large" updates to be done in a single staging buffer packet
instead of one packet per quad (or slice). Currently, they're batched
into groups of 64 (not really enough for conchars, but that's only at
init-time, so not all that bad). Nicely, this seems to simplify the
staging code.
Fixes #65.
When looking at a struct and seeing "count" and "size", I had to hunt to
see what "size" really meant. Cherno is very much right about size vs
count being bytes vs number of objects.
load_conchars and load_crosshairs were using create_quad directly (due
to make_static_quad having the wrong parameters), but this spread the
handling of which buffer and index were used through the code. Thus fix
make_static_quad to take the x, y offsets (like make_dyn_quad) and then
use it in load_conchars and load_crosshairs.
While QFV_PacketScatterBuffer works on only one destination buffer, it
turns out it's still useful for scattering to multiple buffers, just
with multiple calls. This makes it pretty easy to combine multiple
buffer updates into a single staging buffer packet, resulting in
reducing lighting's packet use from up to 7 to just one, drastically
reducing the pressure on the staging buffer packet pool, and thus
reducing the chances of QFV_PacketAcquire stalling.
This relies on my fork of tracy: https://github.com/taniwha/tracy
on the wip-c-vulkan branch. Everything is still rather flaky though.
This necessitated the jump to vulkan 1.2 as a requirement.
This gets the dynamic data closer to the gpu, so should make a
difference when there's a lot going on. However, for simple tests, it
made no difference.
This allows tracy to clean up properly. However, Sys_Quit will use the
jump buffer (sys_exit_jmpbuf) only if it has been set, so the use of
Sys_setjmp is optional.
I'm still not happy with it being a compile time constant, but this
takes care of the interlock between frames in flight... for now: it's
fragile and really needs the excessive small-packet use in draw and
lighting to be cleaned up.
After discussion with Darian, I've decided to go with one big staging
buffer (with lots of packets) shared between FiF as the large size will,
in the end, be more flexible.
Host_Error and Host_EndGame use setjmp/longjmp to implement an exception
of sorts, but this messes with tracy's state even with cleanup
attributes. However, it turns out that those cleanup attributes are
exactly how gcc implements C++ destructors, and so the standard Unwind
api (part of libgcc) respects them (so long as -fexceptions is enabled
for C). Thus... replace longjmp with an implementation that uses Unwind
to unwind the stack and call the cleanup functions as needed. This is
actually important for more than just tracy as the cleanup-attributed
vars can be thread locks.
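A minimal sketch of why this matters (not the actual longjmp replacement):
with -fexceptions, a forced unwind via libgcc's _Unwind_ForcedUnwind runs
cleanup attributes just like C++ destructors, where a plain longjmp would
silently skip them:

    #include <pthread.h>

    static void
    unlock_on_unwind (pthread_mutex_t **mtx)
    {
        pthread_mutex_unlock (*mtx);
    }

    void
    do_locked_work (pthread_mutex_t *mtx)
    {
        pthread_mutex_lock (mtx);
        __attribute__((cleanup (unlock_on_unwind))) pthread_mutex_t *held = mtx;
        // if an error path unwinds the stack through here (instead of
        // longjmp-ing past it), the cleanup above releases the lock
    }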
Tracy is a frame profiler: https://github.com/wolfpld/tracy
This uses Tracy's C API to instrument the code (already added in several
places). It turns out there is something very weird with the fence
behavior between the staging buffers and render commands as the
inter-frame delay occurs in a very strange place (in the draw code's
packet acquisition rather than the fence waiter that's there for that
purpose). I suspect some tangled dependencies.
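The zone macros from tracy's C API look roughly like this (header path per
tracy's manual; the functions being instrumented are placeholders):

    #include "tracy/TracyC.h"

    static void
    update_world (void)
    {
        TracyCZone (zone, 1);       // open a zone for this scope
        // ... simulation work ...
        TracyCZoneEnd (zone);       // C zones must be closed explicitly
    }

    void
    run_frame (void)
    {
        update_world ();
        TracyCFrameMark;            // mark the frame boundary for the timeline
    }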
Mostly just macro conflicts (and a little white space in passing).
Commits for integrating tracy will come later when I've come up with a
wrapper-api that I like (so non-tracy builds are easy even with tracy
available).
This fixes the weird slug when running nq on windows. It turns out it
was the "friendly neighbor" sleep code activating due to bitrot. In
addition, there are cvars for enabling unfocused sleep (defaults off)
and disabling minimized sleep (defaults on).