Sounds odd, but it's part of the problem with calling two different
things with essentially the same name. The "high level" render pass in
question may be a compute pass, or a complex series of (Vulkan) render
passes and so won't create a Vulkan render pass for the "high level"
render pass (I do need to come up with a better name for it).
I really don't remember why I made it separate, though it may have been
to do with r_ent_queue. However, putting it together with the rest is
needed for the "render pass" rework.
It now lives in vulkan_renderpass.c and takes most of its parameters
from plist configs (just the name (which is used to find the config),
output spec, and draw function from C). Even the debug colors and names
are taken from the config.
QFV_CreateRenderPass is no longer used, and QFV_CreateFramebuffer hasn't
been used for a long time. The C file is still there for now but is
basically empty.
The real reason for the delay in implementing support for pNext is I
didn't know how to approach it at the time, but with the experience I've
gained using and modifying vkparse, the solution turned out to be fairly
simple. This allows for the use of various extensions (eg, multiview,
which was used for testing, though none of the hookup is in this
commit). No checking is done on the struct type being valid other than
it must be of a chainable type (ie, have its own pNext).
The software renderer uses Bresenham's line slice algorithm as presented
by Michael Abrash in his Graphics Programming Black Book Special Edition
with the serial numbers filed off (as such, more just so *I* can read
the code easily), along with the Cohen-Sutherland line clipping
algorithm. The other renderers were more or less trivial in comparison.
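For reference, a minimal sketch of the Cohen-Sutherland outcode test
(illustrative only, not the actual QF code, which pairs it with the
run-slice Bresenham loop):

    enum {
        CLIP_LEFT = 1, CLIP_RIGHT = 2, CLIP_BOTTOM = 4, CLIP_TOP = 8,
    };

    // Classify a point against the clip rect [xmin,xmax] x [ymin,ymax].
    static int
    outcode (int x, int y, int xmin, int ymin, int xmax, int ymax)
    {
        int     code = 0;
        if (x < xmin) code |= CLIP_LEFT;
        else if (x > xmax) code |= CLIP_RIGHT;
        if (y < ymin) code |= CLIP_BOTTOM;
        else if (y > ymax) code |= CLIP_TOP;
        return code;
    }

    // Trivial accept: both endpoint codes are 0.  Trivial reject:
    // (code0 & code1) != 0.  Otherwise clip one non-zero endpoint against
    // one of its edges and repeat.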
Enabled by 'developer lighting'. It was good for confirming that the
lights in ad_e1m1 (Doom Hangar 16) were actually being output (over 600
of them sometimes, ouch). Turned out to be the color scale ambiguity.
The pitch cvars are taken from quakespasm because I ran into a button I
couldn't shoot with the 80 degree limit, but I figured I'd add roll
limits while I was at it.
Surfaces marked with SURF_DRAWALPHA but not SURF_DRAWTURB are put in a
separate queue for the water shader and run with a turb scale of 0.
Also, entities with colormod alpha < 1 are marked to go in the same
queue as SURF_DRAWALPHA surfaces (ie, no SURF_DRAWTURB unless the
model's texture indicated such).
Textures whose names start with a { are meant to be rendered with
transparency. Surfaces using those textures are marked with
SURF_DRAWALPHA.
Unfortunately, the mip levels of ad_tears' transparent textures use the
wrong color so only the highest LOD works properly, but those textures
are meant to be loaded from external files anyway, it seems.
A listener is used instead of (really, as well as) ie_app_window events
because systems that need to know about window sizes may not have
anything to do with input and the event system.
This breaks console scaling for now (con_width and con_height are gone),
but is a major step towards window resize support as console stuff
should never have been in viddef_t in the first place.
The client screen init code now sets up a screen view (actually the
renderer's scr_view) that is passed to the client console so it can know
the size of the screen. The same view is used by the status bar code.
Also, the ram/cache/paused icon drawing is moved into the client screen
update code. A bit of duplication, but I do plan on merging that
eventually.
Things seem to be at least close to the right place now.
Input line handling has been made more object-oriented in that the
collection of objects required for a single input line (command, say,
say_team) are bundled into one object with just one set of handlers for
resize and draw. Much tidier.
view_new sets the geometry, but any setgeometry callback that needs a valid
data pointer would get null. It might be better to always have the data
pointer, but I didn't feel like doing such a change at this stage as
there are quite a lot of calls to view_new. Thus view_new_data which
sets the data pointer before calling setgeometry.
More tuning is needed on the actual splits as it falls over when the
lower rect gets too low for the subrects being allocated. However, the
scrap allocator itself will prefer exact width/height fits with larger
cutoff over inexact cuts with smaller cutoff. Many thanks to tdb for the
suggestions.
Fixes the fps dropping from ~3700fps down to ~450fps (cumulative due to
loss of POT rounding and very poor splitting layout), with a bonus boost
to about 4900fps (all speeds at 800x450). The 2d sprites were mostly ok,
but the lightmaps forming a capital gamma shape in a 4k texture really
hurt. Now the lightmaps are a nice dense bar at the top of the texture,
and 2d sprites are pretty good (slight improvement coming next).
This replaces old_console_t with con_buffer_t for managing scrollback,
and draw_charbuffer_t for actual character drawing, reducing the number
of calls into the renderer. There are numerous issues with placement and
sizing, but the basics are working nicely.
I really don't know why I tried to do ring-buffers without gaps, the
code complication is just not worth the tiny savings in memory. In fact,
just the switch from pointers to 32-bit offsets saves more memory than
not having gaps (on 64-bit systems, no change on 32-bit).
I've decided that appending to a full single-line buffer should simply
scroll through the existing text. Unsurprisingly, the existing code
doesn't handle the situation all that well. While I've already got a fix
for it, I think I've got a better idea that will handle full buffers
more gracefully.
This fixes the current line object getting corrupted by the tail line
update when the buffer is filled with a single line. There are probably
more tests to write and bugs to fix :)
I was looking through the code for Con_BufferAddText trying to figure
out what it was doing (answer: ring buffer for both text and lines) and
got suspicious about its handling of the line objects. I decided an
automated test was in order. It turns out I was right: filling the
buffer with a single long line causes the tail line to trample the
current line, setting its pointer and length to 0 when the final
character is put in the buffer.
It handles basic cursor motion respecting \r \n \f and \t (might be a
problem for id chars), wraps at the right edge, and automatically
scrolls when the cursor tries to pass the bottom of the screen.
Clearing the buffer resets its cursor to the upper left.
QFS_LoadFile closes its file argument (a design error left over from
changing QFS_LoadFile to take a file instead of a path and not completing
the update), so the call to Qfilesize was accessing freed memory.
This is intended for the built-in 8x8 bitmap characters and quake's
"conchars", but could potentially be used for any simple (non-composed
characters) mono-spaced font. Currently, the buffers can be created,
destroyed, cleared, scrolled vertically in either direction, and
rendered to the screen in a single blast.
One of the reasons for creating the buffer is to make it so scaling can
be supported in the sw renderer.
PR_Debug_ValueString prints the value at the given offset using the
provided type to format the string. The formatted string is appended to
the provided dstring.
For whatever reason, building under MXE (for windows) causes FLAC to try
to use dll import references, but setting FLAC__NO_DLL before including
FLAC/export.h fixes the issue.
While this does pull the grovelling for the subpic out to the callers,
the real problem is the excessive use of qpic_t in the internal code:
qpic_t is really just the image format in wad files, and shouldn't be
used as a generic image handle.
Cleans up more of the icky code in the font drawing functions.
This makes working with quads, implied alpha quads, and lines much
cleaner (and gets rid of the bulk of the "eww" fixme), and will probably
make it easier to support multiple scraps and fonts, and potentially
more flexible ordering between pipelines.
-describe is sent to the object, and the returned string passed back.
There is a worry about the lifetime of the returned string as there's
currently no way of both ensuring it doesn't get freed prematurely and
ensuring it does eventually get freed.
If no handler has been registered, then the corresponding parameter is
printed as a pointer but with surrounding brackets (eg, [0xfc48]). This
will allow the ruamoko runtime to implement object printing.
This means that QF should support more exotic fonts without any issue
(once the rest of the text handling system is up to snuff) as HarfBuzz
does all the hard work of handling OpenType, Graphite, etc text shaping,
including kerning (when enabled).
Also, font loading now loads all the glyphs into the atlas (preload is
gone).
While the results are a little surprising (tends to alternate between
left side and top for allocations), there is much less wasted space in
the partially allocated regions, and the main free region seems to
always be quite big.
While VRect_Difference worked for subrect allocation, it wasn't ideal as
it tended to produce a lot of long, narrow strips that were difficult to
reuse and thus wasted a lot of the super-rectangle's area. This is
because it always does horizontal splits first. However, rewriting
VRect_Difference didn't seem to be the best option.
VRect_SubRect (the new function) takes only width and height, and splits
the given rectangle such that if there are two off-cuts, they will be
both the minimum and maximum possible area. This does seem to make for
much better utilization of the available area. It's also faster as it
does only the two splits, rather than four.
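A rough sketch of the idea (not the actual VRect_SubRect implementation;
rect_t and subrect are made-up names): both split orders carve up the same
leftover area, so picking the order with the larger full-length strip gives
both the largest and smallest possible off-cuts:

    typedef struct { int x, y, w, h; } rect_t;

    // Allocate a w x h sub-rect in the top-left corner of r, writing the
    // off-cuts (0, 1 or 2 of them) to cuts[] and returning their count.
    // Both split orders divide the same leftover area, so the order whose
    // full-length strip is bigger yields both the largest and the smallest
    // off-cuts possible.
    static int
    subrect (rect_t r, int w, int h, rect_t cuts[2])
    {
        int     n = 0;

        if (r.w * (r.h - h) >= (r.w - w) * r.h) {
            // horizontal cut first: full-width strip below, side strip beside
            if (r.h > h) cuts[n++] = (rect_t) { r.x, r.y + h, r.w, r.h - h };
            if (r.w > w) cuts[n++] = (rect_t) { r.x + w, r.y, r.w - w, h };
        } else {
            // vertical cut first: full-height strip beside, strip below
            if (r.w > w) cuts[n++] = (rect_t) { r.x + w, r.y, r.w - w, r.h };
            if (r.h > h) cuts[n++] = (rect_t) { r.x, r.y + h, w, r.h - h };
        }
        return n;
    }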
It is currently an ugly hack for dealing with the separate quad queue,
and the pipeline handling code needs a lot of cleanup, but it works
quite well, though I do plan on moving to HarfBuzz for text shaping. One
nice development is I got updating of descriptor sets working (just need
to ensure the set is no longer in use by the command queue, which the
multiple frames in flight makes easy).
It's implemented only in the Vulkan renderer, partly because there's a
lot of experimenting going on with it, but the glyphs do get transferred
to the GPU (checked in render doc). No rendering is done yet: still
thinking about whether to do a quick-and-dirty test, or to add HarfBuzz
immediately, and the design surrounding that.
R8G8B8A8 was hard-coded by accident when creating Vulkan_LoadTexArray
(or probably even the original Vulkan_LoadTex). This wasn't a problem
while everything was loaded in that format, but attempting to load an R8
texture didn't go so well. The same format as the image itself is used
now (correctly so).
I have recently learned that pre-multiplied alpha is the correct way to
do compositing, which is pretty much what the 2d pass does (actually,
all passes, but...). This required ensuring the color factor passed to
the fragment shader is pre-multiplied (a little silly for cshifts as
they used to be pre-multiplied but were un-pre-multiplied early in QF's
history and I don't feel like fixing that right now as it affects all
renderers), and also pre-multiplying alpha when converting from 8-bit
palette to rgba as the palette entry for transparent has that funky pink
(which is used in full-brights).
I will need to do more work to improve the 2d allocation, but rounding
up the requested sizes to the next power of two proved to be excessively
wasteful: I was able to allocate spots for only half of the sub-pics I
needed (though I did still need to double the number of pixels in the
end).
The software renderer uses Bresenham's line slice algorithm as presented
by Michael Abrash in his Graphics Programming Black Book Special Edition
with the serial numbers filed off (as such, more just so *I* can read
the code easily), along with the Chen-Sutherland line clipping
algorithm. The other renderers were more or less trivial in comparison.
Due to the mis-initialization of the union used to parse the color
vector, the intensity was incorrectly set to zero and thus the light
dropped, meaning that all lights in ad_tears were lost.
The extend instruction is for loading narrower data types into wider
data types, eg, single element into 2, 3, or 4 element types, with a
small set of extension schemes: 0, 1, -1, copy (for 1->any and 2 -> 4).
Possibly most importantly, it works with unaligned data.
Progress towards #30
Most were pretty easy and fairly logical, but gib's regex was a bit of a
pain until I figured out the real problem was the conditional
assignments.
However, libs/gamecode/test/test-conv4 fails when optimizing due to gcc
using vcvttps2dq (which is nice, actually) for vector forms, but not the
single equivalent other times. I haven't decided what to do with the
test (I might abandon it as it does seem to be UB).
This gets ambient sounds (in particular, water and sky) working again
for quakeworld after the recent sound changes, and again for nq after I
don't know how long.
Because the calculation didn't take the hunk header size (which is not
included in the hunk size) into account, the conversion to MB was one
short and thus the rounding up to the next 8 MB boundary was giving the
current total hunk size (ie, the already given size). Most confusing to
a user ("But I already asked for 128MB!").
It turns out that copying just "unknown" is a significant performance
hit when doing over 100M allocations. Making Hunk_RawAlloc the core and
initializing the name field with a single 0 shaved about a second off
`qfvis gmsp3v2.bsp` (from about 39s to about 38s).
My reason for using Hunk_HighAlloc for allocating cache blocks was to
lock them down so they were safe for the sound mixer to access when
running in a real-time thread. However, I had never tested under tight
memory constraints, which proved that the design (or maybe just
implementation) just wasn't robust. However, now that sounds are loaded
into a completely separate region, it's safe to put the cache back to
its original behaviour (still with 64-byte alignment and such, of
course). This will even allow the high hunk to be used again, though it
effectively was anyway with Hunk_TempAlloc.
I never liked "cache" as a name because it said where the sound was
stored rather than how it was loaded/played, but "stream" is ok, since
that's pretty much spot on. I'm not sure "block" is the best, but it at
least makes sense because the sounds are loaded as a single block (as
opposed to being streamed). And now, neither uses the cache system.
Nuclear powered audio ;)
More seriously, use _Atomic on a few fields that very obviously need it.
That is, channel's buffer pointer (used to signal to the mixer that the
channel is ready for use) and "flow control" flags (stop, done and
pause), and head and tail in the buffer itself. Since QF has been
working without _Atomic (admittedly, thanks to luck and x86's strong
memory model), this should do until proven otherwise. I imagine getting
stream reading out of the RT thread will highlight any issues.
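Roughly the shape of it (a sketch only; the field names here are
illustrative, not the actual channel_t layout):

    #include <stdatomic.h>

    typedef struct channel_s {
        // written by the client to hand the channel to the mixer, read (and
        // eventually cleared) by the mixer: must be atomic
        _Atomic (struct sfxbuffer_s *) buffer;
        // flow control flags, set by the client, acted on by the mixer
        atomic_int  stop;
        atomic_int  done;
        atomic_int  pause;
        // mixer-private state can stay plain
        int         pos;
    } channel_t;

    // in the stream ring buffer, head and tail are each written by only one
    // side but read by the other, so they're atomic too
    typedef struct ringbuf_s {
        atomic_uint head;   // advanced by the producer (stream reader)
        atomic_uint tail;   // advanced by the consumer (mixer)
    } ringbuf_t;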
Turned out the channels simply weren't being freed by SND_ScanChannels
when they should have been (probably a good thing, too, as it wasn't
being told to wait for the mixer).
Care needs to be taken when freeing channels as doing so while an
asynchronous mixer is using them is unlikely to end well. However,
whether the mixer is asynchronous depends on the output driver. This
lets the driver inform the rest of the system that the output and mixer
are running asynchronously.
SYS_dev is a holdover from when we had only the one flag and is not
meant to be used for tests (I seem to remember mentioning an audit was
necessary, but obviously forgotten). One step at a time, I guess :)
This improves the locality of reference when mixing and removes the
proxy sfx for streamed sounds.
The buffer for streamed sounds is allocated when the stream is opened
(since streamed sounds can't share buffers), and freed when the stream
is closed.
For block sounds, the buffer is reference counted (with the sfx holding
one reference, so currently block buffers never get freed), with their
reference count getting incremented on open and decremented on close.
That the reference counts get to 1 has been confirmed, so all that
should be needed is proper destruction of the sfx instances.
Still need to sort out just why channels leak across level changes.
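The retain/release side of that, as a sketch (names are illustrative;
whether the count needs to be atomic depends on which threads do the
open/close):

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct sfxbuffer_s {
        atomic_int  refcount;   // the sfx itself holds one reference
        // ... sample data ...
    } sfxbuffer_t;

    static sfxbuffer_t *
    buffer_retain (sfxbuffer_t *buf)
    {
        atomic_fetch_add (&buf->refcount, 1);   // eg, in sfx->open ()
        return buf;
    }

    static void
    buffer_release (sfxbuffer_t *buf)
    {
        // eg, in sfx->close (); the last reference out frees the block
        if (atomic_fetch_sub (&buf->refcount, 1) == 1) {
            free (buf);
        }
    }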
Getting the tag is possibly useful in general and definitely in
debugging. Setting, I'm not so sure as it should be done when allocated,
but that's not always possible.
Also, correct the return type of z_block_size, though it affected only
Z_Print. While an allocation larger than 4GB is... big for zone, the
blocks do support it, so printing should too.
They're currently treated as non-fatal, those sounds just won't ever
play. This allows ad_tears to at least load with only 32MB of locked
memory (it needs somewhere between 64 and 96).
Since Ruamoko got vector types, zone's 8-byte alignment was no longer
sufficient due to hardware-enforced alignment requirements of the
underlying vector operations.
Fixes #28.
And use it for Ruamoko object reference counts.
I need reference counts for dealing with block sound buffers since they
can be shared by many channels. I figured I take care of Ruamoko's
reference count location at the same time.
Fixes #27.
Sounds no longer use the cache, which is good for multi-threaded, but a
pain for memory management: the buffers are shared between channels that
play back the sounds, but when the sounds were cached, they were
automagically (thus problematically) freed when the space was needed.
That no longer happens, so they leak. I think the solution is to use
reference counting and retain/release in sfx->open() and sfx->close().
Streams are the easy one as they were never in the cache. As a side
effect, sfxstream_t is much smaller as it no longer has the buffer
embedded in the struct.
SND_AllocChannel is a little too aggressive in freeing channels that
have finished as the channel may be externally owned (eg, by cd_file).
Get bgm looping working again.
More shrinkage. It turned out the mixer uses the phase fields, so they
couldn't be removed, but even at 192kHz, +/- 127 samples produces
sufficient phase separation for a 21cm head (which is, actually, pretty
big: mine is about 15cm across), but that change can come later.
The ambient sound loading has been removed from snd_channels because 1)
it doesn't work for nq, 2) it should never have been there in the first
place (it belongs in the client, but that needs some more API).
This is part of a process to shrink channel_t so it doesn't waste locked
memory when it gets moved there. Eventually, only the fields the mixer
needs will be in channel_t itself: those needed for spatialization will
be moved into a separate array.
In the process, I found that channels leak across level changes, but
this appears to be due to the cached sounds being removed during loading
and the mixer never marking them as done (it sees the null sfx pointer
and assumes the channel was never in use). Having the mixer mark the
channel as done seems to fix the leak, but causes a free channel list
overflow. Rather than fight with that, I'll leave the leak for now and
fix it at its root cause: the management of the sound samples
themselves.
Sys_DoubleTime starts at 4Gs in order to keep its precision fixed for a
nice long time (about 120 years, iirc).
This fixes an instant watchdog trigger when first starting up in
testsound. I'm not sure why it didn't happen with nq, but I guess that
doesn't really matter.
The scaling up of the volumes when setting a channel's volume bothered
me. The biggest issue being it hasn't been necessary for over a decade
since the conversion to a float-mixer. Now the volume and attenuation
scaling from protocol bytes is entirely in the client's hands.
This does mean that the gl and sw renderers can no longer call
S_ExtraUpdate, but really, they shouldn't be anyway. And I seem to
remember it not really helping (been way too long since quake ran that
slowly for me).
sfx_t is now private, and cd_file no longer accesses channel_t's
internals. This is necessary for hiding the code needed to make mixing
and channel management *properly* lock-free (I've been getting away with
murder thanks to x86's strong memory model and just plain luck with
gcc).
The tests fail as they exercise how the cache *SHOULD* work rather than
how it does now.
The tests do currently pass for the pending work I've done on the cache
system, but while working on it, I remembered why I reworked cache
allocation...
The essential problem is that sounds are loaded into the cache, which is
fine for synchronous output targets, but has proven to be a minefield
for asynchronous output targets (JACK, ALSA).
The reason for the minefield is the hunk takes priority over the cache,
and is free to move cache blocks around, and *even dispose of them
entirely* in order to satisfy memory allocations from either end of the
hunk. Doing this in an entirely single-threaded process (as DOS Quake
was) is perfectly safe, as the users of the cache just reload the
pointer each time, and bail if it's null (meaning the block has been
freed), or even cause the data to be reloaded if possible (I'm a little
fuzzy on the details for that as I didn't write that code). However, in
multi-threaded code, especially real-time (JACK, possibly ALSA), it's a
recipe for disaster. While the 4cab5b90e6 commit was a (mostly) successful
attempt to mitigate the problem by allocating the cache blocks from the
high-hunk (thus minimizing any movement caused by low-hunk allocations),
it resulted in cache allocates and regular high-hunk allocations somehow
getting intertwined: while investigating just how much memory ad_tears
needs (somewhere between 192MB and 256MB), I got "trashed sentinel"
errors and upon investigation, I found what looks very suspiciously like
audio data written across a hunk control block.
I've decided that the cache allocation *algorithm* should be reverted to
how it was originally designed by Id (details will remain "modern"), but
while working on the tests, I remembered why I had done the changes in
the first place (above story). Thus the work on reverting the cache
allocation can't go in until I get sound memory management independent
of the cache. The tests are going in now so I have a constant reminder :)
And make Sys_MaskPrintf take the developer enum rather than just a raw
int.
It was actually getting some nasty hunk corruption errors when under
memory pressure that made it clear the sound system needs some work.
I had been trimming for the solid leaf, but not the empty leafs. I had
assumed the vis tool would trim the bits, but it seems to not be
reliable (though it could be a bug in qfvis, I think the map in question
is one of my test maps).
The texture animation data is compacted into a small struct for each
texture, resulting in much less data access when animating the texture.
More importantly, no looping over the list of frames. I plan on
migrating this to at least the other hardware renderers.
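Something along these lines (an illustrative sketch, not the actual struct
or field names):

    #include <stdint.h>

    // everything needed to pick the current animation frame without walking
    // the original anim_next chain
    typedef struct texanim_s {
        uint16_t    base;       // index of the cycle's first texture
        uint8_t     offset;     // this texture's position within the cycle
        uint8_t     count;      // number of frames (1 = not animated)
    } texanim_t;

    static inline int
    anim_index (const texanim_t *anim, int frame)
    {
        // one modulo instead of a list walk
        return anim->base + (anim->offset + frame) % anim->count;
    }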
I found a test map with no texture data. Even after fixing the bsp
loader, vulkan didn't like it. Now vulkan is happy. The "Missing" text
is full-bright magenta on a dim grey background so it should be visible
in any lighting conditions.
Conflagrant Rodent has a sub-model with 0 faces (double bit error?)
causing simply counting faces to get out of sync with actual model
starts thus breaking *all* brush models that come after it (including
other maps). Thus be a little less lazy in figuring out model start
faces.
The models are broken up into N sub-(sub-)models, one for each texture,
but all faces using the same texture are drawn as an instance, making
for both reduced draw calls and reduced index buffer use (and thus,
hopefully, reduced bandwidth). While texture animations are broken, this
does mark a significant milestone towards implementing shadows as it
should now be possible to use multiple threads (with multiple index and
entid buffers) to render the depth buffers for all the lights.
This allows the use of an entity id to index into the entity data and
fetch the transform and colormod data in the vertex shader, thus making
instanced rendering possible. Non-world brush entities are still not
rendered, but the world entity is using both the entity data buffer and
entid buffer.
Sub-models and instance models need an instance data buffer, but this
gets the basics working (and the proof of concept). Using arrays like
this actually simplified a lot of the code, and will make it easy to get
transparency without turbulence (just another queue).
The gl water warp ones have been useless since very early on due to not
doing water warp in gl (vertex warping just didn't work well), and the
recent water warp implementation doesn't need those hacks. The rest of
the removed flags just aren't needed for anything. SURF_DRAWNOALPHA
might get renamed, but should be useful for translucent bsp surfaces
(eg, vines in ad_tears).
One more step towards BSP thread-safety. This one brought with it a very
noticeable speed boost (ie, not lost in the noise) thanks to the face
visframes being in tightly packed groups instead of 128 bytes apart,
though the sw renderer's boost is lost in the noise (but it's very
fill-rate limited).
This is next critical step to making BSP rendering thread-safe.
visframe was replaced with cluster (not used yet) in anticipation of BSP
cluster reconstruction (which will be necessary for dealing with large
maps like ad_tears).
The main goal was to get visframe out of mnode_t to make it thread-safe
(each thread can have its own visframe array), but moving the plane info
into mnode_t made for better data access patterns when traversing the bsp
tree as the plane is right there with the child indices. Nicely, the
size of mnode_t is the same as before (64 bytes due to alignment), with
4 bytes wasted.
Performance-wise, there seems to be very little difference. Maybe
slightly slower.
The unfortunate thing about the change is the plane distance is negated,
possibly leading to some confusion, particularly since the box and
sphere culling functions were affected. However, this is so point-plane
distance calculations can be done with a single 4d dot product.
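The trick, as a sketch: store the plane as (normal, -dist) and extend
points with w = 1, and the signed distance is one 4-component dot product,
which is why the stored distance ends up negated:

    typedef float vec4[4];

    // plane = { nx, ny, nz, -dist },  point = { x, y, z, 1 }
    static inline float
    dot4 (const vec4 a, const vec4 b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
    }

    // d = dot4 (point, plane) = n.p - dist
    // d > 0: front side, d < 0: back side, and the same form extends
    // naturally to the box and sphere culling functions.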
The map uses 41% of a 4k light map scrap, and 512 texture descriptors
wasn't enough for vulkan. Ouch. I do need to get cvars on these things,
but this will do for now (decades later...).
Sounds in Arcane Dimensions (at least those used by ad_tears) specify
start and end cue points. The code was using only the final point in the
list and thus breaking looped sounds. Now, the first cue point is used
as the loop start, and the second (if present), the sample length. Both
are bounds-checked against the wav's sample count. Fixes sound locking
up during the first seconds in ad_tears.
This one is ancient: the code was essentially unmodified since release
(just some formatting). Malformed vectors could sneak through due to map
bugs (eg, "angles -90" instead of "angle -90" as in ad_tears) and the
vector parsing code would continue past the end of the string and write
into unowned memory, potentially messing up the libc allocation
records. Replacing with the obvious sscanf works nicely.
Sometimes, Quake code is brilliant. Other times, it's a real face-palm.
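The "obvious sscanf" version looks something like this (a sketch; the
actual handling of short vectors may differ):

    #include <stdio.h>

    typedef float vec3_t[3];    // QF's vector type

    // Parse "x y z".  A malformed value (eg, "-90" from an "angles -90"
    // map bug) just yields a zero vector instead of the parser running
    // past the end of the string.
    static void
    parse_vector (const char *str, vec3_t vec)
    {
        if (sscanf (str, "%f %f %f", &vec[0], &vec[1], &vec[2]) != 3) {
            vec[0] = vec[1] = vec[2] = 0;
        }
    }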
This fixes the annoying persistence of inputs when respawning and
changing levels. Axis input clearing is hooked up but does nothing as of
yet. Active device input clearing has always been hooked up, but also
does nothing in the evdev and x11 drivers.
It was added only because FitzQuake used it in its pre-bsp2 large-map
support. That support has been hidden in bspfile.c for some time now.
This doesn't gain much other than having one less type to worry about.
Well tested on Conflagrant Rodent (the map that caused the need for
mclipnode_t in the first place).
This was one of the biggest reasons I had trouble understanding the bsp
display list code, but it turns out it was for dealing with GLES's
16-bit limit on vertex indices. Since vulkan uses 32-bit indices,
there's no need for the extra layer of indirection. I'm pretty sure it
was that lack of understanding that prevented me from removing it when I
first converted the glsl bsp code to vulkan (ie, that 16-bit indices
were the only reason for elements_t).
It's hard to tell whether the change makes much difference to
performance, though it seems it might (noisy stats even over 50 timedemo
loops) and the better data localization indicate it should at least be
just as good if not better. However, the reason for the change is
simplifying the data structures so I can make bsp rendering thread-safe
in preparation for rendering shadow maps.
And maybe a nano-optimization. Switching from (~side + 1) to (-side)
seems to give glsl a very tiny speed boost, but certainly doesn't hurt.
Looking at some assembly output for the three cases, the two hacks seem
to generate the same code as each other, but 3 instructions vs 6 for ?:.
While ?: is more generically robust, the hacks are tuned for the
knowledge that side is either 0 or 1. The later xor might alter things, but
at least I now know that the hack (either version) is worthwhile.
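For context, the idiom being compared is conditional negation with side
known to be 0 or 1 (illustrative; the exact expression in the bsp code may
differ):

    // side is known to be 0 or 1
    static inline int
    cond_neg (int value, int side)
    {
        return (value ^ -side) + side;  // -side is 0 or an all-ones mask
        // old hack:     (value ^ (~side + 1)) + side   (~side + 1 == -side)
        // generic form:  side ? -value : value
    }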
This is a particularly ancient bug, sort of introduced by rhamph when he
optimized temp entity model handling and later exacerbated by me.
However, I suspect the actual problem is limited to nq as qw's gamedir
handling would have caused the models to be reloaded, but nq doesn't
ever change game directories once running.
With experience, I have found that trying to continue after a validation
error tends to result in a segfault or some other nastiness, and
Sys_Shutdown (and the full shutdown sequence) is triggered for any error
signal (segfault, abort, etc) so just exit(1).
Although the skin pointer was being advanced after recording the
information in the batch array, it was being reset the next time
around the loop (due to a mistranslation of the previous code). This
fixes the segfault while loading (gl, glsl, vulkan) or rendering (sw)
the sphere model from Rogue.
Some very much needed comments :P Still, nicely, I now have a much
better understanding of how the display lists are created (10 years
is a long time to remember how intricate code works (I do remember
fighting to get it working back then)).
This makes it much easier to see just what is being done to build a
polygon to be passed to the GPU, and it served as a test for the
lightmap st changes since Vulkan currently never used them.
Many modders use negative lights for interesting effects, but vulkan
doesn't like the result of a negative int treated as unsigned when it
comes to texture sizes.
However, this time it doesn't modify the light array when it sorts the
lights by size since the lights are now located before the renderer gets
to see them, and having to fix up the light leafs array would be too
painful (and probably the completely wrong thing to do anyway: the light
array should be treated as constant by the renderer). 1.6GB of memory
for gmsp3v2's lights (a little better than marcher: more smaller lights?).
For reference:
gmsp3v2: shadow maps: 8330 layers in 29 images: 1647706112
marcher: shadow maps: 2440 layers in 11 images: 2358575104
While it wasn't the root cause of the disappearing lights (even after
sorting out the light limit issue), because the cause of that was
everything working as designed, I suspect sunlight wasn't reaching as
far as it should. Even if it was, this should be slightly faster
(especially for larger maps) as leafs can be tested 32 or 64 at a time
rather than individually.
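The 32-or-64-at-a-time test is just a bit-set intersection (sketch,
assuming the leaf sets are packed into 64-bit words, which may not match
the actual layout):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    // true if any leaf is in both sets (eg, a light's leafs vs the pvs)
    static bool
    sets_intersect (const uint64_t *a, const uint64_t *b, size_t words)
    {
        for (size_t i = 0; i < words; i++) {
            if (a[i] & b[i]) {
                return true;
            }
        }
        return false;
    }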
For now, at least (I have some ideas to possibly reduce the numbers and
also to avoid the need for actual limits). I've seen gmsp3v2 use over
500 lights at once (it has over 1300), and I spent too long figuring out
that weird light behavior was due to the limit being hit and lights
getting dropped (and even longer figuring out that more weird behavior
was due to the lack of shadows and the world being too bright in the
first place).
Moving the negation into the calculation of the sun angle prevents -0
getting into the vector (not that it makes much difference other than
minor confusion when reading the light data).
Since the staging buffer allocates the command buffers it uses, it
needs to free them when it is freed. I think I was confused by the
validation layers not complaining about unfreed buffers when shutting
down, but that's because destroying the pool (during program shutdown,
when the validation layers would complain) frees all the buffers. Thus,
due to staging buffers being created and destroyed during the level load
process, (rather large) command buffers were piling up like imps in a
Doom level.
In the process, it was necessary to rearrange some of the shutdown code
because vulkan_vid_render_shutdown destroys the shared command pool, but
the pool is required for freeing the command buffers, but there was a
minor mess of long-lived staging buffers being freed afterwards. That
didn't end particularly well.
This ensures that the plugin's shutdown function won't get called twice
in the event of an error in the plugin's unload sequence triggering a
second Sys_Shutdown, especially if the plugin is being unloaded as a
part of another sub-system's shutdown sequence (which is probably in
itself a design mistake, need to look into that).
While gcc was quite correct in its warning, all I needed was to
explicitly truncate the string. I don't remember why I didn't do that
back when I made the changes in 4f58429137, but it works now, and the
surrounding code does expect the string to be no more than 15 chars
long. This fixes yet another memory leak (but timedemo over multiple
runs still leaks like a sieve).
This is meant for a "permanent" tear-down before freeing the memory
holding the VM state or at program shutdown. As a consequence, builtin
sub-systems registering resources are now required to pass a "destroy"
function pointer that will be called just before the memory holding
those resources is freed by the VM resource manager (ie, the manager
owns the resource memory block, but each subsystem is responsible for
cleaning up any resources held within that block).
This even enhances thread-safety in rua_obj (there are some problems
with cmd, cvar, and gib).
This gives a rather significant speed boost to timedemo demo1: from
about 2300-2360fps up to 2520-2600fps, at least when using
multi-texture.
Since it was necessary for testing the scrap, gl got the ability to set
the console background texture, too.
While it takes one extra step to grab the marksurface pointer,
R_MarkLeaves and R_MarkLights (the two actual users) seem to be either
the same speed or fractionally faster (by a few microseconds). I imagine
the loss due to the extra fetch is made up for by better bandwidth
while traversing the leafs array (mleaf_t now fits in a single cache
line, so leafs are cache-aligned since hunk allocations are aligned).
Unfortunately, the animations are pre-baked (by the loader) blocking
run-time determined animations (IK etc). However, this at least gets
everything working so the basics can be verified (the shader posed some
issue resulting in horror movies ;).
It copies an entire hierarchy (minus actual entities, but I'm as yet
unsure how to proceed with them), even across scenes as the source scene
is irrelevant and the destination scene is used for creating the new
transforms.
Brush models looked a little too tricky due to the very different style
of command queue, so that's left for now, but alias, iqm and sprite
entities are now labeled. The labels are made up of the lower 5 hex
digits of the entity address, the position, and colored by the
normalized position vector. Not sure that's the best choice as it does
mean the color changes as the entity moves, and can be quite subtle
between nearby entities, but it still helps identify the entities in the
command buffer.
And, as I suspected, I've got multiple draw calls for the one ogre. Now
to find out why.
The bones aren't animated yet (and I realized I made the mistake of
thinking the bone buffer was per-model when it's really per-instance (I
think this mistake is in the rest of QF, too)), skin rendering is a
mess, need to default vertex attributes that aren't in the model...
Still, it's quite satisfying seeing Mr Fixit on screen again :)
I wound up moving the pipeline spec in with the rest of the pipelines as
the system isn't really ready for separating them.
The plists can now be accessed by name and the forward render pass
config is available (but not used, or tested beyond syntax). I was going
to have the IQM pipeline spec separate but ran into limitations in the
system (which needs a lot of polish, really).
That @inherit is pretty useful :) This makes it much easier to see how
different pipelines differ or how similar they are. It also makes it
much clearer which sub-pass they're for.
I was wondering why scaled-down quake-guy was dimmer than full-size
quake-guy. And the per-fragment normalization gives the illusion of
smoothness if you don't look at his legs (and even then...).
Maps specify sunlight as shining in a specific direction, but the
lighting system wants the direction to the sun as it's used directly in
shading calculations. Direction correctness confirmed by disabling other
lights and checking marcher's outside scene (ensuring the flat ground
was lit). As a bonus, I've finally confirmed I actually have the skybox
in the correct orientation (sunlight vector more or less matched the
position of the sun in marcher's sky).
I'm not sure what's up with the weird lighting that results from dynamic
lights being directional (sunlight works nicely in marcher, but it has a
unit vector for position).
Abyss of Pandemonium uses global ambient light a lot, but doesn't
specify it in every map (nothing extracting entities and adding a
reasonable value can't fix). I imagine some further tweaking will be
needed.
The parsing of light data from maps is now in the client library, and
basic light management is in scene. Putting the light loading code into
the Vulkan renderer was a mistake I've wanted to correct for a while.
The client code still needs a bit of cleanup, but the basics are working
nicely.
This replaces *_NewMap with *_NewScene and adds SCR_NewScene to handle
loading a new map (for quake) in the renderer, and will eventually be
how any new scene is loaded.
This leaves only the one conditional in the shader code, that being the
distance check. It doesn't seem to make any noticeable difference to
performance, but other than explosion sprites being blue, lighting
quality seems to have improved. However, I really need to get shadows
working: marcher is just silly-bright without them, and light levels
changing as I move around is a bit disconcerting (but reasonable as
those lights' leaf nodes go in and out of visibility).
Id Software had pretty much nothing to do with the vulkan renderer (they
still get credit for code that's heavily based on the original quake
code, of course).
It's not used yet, and thus may have some incorrect settings, but I
decided that I will probably want it at some stage for qwaq. It's
essentially what was in the original spec, but updated for some of the
niceties added to parsing since I removed it back then. It's also in its
own file.
Just "loading" and "unloading" (both really just hints due to the
caching system), and an internal function for converting a handle to a
model pointer, but it let me test IQM loading and unloading in Vulkan.
The model system is rather clunky as it is focused around caching, so
unloading is more of a suggestion than anything, but it was good enough
for testing loading and unloading of IQM models in Vulkan.
Despite the base IQM specification not supporting blend-shapes, I think
IQM will become the basis for QF's generic model representation (at
least for the more advanced renderers). After my experience with .mu
models (KSP) and unity mesh objects (both normal and skinned), and
reviewing the IQM spec, it looks like with the addition of support for
blend-shapes, IQM is actually pretty good.
This is just the preliminary work to get standard IQM models loading in
vulkan (seems to work, along with unloading), and the very basics into
the renderer (most likely not working: not tested yet). The rest of the
renderer seems to be unaffected, though, which is good.
The resource subsystem creates buffers, images, buffer views and image
views in a single batch operation, using a single memory object to back
all the buffers and images. I had been doing this by hand for a while,
but got tired of jumping through all those vulkan hoops. While it's
still a little tedious to set up the arrays for QFV_CreateResource (and
they need to be kept around for QFV_DestroyResource), it really eases
calculation of memory object size and sub-resource offsets. And
destroying all the objects is just one call to QFV_DestroyResource.
I might need to do similar for other formats, but I ran into the problem
of the texture type being tex_palette instead of the expected tex_rgba
when pre-(no-)loading a tga image resulting in Vulkan not liking my
attempt at generating mipmaps.
This allows the fuzzy bsearch used to find a def by address to work
properly (ie, find the actual def instead of giving some other def +
offset). Makes for a much more readable instruction stream.
The scene id is in the lower 32-bits for all objects (upper 32-bits are
0 for actual scene objects) and entity/transform ids are in the upper
32-bits. Saves having to pass around a second parameter in progs code.
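In other words (macro names made up for illustration):

    #include <stdint.h>

    // scene objects:       high 32 bits 0, low 32 bits the scene id
    // entities/transforms: high 32 bits the object id, low 32 bits the
    //                      scene id
    #define MAKE_ID(obj, scene) (((uint64_t) (obj) << 32) | (uint32_t) (scene))
    #define ID_SCENE(id)        ((uint32_t) (id))
    #define ID_OBJECT(id)       ((uint32_t) ((id) >> 32))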
pr_type_t now contains only the one "value" field, and all the access
macros now use their PACKED variant for base access, making access to
larger types more consistent with the smaller types.
Vulkan doesn't appreciate the empty buffers that result from the model
not having any textures or surfaces that can be rendered (rightfully so,
for such a bare-metal api).
I doubt the calls were ever actually made in a normal map due to the
node actually being a node when breaking out of the loop, but when I
experimented with an empty world model (no nodes, one infinite empty
leaf) I found that visit_leaf was getting called twice instead of once.
Since it is updated every frame, it needs to be as fast as possible for
the cpu code. This seems to make a difference of about 10us (~130 ->
~120) when testing in marcher. Not a huge change, but the timing
calculation was wrapped around the entire base world pass, so there was
a fair bit of overhead from bsp traversal etc.
It makes a significant difference to level load times (approximately
halves them for demo1 and demo2). Nicely, it turns out I had implemented
the rest of the staging buffer code (in particular, flushing) correctly
in that it seems there's no corruption in any of the data.
They're really redundant, and removing the next pointer makes for a
slightly smaller cvar struct. Cvar_Select was added to allow finding
lists of matching cvars.
The tab-completion and config saving code was reworked to use the hash
table DO functions. Comments removed since the code was completely
rewritten, but still many thanks to EvilTypeGuy and Fett.
Hash_Select returns a list of elements that match a given criterion
(select callback returning non-0).
Hash_ForEach simply calls a function for every element.
And use it for hud_scoreboard_gravity. Putting the enum def in view made
the most sense as view does own the base type and the enum is likely to
be useful for other settings.
I think I'd gotten distracted while making the changes to the server,
then simply copied the partial changes to the client. It didn't blow up
thanks to the backing store being char * and the type sized for int, so
safe on any platform, but useless as it wasn't connected properly.
It's actually pretty neat being able to directly, but safely, control a
function pointer via a cvar :)
The misinterpretations were due to either the cvar not being accessed
directly by the engine, but via only the callback, or the cvars were
accessed only by progs (in which case, they should be float). The
remainder are a potential enum (hud gravity) and a "too hard basket"
(rcon password: need to figure out how I want to handle secret strings).
Other parts of quakefs treat an empty path as an error, so fs_sharepath
and fs_userpath must never be empty or they will effectively be
rejected. While the user explicitly setting them to empty strings is one
way for them to become empty, another is QFS_CompressPath compressing
'.' to an empty path, which makes it rather difficult to set up the
traditional quake directory tree (ie, operate from the current
directory).
My script didn't know what type to make the cvars since they're not used
directly by the code, so they got treated as strings instead of ints or
floats.
This is an extremely extensive patch as it hits every cvar, and every
usage of the cvars. Cvars no longer store the value they control,
instead, they use a cexpr value object to reference the value and
specify the value's type (currently, a null type is used for strings).
Non-string cvars are passed through cexpr, allowing expressions in the
cvars' settings. Also, cvars have returned to an enhanced version of the
original (id quake) registration scheme.
As a minor benefit, relevant code having direct access to the
cvar-controlled variables is probably a slight optimization as it
removed a pointer dereference, and the variables can be located for data
locality.
The static cvar descriptors are made private as an additional safety
layer, though there's nothing stopping external modification via
Cvar_FindVar (which is needed for adding listeners).
While not used yet (partly due to working out the design), cvars can
have a validation function.
Registering a cvar allows a primary listener (and its data) to be
specified: it will always be called first when the cvar is modified. The
combination of proper listeners and direct access to the controlled
variable greatly simplifies the more complex cvar interactions as much
less null checking is required, and there's no need for one cvar's
callback to call another's.
nq-x11 is known to work at least well enough for the demos. More testing
will come.
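Very roughly, registration under the new scheme looks something like the
following. The field and function names here are from memory and purely
illustrative, so treat this as a sketch of the shape of the API rather than
the exact declarations:

    static int r_example;   // the variable the engine reads directly

    static cvar_t r_example_cvar = {
        .name = "r_example",
        .description = "Example cvar backed directly by r_example",
        .default_value = "1",
        .flags = CVAR_NONE,
        .value = { .type = &cexpr_int, .value = &r_example },
    };

    static void
    r_example_listener (void *data, const cvar_t *cvar)
    {
        // primary listener: always called first when the cvar changes
    }

    static void
    r_example_init (void)
    {
        // the listener (and its data) are given at registration time
        Cvar_Register (&r_example_cvar, r_example_listener, 0);
    }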
The prefix gives more context to the error messages, making the system a
lot easier to use (it was especially helpful when getting my cvar revamp
into shape).
Based on the flags type used in vkparse (difference is the lack of
support for plists). Having this will make supporting named flags in
cvars much easier (though setting up the enum type is a bit of a chore).
This allows for easy (and safe) printing of cexpr values where the type
supports it. Types that don't support printing would be due to being too
complex or possibly write-only (eg, password strings, when strings are
supported directly).
Surprisingly, only two, but they were caught by the different value
fields being used, thus the cvar was checked in multiple places. I
imagine that's not really all that common, so there may be some
inconsistencies between default value and use.
This is progress towards #23. There are still some references to
host_time and host_client (via nq's server.h), and a lot of references
to sv and svs, but this is definitely a step in the right direction.
This allows a single render pass description to be used for both
on-screen and off-screen targets. While Vulkan does allow a VkRenderPass
to be used with any compatible frame buffer, and vkparse caches a
VkRenderPass created from the same description, this allows the same
description to be used for a compatible off-screen target without any
dependence on the swapchain. However, there is a problem in the caching
when it comes to targeting outputs with different formats.
As I had suspected, it's due to a synchronization problem between the
scrap and drawing. There's actually a double problem in that data
uploaded to the scrap isn't flushed until the first frame is rendered
causing a quick init-shutdown sequence to take at least five seconds due
to the staging buffer waiting (and timing out) on a stuck fence.
Rendering just one frame "fixes" the problem (draw was one of the
earliest subsystems to get going in vulkan).
The improved allocation overheads have been implemented for gl and sw,
and glsl no longer uses malloc. Using array textures will have to wait
as the current texture loading code doesn't support them.
Really, this won't make all that much difference because alias models
with more than one skin are quite rare, and those with animated skin
groups are even rarer. However, for those models that do have more than
one skin, it will allow for reduced allocation overheads, and when
supported (glsl, vulkan, maybe gl), loading all the skins into an array
texture (since all skins are the same size, though external skins may
vary). That's not implemented yet, though; this just wraps the old
one-skin-at-a-time code.
While looking at the deferred attachment images with a template in
mind, I noticed that the opaque attachment was using 8-bit color. The
problem is, it's meant to be HDRI with the compose pass crunching it
down to LDRI. Switching to 16-bit float does seem to have made a subtle
difference (hey, it's still quake data, not much HDRI in there).
That certainly makes it nicer to work with large sets, and shows one way
to be careful with allocated resources: don't allocate them in the
inherited data and use a template that needs a few things filled in to
be valid. Also, it seems that overriding values in sub-structures "just
works" :)
It simply parses the referenced plist dictionary (via @inherit =
plist.path;) into the current data block, then allows the data to be
overwritten by the current plist dictionary. This may be a bit iffy for
any allocated resources, so some care must be taken, but it seems to
work nicely.
This makes much more sense as they are intimately tied to the frame
buffer on which a render pass is working. Now, just the window width
and height are stored in vulkan_ctx_t. As a side benefit,
QFV_CreateSwapchain no longer references viddef (now just palette and
conview in vulkan_draw.c to go).
While I have trouble imagining it making that much performance
difference going from 4 verts to 3 for a whopping 2 polygons, or even
from 2 triangles to 1 for each poly, using only indices for the vertices
does remove a lot of code, and better yet, some memory and buffer
allocations... always a good thing.
That said, I guess freeing up a GPU thread for something else could make
a difference.
I think I had gotten lucky with captures not being corrupt due to them
being much bigger than all but the L3 cache (and then they're over 1/2
the size), so the memory was being automatically invalidated by other
activity. Don't want to trust such luck, though.
This means that a tex_t object is passed in instead of just raw bytes
and width and height, but it means the texture can specify whether it's
flipped or uses BGR instead of RGB. This fixes the upside down
screenshots for vulkan.
This fixes (*ahem*) the vulkan renderer segfaulting when attempting to
take a screenshot. However, the image is upside down. Also, remote
snapshots and demo capture are broken for the moment.
QFS_NextFilename was renamed to QFS_NextFile to reflect the fact it now
returns a QFile pointer for the newly created file (as well as the
name). This necessitated updating WritePNG to take a file pointer
instead of a file name, with the advantage that WritePNGqfs is no longer
necessary and callers have much more control over the creation of the
file.
This makes QFS_NextFile much more secure against file system race
conditions and attacks (at least in theory). If nothing else, it will
make it more robust in a multi-threaded environment.
It's not there yet as it promptly closes the file and returns only the
filename (and then only the portion within the user's directory tree).
However, this worked nicely as a test for Sys_UniqueFile.
QF currently uses unique file names for screenshots and server-side
demos (and remote snapshots), but they're generally useful.
QFS_NextFilename has been filling this role, but it is highly insecure
in its current implementation. This is the first step to taking care of
that.
The tests fail due to differences in how clang and gcc treat floating
point to unsigned integral type conversions when the values overflow. It
wouldn't be so bad if clang was consistent with conversions to 32-bit
unsigned integers, like it seems to be with conversion to 64-bit
unsigned integers.
With this, the "get QF building with clang" mini-project is done and I
won't have to panic when someone comes to me and asks if it will work.
At worst, there'll be a little bit-rot.
Only edicts themselves get a smaller alignment (4, 8 or 32 bytes,
depending on hardware and progs version). I didn't want to waste too
much memory on edict alignment for progs that don't need any better than
void *, but the zone really wants 64 bytes, and the stack might as well
have 64 bytes as well. Fixes a segfault when running v6 progs in a clang
build (clang got really aggressive with simd in zone.c).
gcc and clang have rather different swizzle builtins, but both do a nice
job of optimizing the intuitive initializer swizzle (I think gcc 8(?)
didn't do such a good job thus my use of __builtin_shuffle).
clang doesn't like anything but a bare 0 as null (and in some of the
cases, it was quite right: '\0' should not be treated as a null
pointer). And the crashers were just for paranoia and probably aren't
needed any more (kept for now, though).
It seems clang defaults to unsigned for enums. Interestingly, gcc was ok
with the checks being either way. I guess gcc treats enums that *can* be
unsigned as DWIM.
Still work with gcc, of course, and I still need to fix them properly,
but now they're actually slightly easier to find as they all have vec_t
and FIXME on the same line.
Viewport and FOV updates are now separate so updating one doesn't cause
recalculations of the other. Also, perspective setup is now done
directly from the tangents of the half angles for fov_x and fov_y making
the renderers independent of fov/aspect mode. I imagine things are a bit
of a mess with view size changes, and especially screen size changes
(not supported yet anyway), and vulkan winds up updating its projection
matrices every frame, but everything that's expected to work does
(vulkan errors out for fisheye or warp due to frame buffer creation not
being supported yet).
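A minimal sketch of the idea (names are hypothetical, not the actual
API): if the renderer is handed the tangents of the half-angles
directly, it never needs to know which of fov_x, fov_y or the aspect
ratio was the primary value.

    /* x/y scale terms of a perspective projection built straight from
       the tangents of the half-angles */
    typedef struct {
        float sx, sy;
    } persp_scale_t;

    static persp_scale_t
    persp_from_tans (float tan_half_x, float tan_half_y)
    {
        return (persp_scale_t) {
            .sx = 1.0f / tan_half_x,
            .sy = 1.0f / tan_half_y,
        };
    }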
If the entity didn't have a known model type, R_StoreEfrags would get
stuck in an infinite loop (fortunately, this never actually happened; it
was the result of making it not call Sys_Error for unknown models).
I meant to do this a while ago but forgot about it. Things are a bit of
a mess in that the renderer knows too much about entities, but
eventually the renderer will know about only things to render (meshes,
particles, etc).
The quake-specific enums are now in the client header, and the particle
system now has a gravity field rather than getting it from
vid_render_data (which I hope to eventually get rid of entirely).
r_refdef is really meant for holding the various screen "constants" for
the software renderer rather than the more generic scene stuff. All the
fields referenced by the low level rendering code (especially assembly)
have been moved to the beginning of the struct (and nicely fit within 64
bytes). The other fields should be moved elsewhere, but not this commit.
On top of that, R_ViewChanged is much easier to read, and there are
fewer static globals.
Now GL perspective matrix setup matches that of GLSL and Vulkan, and
GL's z_up matrix matches GLSL's (as it should, since they're really
going through the same API). GL also needs the depth adjustment matrix
now. Other than having to google the docs for glFrustum, there's nothing
wrong with the function itself, but it's nice to have direct control
over the matrices.
In the process, I discovered how horribly confused I've been at times
with respect to the handedness of GL and Quake: GL is right-handed
(y-up, z-out, x-right), as is Quake itself (but z-up, y-left, x-in), but
as the perspective matrix used in the three renderers expects z-in,
having x-right and y-up makes the matrix effectively left-handed (not
for Vulkan though, because there it's y-down, x-right, z-up, so
right-handed again).
Of course, it's not as correct as glsl or sw due to using polygons and
uvs rather than a fragment shader (not that such is out of the question
since GL 3.0 is requested, but I don't feel like getting shaders going
just for a couple of post-processing effects in an obsolete renderer).
While it's not where I want it to be, it at least now no longer messes
with frame buffer binding or the view ports. This involved switching
around buffers in D_WarpScreen so that the main buffer could be bound
before post-processing.
The cvar setup for particles is a bit wonky in that the arrays get
initialized using the default max particle count but never updated.
Though things could be improved some more, this solution works (and has
been more or less copied to gl, but I couldn't reproduce the crash
there, or even the valgrind error).
The code dealing with state is a bit of a mess, but everything is
working nicely. Get around 400fps when all 6 faces need to be rendered
(no surprise: that's about 1/6 of the rate for normal rendering). The
messy state handling code did not come as a surprise as I suspected
there were various mistakes in my scene rendering "recipe", and fisheye
highlighted them nicely (I'm sure getting this stuff working in Vulkan
will highlight even more issues).
Finally, after a decade :P Looks pretty good, too, and is (almost)
properly scaled to the resolution (almost because the effect is a little
squashed, but I think the sw renderer does the same).
The GLSL compiler requires any #version lines to be the first (real)
line of the program, even #line causes an error, so if the first line of
the chunk starts with #version, insert the #line directive as the second
line.
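A sketch of the rule (a hypothetical helper, not the actual shader
loader): the only special case is a chunk whose first line starts with
#version, where the #line directive has to come second. The exact
numbers depend on GLSL's off-by-one #line rules, so treat them as
illustrative.

    #include <stdio.h>
    #include <string.h>

    static void
    emit_chunk (FILE *out, const char *chunk, int file_id)
    {
        if (strncmp (chunk, "#version", 8) == 0) {
            const char *eol = strchr (chunk, '\n');
            size_t len = eol ? (size_t) (eol - chunk) + 1 : strlen (chunk);
            fwrite (chunk, 1, len, out);            /* #version stays first */
            fprintf (out, "#line 2 %d\n", file_id); /* then the #line */
            fputs (chunk + len, out);
        } else {
            fprintf (out, "#line 1 %d\n", file_id);
            fputs (chunk, out);
        }
    }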
Again, gl/vulkan not working yet (on the assumption that sw would be
trickier).
Fisheye overrides water warp because updating the projection map every
frame is far too expensive.
I've added a post-process pass to the interface in order to hide the
implementation details, but I'm not sure I'm happy about how the
multi-pass rendering for cube maps is handled (or having the frame
buffers as exposed as they are), mainly because Vulkan will make the
implementation interesting.
For now, OpenGL and Vulkan renderers are broken as I focused on getting
the software renderer working (which was quite tricky to get right).
This fixes a couple of issues: the segfault when warping the screen (due
to the scene rendering move invalidating the warp buffer), and warp
always having 320x200 resolution. There's still the problem of the
effect being too subtle at high resolution, but that's just a matter of
updating the tables and tweaking the code in D_WarpScreen.
Another issue is the Draw functions should probably write directly to
the main frame buffer or even one passed in as a parameter. This would
remove the need for binding the main buffer at the beginning and end of
the frame.
This used to be handled by R_RenderView (encompassing all of the
rendering) before the scene rendering was moved out to r_screen. This
fixes the stuck time in 32-bit nq-win.
Its guts have been moved to D_Init temporarily while I work on the
frame buffer design. This is actually a big part of that work as it
moves most of the frame buffer creation into the one place, making it
easier to ensure I get all the sub-buffers and caches created.
With what I have planned for frame buffers etc, GL 3.0 will be needed
even for the fixed-function GL renderer, and then I might even take the
GLSL renderer to 4.6 (dunno yet). This means that wgl will need to be
updated too, and I've found the info I need for that, but it's a bit
much to take on just yet.
I think the widespread use of recalc_refdef (and force_fullscreen) was
the result of a rushed merge of the renderer and video code (I do seem
to remember sprinkling them around). This cleans the two out of the
client code.
This avoids the possibility of a singularity (and thus the temptation to
use Sys_Error). While the rendering is rubbish, 0 degrees is allowed
because values less than 1 should be allowed, but where does one stop?
170 is the maximum in order to avoid any issues with (near) parallel or
inverted frustum planes (or other fun things) in the low level code.
Other than the view model (undecided on the approach) this has
R_RenderView pretty much pulled out of the low level renderers. With
this, I'll be able to focus on scene handling for a bit then getting
shadows and fisheye working (again for fisheye).
r_screen isn't really the right place, but it gets the scene rendering
out of the low-level renderers and will make it easier to sort out
later, and hopefully easier to figure out a good design for vulkan.
gl_overbright_f shouldn't need to run through any entity queues to
update the light maps as only the world model has light maps, and
hitting the world model should hit all its sub-models.
The change to using separate per-model-type entity queues resulted in
the lighting vector used for alias and iqm models being in an ephemeral
location (in the shared setup_lighting function's stack frame). This
resulted in the model rendering code getting a garbage vector due to it
being overwritten by another stack frame. What I don't get is why the
garbage varied from run to run for the same demo (demo2, the first scrag
behind the start door showed the bad lighting nicely), which made
tracking down the offending commit (and thus the code) rather
troublesome, though once I found it, it was a bit of a face-palm moment.
Move r_pcurrentvertbase into the sw renderer, cleaning up gl's use of it
(not really needed there). Not ready to move r_bsp into the main bin yet
as there are linking issues since only the low-level code references any
of its symbols.
The code is really part of scene (not a typo wrt r_screen: that is
misnamed as such, or at least SCR_UpdateScreen needs to be split into
screen (2d overlay, really) and scene updates).
This breaks fisheye rendering as the fisheye code calls the actual scene
render code multiple times, but the fisheye code is called by said scene
render code via a diversion. The fisheye needs to be moved out to the
high level scene render, but that will take some extra work for frame
buffer setup.
The two aren't compatible (but warping might be doable in the fisheye
code). The whole frame setup code needs a rework, and really, even the
buffer handling.
It being on the stack was a bad idea as R_RenderWorld returns before the
scans are rendered and thus the entity pointer winds up pointing to
abandoned stack space.
While the scheme of using our own allocated texture ids did work just fine,
rendering uses glGenTextures which caused a texture id clash and thus
invalid operations (the cube map texture happened to be the same as the
console background texture). Sure, I could have just "fixed" the fisheye
init code, but this brings gl closer in line with glsl (which makes
extensive use of glGenTextures and glDeleteTextures). This doesn't fix
any texture leaks gl has (plenty, I imagine), but it's a step in the
right direction.
Only for gl and sw at the moment (want to merge things further before I
do anything for glsl or vulkan). However, with what I've learned getting
gl and sw to work, glsl and vulkan will be trivial.
R_RecursiveLightUpdate has been obsolete for a very long time, and
R_Mirror is just wrong (needs envmaps etc, wonder if it can be done in
the fixed function code using skyclip?)
Finally. I never liked it (felt bad adding it in the first place), and
it has caused confusion with function and global variable names, but it
did let me get the render plugins working.
They're still slightly confusing, but so is the situation itself; the
comments should be a little more helpful now as they are more
explicit about the orientation of the matrices and just which axis
points where.
This moves the common camera setup code out of the individual drivers,
and completely removes vup/vright/vpn from the non-software renderers.
This has highlighted the craziness around AngleVectors with it putting
+X forward, -Y right and +Z up. The main issue with this is it requires
a 90 degree pre-rotation about the Z axis to get the camera pointing in
the right direction, and that's for the native sw renderer (vulkan needs
a 90 degree pre-rotation about X, and gl and glsl need to invert an
axis, too), though at least it's just a matrix swizzle and vector
negation. However, it does mean the camera matrices can't be used
directly.
Also rename vpn to vfwd (still abbreviated, but fwd is much clearer in
meaning (to me, at least) than pn (plane normal, I guess, but which
way?)).
I'd been considering it for a while, but in the end, all the issues it
presented made me decide it wasn't worth merging and was never really
worth keeping: it was a neat proof of concept but of little actual use,
especially now everyone either has an OK GPU or would want to stick to
8-bit rendering anyway (sorry L-Havoc).
However, both it and my merge work are preserved in git history :)
16 and 32 bit rendering are disabled at the moment because there's a
weird segfault I need to fix, but the 8-bit dynamic lights are doing
weird things (for x11, too) when updating the light maps.
I got tired of having to maintain two separate software renderers, but
didn't want to just nuke sw32, so its core changes are merged into sw.
Alias model rendering is broken, but I know exactly what's wrong and how
to fix it, just need to take care due to asm.
So far, in gl and glsl, but viewposition is much clearer than r_origin
(despite being the same thing), and modelorg is just confusing (I think
it's the view position relative to the current model).
GL still has its own functions for enabling and disabling fog while
rendering, but GLSL doesn't need such (thanks to the shaders), nor will
vulkan (and the software renderers don't support fog).
This is a step towards high-level unification of the renderers, as far
as possible keeping only actual low-level implementation details in the
individual renderers (some higher level stuff, eg shadows, is expected
to be per-renderer as some things are just not feasible to implement in
all renderers). However, the idea is to move the high-level
functionality into scene rendering.
As qwaq doesn't yet do any 3d rendering, it doesn't use efrags and thus
wasn't pulling in the object file, but the various renderers were trying
to access it. And I thought plugin builds were more difficult (I had
forgotten).
Only CaptureBGR is per-renderer as the rest of the screenshot code uses
it to do the actual capture (which is target dependent). Vulkan is
currently broken due to capture being an asynchronous process and the
rest of the code expecting capture to be synchronous (also, bgr vs rgb).
The best thing is all renderers now write the same format (currently
png).
I'm not sure what the author of that code was thinking (maybe trying to
do 4 pixels at a time?), but the resulting code still did only one.
Better to remove all the casts, use the right pointer type, and keep the
code clear.
Drawing sky chains first ensures that sky surfaces correctly block parts
of the map that should not be visible (by writing the correct depth to
the depth buffer when doing box or dome skies). Writing brush models
first means that the models (ammo boxes etc) could be visible when they
should not be.
While there's currently only the one still, this will allow the entities
to be multiply queued for multi-pass rendering (eg, shadows). As the
avoidance of putting an entity in the same queue more than once relies
on the entity id, all entities now come from the scene (which is stored
in cl_world in the client code for nq and qw), thus the extensive
changes in the clients.
The root transform of each hierarchy can be extracted from the first
transform of the list in the hierarchy, so no information is lost. The
main reason for the change is I discovered (obvious in hindsight) that
deleting root transforms was O(n) due to keeping them in an array, thus
the use of a linked list (I don't expect a hierarchy to be in more than
one such list), and I didn't want the transforms to be in a linked list.
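For illustration (generic code, not the actual hierarchy types), the win
is O(1) unlinking of a root hierarchy versus O(n) removal from an array:

    /* intrusive doubly-linked list in the style common in the Quake
       codebase: prev points at the link that points at this node */
    typedef struct hierarchy_s {
        struct hierarchy_s  *next;
        struct hierarchy_s **prev;
    } hierarchy_t;

    static void
    unlink_hierarchy (hierarchy_t *h)
    {
        if (h->next)
            h->next->prev = h->prev;
        *h->prev = h->next;         /* O(1), no array shuffling */
    }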
GL and GLSL were drawing the view model after particles instead of
before. For GL, this is likely due to avoiding fog affecting the view
model (which I think is not the right thing to do), and GLSL due to
copying GL (because I had no idea at the time). This makes the two
renderers consistent with the software renderers, and might even speed
things up a little as that's one less set of blends to do when the
particles are covered by the view model (I don't expect much
difference).
While I doubt the difference is all that significant, this should speed
up entity rendering because it cuts out a lot of branching, and
eliminates scanning the same list multiple times only to not do anything
for large chunks of the list.
Since transforms now know the scene to which they belong, and they know
when they are root and when not, getting the transform code to manage
the scene roots is the best way to keep the list of root transforms
consistent.
It turns out cam_controls is for pointing the player model in the
direction of movement rather than controlling the camera (I should add
proper camera controls).
I finally spent the time to work out what it was trying to say. Still
not sure it's clear, but what is clear is that there was probably some
disagreement at Id about the orientation of the world.
They no longer spin like crazy. I don't know how, but I must have broken
something over the years as I'm sure Seth had the code working (and I
seem to remember seeing it working). In the process, clean up a lot of
the angle mess.
It's a lot easier to read (and see the difference between modes 2 and 3)
with all the ifs removed, and the state is properly in chasestate_t now
(though not handled properly on level reset etc).
The more advanced modes are rather broken (continuous spinning), but
they may have been for a while. The bulk of the various changes were due
to renaming viewstate's origin and angles to make their meaning more
explicit.
They've been near-identical for years, now they're only one. It proved
necessary to start merging the HUD code which for now is just a few cvar
declarations (not even init), but that should be a separate set of
commits.
The actual view and projection matrices are now consistent with vulkan,
with the vulkan-gl disparity moved into adjustment matrices. The goal is
to allow the same camera data and code to be used in all renderers. The
extra matrix multiplication shouldn't be too expensive as it occurs only
when the field of view (not often, under user control) or near and far
clip distances (very rarely) change.
It holds the data for a basic 3d camera (transform, fov, near and far
clip). Not used yet as there is much work to be done in cleaning up the
client code.
Handling of view angles is a little hacky at the moment, but this gets
the chase camera code and most of the common input code into one place,
which will make cleaning up the camera code much easier.
While both matrices had positive determinants in the first place, I find
the projection matrix easier to understand without all the negatives,
and having quake-x/vulkan-z positively parallel in the z-up matrix makes
that a lot easier to think about.
Regardless of whether the sky is spinning or not, the matrix needs to be
updated with the current origin in order to get the direction vector
right in the shader. Also, the required x-y plane rotation is applied in
the update so the skies move in the correct direction.
This actually has at least two benefits: the transform id is managed by
the scene and thus does not need separate management by the Ruamoko
wrapper functions, and better memory handling of the transform objects.
Another benefit that isn't realized yet is that this is a step towards
breaking the renderers free of quake and quakeworld: although the
clients don't actually use the scene yet, it will be a good place to
store the rendering information (functions to run, etc).
I've run into a bit of an issue with transform management (really, just
need to make them owned by the scene, but that means creating a scene
for quake and quakeworld).
This is the bulk of the work for recording the resource pointer with
builtin data. I don't know how much of a difference it makes for
most things, but it's probably pretty big for qwaq-curses due to the
very high number of calls to the curses builtins.
Closes #26
The zone memory block header is 64 bytes, so allocating a single 8 byte
selector is rather wasteful. Instead, allocate selectors in large chunks
(currently 64) and divvy them out as needed. Significantly reduces
memory pressure in large Ruamoko progs.
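A minimal sketch of the chunking scheme (names and details are
assumptions, and the real code allocates from the zone rather than with
calloc): selectors come from a free list refilled 64 at a time, so the
64-byte block header is paid once per chunk instead of once per
selector.

    #include <stdlib.h>

    #define SEL_CHUNK 64

    typedef struct sel_s {
        struct sel_s *next;         /* free-list link while unused */
        /* ... selector payload ... */
    } sel_t;

    static sel_t *free_sels;

    static sel_t *
    alloc_sel (void)
    {
        if (!free_sels) {
            /* error handling omitted for brevity */
            sel_t *chunk = calloc (SEL_CHUNK, sizeof (sel_t));
            for (int i = 0; i < SEL_CHUNK; i++) {
                chunk[i].next = free_sels;
                free_sels = &chunk[i];
            }
        }
        sel_t *sel = free_sels;
        free_sels = sel->next;
        return sel;
    }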
These add legacy support for basic float bitops (& | ^ ~). Avoiding the
instructions would require not only the source to be converted, but also
the servers (as they do access those fields), and this seemed to be too
much.
It's not enforced at this stage, and it would be easy enough to handle,
but it turns out all the standard quake and quakeworld progs never used
... for the print functions: the behavior of PF_VarString was
undocumented and so... tough :P.
I had forgotten that unsigned division was different from signed
division (rather silly of me). However, with some testing and analysis,
unsigned true modulo is not needed as it's not possible to have
negative inputs and thus it's the same as remainder.
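The reasoning, spelled out with a trivial example (illustrative only):
truncated remainder and mathematical modulo differ only when an operand
is negative, which can't happen with unsigned inputs.

    #include <assert.h>

    int
    main (void)
    {
        unsigned a = 17, b = 5;
        unsigned rem = a % b;                   /* remainder: 2 */
        unsigned mod = ((a % b) + b) % b;       /* true modulo: also 2 */
        assert (rem == mod);                    /* always holds for unsigned */
        return 0;
    }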
It now takes the function name to print in error message (passed on to
PR_Sprintf) and the argument number of the format string. The variable
arguments (in ...) are assumed to be immediately after the format
argument.
This loads the current return pointer into the specified register. No
offset is used (should make that an error, but for now any offset is
simply ignored). This is part of the fix for getting obj_msg_sendv to
work with return values.
With the return buffer in progs_t, it could not be addressed by the
progs on 64-bit machines (this was intentional, actually), but in order
to get obj_msg_sendv working properly, I needed a way to "bounce" the
return address of a calling function to the called function. The
cleanest solution I could think of was to add a mode to the with
instruction allowing the return pointer to be loaded into a register and
then calling the function with a 0 offset for the return value but using
the relevant register (next few commits). Testing promptly segfaulted
due to the 64-bit offset not fitting into a 32-bit value.
This gets message forwarding apparently working, though something isn't
quite right as qwaq-app doesn't update properly when I try to step
through the program, but that could be an error elsewhere.
The plan is to use the types to extract the number of parameters for a
selector when it is necessary to know the count. However, it'll probably
become useful for something else later (these things seem to always do
so).
This takes care of the problems with PR_RESET_PARAMS (which has recently
become just a wrapper for PR_SetupParams) changing the stack and causing
PR_CallFunction to save the wrong stack pointer. Message forwarding is
currently broken for Ruamoko ISA progs, but that is due to not having a
valid pr_argc. However, I do have a plan involving extracting the
parameter count from the selector, but that's something for a later
commit. Everything else seems to be ok (my little game is working
nicely).
rua_obj was skipped because that looks to be a bit more work and should
be a separate commit.
This is to avoid the stack getting mangled when calling progs functions
with parameters.
I suppose having one builtin call another was a neat idea at the time,
and really could have been fixed by simply wrapping the calls with
push/pop frame, but this is probably faster.
obj_msg_sendv needs to push the parameters onto the stack for Ruamoko
progs, but this causes problems because PR_CallFunction winds up
recording the wrong stack pointer for progs functions, and nothing
restores the stack for builtins. The handling is basically the same as
for the return pointer.
It's a bit disconcerting seeing a builtin in the top 10 when builtins
are counted by call while progs functions are counted by instruction.
Also, show the total profile after the function top-10 list.
pr_argc cannot be used in Ruamoko progs because nothing sets it. This
fixes the parse errors and resulting segfault when trying to parse the
Vulkan pipeline config.
It's currently only 4 (or even 3 for v6) words, but this fixes false
positives when checking for null pointers in Ruamoko progs due to
pr_return pointing to the return buffer and thus outside the progs
memory map resulting in an impossible to exceed value.
Since Z_Malloc uses Z_TagMalloc to do the work, this ensures the check
is always run.
Also, add the check to Z_Realloc when it needs to adjust an existing
block.
Builtins that call progs with parameters now must always wrap the call
to PR_ExecuteProgram so that the data stack is properly preserved across
the call.
I need to do an audit of all the calls to PR_ExecuteProgram.
It turns out the return pointer still needs to be saved even when a
builtin sets up a chain call to progs, but rather than the pointer being
simply restored, it needs to be saved in the call stack exactly as if
the function was called directly by progs. This fixes the invalid self
issue quite thoroughly: parameter state seems to be correct across all
calls now.
I should set up an automated test now that I know and understand the
situation.
In Ruamoko ISA progs, the param pointers point to the stack and
generally must not be manipulated by builtins, and there is no need
anyway as Ruamoko doesn't have RCALL. Fixes the mangling of .super.
When calling a builtin, normally the return pointer needs to be
restored, but if the builtin changes the call depth (usually by
effecting "return foo()" as in support for objects, but possibly
setjmp/longjmp when they are implemented), then the return pointer must
not be restored. This gets vkgen past object allocation, but it dies
when trying to send messages to super. This appears to be a compiler
bug.
Since the operand types sort out the difference between asr and shr, no
need to give them different opnames. Means qfcc doesn't need to worry
about which one it's searching for.
Yet another redundant addressing mode (since ptr + 0 can be used), so
replace it with a variable-indexed array (same as in v6p). Was forced
into noticing the problem when trying to compile Machine.r.
I abandoned the reason for doing it (adding a pile of vector types), but
I liked the cleanup. All the implementations are hand-written still, but
at least the boilerplate stuff is automated.
Of course, only in Ruamoko progs, but it works quite nicely.
global_string is now passed the absolute address of the referenced
operand. With a little groveling through the progs stack, it should be
possible to resolve pointers to locals in functions further up the
stack.
This fixes Ruamoko's return format string. It looks like it's producing
the correct address (but doesn't show all the information it should),
but the rest of the debug code still needs work for locals.
It turned out I need locals count and params_start for debugging, so use
the progs version instead to bail early from PR_EnterFunction and
PR_LeaveFunction (which I had forgotten anyway, oops).
They now include base register index and effective address of the
operands (though it may be wrong for instructions that don't use a base
register for that operand).
This cleans up dprograms_t, making it easier to read and see what chunks
are in it (I was surprised to see only 6, the explicit pairs made it
seem to have more).
Intel hardware requires 32-byte alignment for lvec4 and dvec4.
Unfortunately, it turns out that my attempts to align progs data in qfcc
went awry due to the order in which block sizes are calculated when
writing the progs.
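For illustration (an assumption about the declarations, not a copy of
the headers), the vector types in question amount to something like
this, which is where the 32-byte requirement comes from:

    /* 4 x 64-bit elements = 32 bytes; gcc/clang give vector_size (32)
       types 32-byte alignment, and aligned AVX loads fault on less */
    typedef long long lvec4 __attribute__ ((vector_size (32)));
    typedef double    dvec4 __attribute__ ((vector_size (32)));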
This makes return consistent with load, store, etc, though its
addressing mode is encoded in bits 5 and 6 of c rather than the opcode.
It turns out I had no tests for any of return's addressing modes other
than basic def references, so no tests needed changing.
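A sketch of the decode described above (assuming the bits are numbered
from zero; the helper is hypothetical):

    /* return's addressing mode comes from bits 5-6 of operand c,
       not from the opcode */
    static inline int
    return_mode (unsigned short c)
    {
        return (c >> 5) & 3;
    }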
The parameter defs are allocated from the parameter space using a
minimum alignment of 4, and varargs functions get a va_list struct in
place of the ...
An "args" expression is unconditionally injected into the call arguments
list at the place where ... is in the list, with arguments passed
through ... coming after the ...
Arguments get through to functions now, but there's problems with taking
the address of local variables: currently done using constant pointer
defs, which can't work for the base register addressing used in Ruamoko
progs.
With the update to test-bi's printf (and a hack to qfcc for lea),
triangle.r actually works, printing the expected results (but -1 instead
of 1 for equality, though that too is actually expected). qfcc will take
a bit longer because it seems there are some design issues in address
expressions (ambiguity, and a few other things) that have pretty much
always been there.
PR_SetupParams is new and sets up the parameter pointers so older code
that expects only up to 8 parameters will work with both v6p and Ruamoko
progs without having to check what progs are running. PR_SetupParams is
useful even when Ruamoko progs are expected as it reserves the required
space (respecting alignment) on the stack and returns a pointer to the
top (bottom? confusing) of the stack. PR_PushFrame and PR_PopFrame
need to be used around PR_SetupParams, regardless of using temp strings,
to avoid a stack leak (need to do an audit).
This is part of the work for #26 (Record resource pointer with builtin
function data). Currently, the data pointer gets as far as the
per-instance VM function table (I don't feel like tackling the job of
converting all the builtin functions tonight). All the builtin modules
that register a resources data block pass that block on to
PR_RegisterBuiltins.
The builtin and progs function data is overlaid so the extra data
doesn't cause too much memory to be used (it's actually 8 bytes smaller
now). The plan is to pre-compute the offsets based on the parameter
size and alignment data.
This will make it possible for the engine to set up their parameter
pointers when running Ruamoko progs. At this stage, it doesn't matter
*too* much, except for varargs functions, because no builtin yet takes
anything larger than a float quaternion, but it will be critical when
double or long vec3 and vec4 values are passed.
Just 32-bit rounding to next higher power of two, and base 2 logarithm.
Most importantly, they are suitable for use in initializers as they are
constant in, constant out.
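A hedged sketch of the sort of macros meant (the real names and
definitions may differ): pure constant expressions, so a constant
argument gives a constant result and they can appear in initializers
such as array sizes.

    /* round a 32-bit value up to the next power of two; x must be an
       unsigned expression greater than 0 (powers of two are unchanged) */
    #define POW2_ROUND_UP(x) \
        ((((x) - 1) | (((x) - 1) >> 1) | (((x) - 1) >> 2)   \
          | (((x) - 1) >> 4) | (((x) - 1) >> 8)             \
          | (((x) - 1) >> 16)) + 1)

    /* base-2 logarithm of a 32-bit power of two */
    #define LOG2_POW2(x) \
        ((((x) & 0xffff0000u) ? 16 : 0) + (((x) & 0xff00ff00u) ? 8 : 0)  \
         + (((x) & 0xf0f0f0f0u) ? 4 : 0) + (((x) & 0xccccccccu) ? 2 : 0) \
         + (((x) & 0xaaaaaaaau) ? 1 : 0))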
As even the simplest v6p functions that take parameters but have no
local or temporary variables still have locals for the local copy of the
parameters, this is both a good check for the Ruamoko ISA as its
functions never have locals (everything's on the progs data stack), and
an optimization for v6p functions that have no params or locals (simple
getters (very rare?), most .ctor, etc).
And fix an incorrect definition for RETURN_QUAT.
Prefixed MAX_STACK_DEPTH and LOCALSTACK_SIZE (and LOCALSTACK_SIZE got an
extra _).
The rest is just edits to documentation comments.
ldconst isn't implemented yet but the plan is to load various constants
(eg, 0, 1, 2, pi, e, ...).
Stack adjust is useful for adding an offset to the stack pointer without
having to worry about finding it (and it checks for alignment).
nop is just that :)
Due to how OP_RETURN works, a destination is required for any function
returning data, but the caller may not have allocated any space for the
value. Thus the VM maintains a buffer into which the data can be put and
ignored. It also makes a good place for return values when the engine
calls Ruamoko code as trusting progs code with return sizes seems like a
recipe for disaster, especially if the return location is on the C
stack.
It turned out that address mode B was redundant as C with 0 offset
(immediate) was the same (except for the underlying C code of course,
but adding st->b is very cheap). This allowed B to be used for
entity.field for all transfer operations. Thus instructions 0-3 are now
free as load E became load B, and other than the specifics of format
codes for statement printing, transfers+lea are unified.
This makes the v6p instruction table consistent with the ruamoko
instruction table, and clears up some of the ugliness with the load,
store, and assign instructions (. .= and = are now spelled out). I think
I'd still prefer an enum code (faster) but at least this is more
readable.
long is ignored for double, and v6p progs are stuck with 32 bits for
longs (don't feel like extending v6p any further), but the basics are
there for Ruamoko.
short is ignored for ints because the minimum size is 32, and signed is
just noise for ints anyway (and no chars, so...).
unsigned, however, is finally implemented properly (or at least seems to
be working correctly: tests pass after getting things compiling again,
and lt.u is used where it should be :)
And provide a table for such for qfcc and the like. With this, using
pr_double_t (for example) in C will cause the double value to always be
8-byte aligned and thus structures shared between gcc and qfcc will be
consistent (with a little fuss to take care of the warts).
And other related fields so integer is now int (and uinteger is uint). I
really don't know why I went with integer in the first place, but this
will make using macros easier for dealing with types.
They are both gone, and pr_pointer_t is now pr_ptr_t (pointer may be a
little clearer than ptr, but ptr is consistent with things like intptr,
and keeps the type name short).
This required delaying the setting of the return pointer by call until
after the current pointer had been saved, and thus passing the desired
pointer into PR_CallFunction (which does have some advantages for C
functions calling progs functions, but some dangers too (should ensure a
128 byte (32 word) buffer when calling untrusted code (which is any,
really)).
This fixes the issue of the data stack not being restored properly
because the returning function needs to return a value from its local
variables (stored on the stack) and accessing stack data below the stack
pointer is a bad idea (sure, no interrupts yet, but who knows...).
Call's operand c is used to specify where the return value of the
function is to be stored. This gets both the correct function being
called, and the value being returned correctly. Test still fails due to
the stack restoration issue.
It currently fails for two reasons:
- call's mode selection is incorrect (never updated from when there was
only the one call instruction and the mode was encoded in operand c)
- return should automatically restore the stack pointer to the value it
had on entry to the function, thus allowing local values stored on
the stack to be safely returned.
I don't know why they were ever signed (oversight at id and just
propagated?). Anyway, this resulted in "unsigned" spreading a bit, but
all to reasonable places.
This has been a long-held wishlist item, really, and I thought I might
as well take the opportunity to add the instructions. The double
versions of STATE require both the nextthink field and time global to be
double (but they're not resolved properly yet: marked with
"FIXME double time" comments).
Also, the frame number for double time state is integer rather than
float.
In the end, I decided any/all/none should be separate from the other
horizontal ops, if I even do them (can be implemented by first
converting to bool, then using the appropriate horizontal operation (& |
etc)).
ANY/ALL/NONE have been temporarily removed until I implement the HOPS
(horizontal operations) sub-instructions, which will allow both 32-bit and
64-bit operands and several other operations (eg, horizontal add).
All the fancy addressing modes for the conditional branch instructions
have been permanently removed: I decided the gain was too little for the
cost (24 instructions vs 6). JUMP and CALL retain their addressing
modes, though.
Other instructions have been shuffled around a little to fill most of
the holes in the upper block of 256 instructions: just a single small
7-instruction hole.
Rearrangements in the actual engine are mostly just to keep the code
organized. The only real changes were the various IF statements and
dealing with the resulting changes in their addressing.