Where too far is 1024 units as that is the maximum supported, or the
radius. The change to using unsigned for the distances meant the simple
checks missed the effective max dist going negative, thus leading to a
segfault.
I had debated putting the blending in the compose subpass or a separate
pass but went with the separate pass originally, but it turns out that
removing the separate pass gains 1-3% (5-15/545 fps in a timedemo of
demo1).
It's a bit flaky for particles, especially at higher frame rates, but
that's due to supporting only 64 overlapping pixels. A reasonable
solution is probably switching to a priority heap for the "sort" and
upping the limit.
This required making the texture set accessible to the vertex shader
(instead of using a dedicated palette set), which I don't particularly
like, but I don't feel like dealing with the texture code's hard-coded
use of the texture set. QF style particles need something mostly for the
smoke puffs as they expect a texture.
It doesn't want to work on my nvidia (or more recent sid?) and doesn't
seem to be necessary. The problem may be multiple event sets before the
first wait, but investigation can wait for now.
This is probably the biggest reason I had problems with particles not
updating correctly: the descriptors were generally point pointing to
where the data actually was in the staging buffer.
I don't yet know whether they actually work (not rendering yet), but the
system isn't locking up, and shutdown is clean, so at least resources
are handled correctly.
Although it works as intended (tested via hacking), it's not hooked up
as the current frame buffer handling in r_screen is not readily
compatible with how vulkan output is handled. This will need some
thought to get working.
This splits up render pass creation so that the creation of the various
resources can be tailored to the needs of the actual render pass
sub-system. In addition, it gets window resizing mostly working (just
some problems with incorrect rendering).
It turns out the semaphore used for vkAcquireNextImageKHR may be left in
a signaled state for VK_ERROR_OUT_OF_DATE_KHR. While it seems to be
possible to clear the semaphore using an empty queue submission,
destroying and recreating the semaphore works well.
Still have problems with the frame buffer after window resize, though.
Swap chain acquisition is part of final output handling. However, as the
correct frame buffers are required for the render passes, the
acquisition needs to be performed during the preoutput render pass.
Window resize is still broken, but this is a big step towards fixing it.
This is the minimum maximum count for sampled images, and with layered
shadow maps (with a minimum of 2048 layers supported), that's really way
more than enough.
I guess nvidia gives a non-srgb format as the first in the list, but my
laptop gives an srgb format first, thus the unexpected difference in
rendering brightness. Hard-coding BGRA isn't any better, but it will do
for now.
Things are a bit of a mess with interdependence between sub-module
initialization and render pass initialization, and window resizing is
broken, but the main render pass rendering to an image that is then
post-processed (currently just blitted) is working. This will make it
possible to implement fisheye and water warp (and other effects, of
course).
When working, this will handle the output to the swap-chain images and
any final post-processing effects (gamma correction, screen scaling,
etc). However, currently the screen is just black because the image
for getting the main render pass output isn't hooked up yet.
Now each (high level) render pass can have its own frame buffer. The
current goal is to get the final output render pass to just transfer the
composed output to the swap chain image, potentially with scaling (my
laptop might be able to cope).
While the HUD and status bar don't cut out a lot of screen (normally),
they might start to make a difference when I get transparency working
properly. The main thing is this is a step towards pulling the 2d
rendering into another render pass so the main deferred pass is
world-only.
Using swizzles in an image view allows the same shader to be used with
different image "types" (eg, color vs coverage).
Of course, this needed to abandon QFV_CreateImageView, but that is
likely for the best.
It turns out that nearest filtering doesn't need any offsets to avoid
texel leaks so long as the screen isn't also offset. With this, the 2d
rendering looks good at any scale (minus the inherent blockiness).
It seemed like a good idea at the time, but it exacerbates pixel leakage
in atlas textures that have no border pixels (even in nearest sampling
modes).
The rest of the system won't add one automatically (since entity
creation no longer does), but the alias and iqm rendering code expect
there to be one. Fixes a segfault when starting a scene (demo etc).
There's no API yet as I need to look into the handling of qpic_t before
I can get any of this into the other renderers (or even vulkan, for that
matter).
However, the current design for slice rendering is based on glyphs (ie,
using instances and vertex pulling), with 3 strips of 3 quads, 16 verts,
and 26 indices (2 reset). Hacky testing seems to work, but real tests
need the API.
I don't know why it didn't happen during the demo loop, but going from
the start map to e1m1 caused a segfault due to the efrags for a lava
ball getting double freed (however, I do think it might be because the
ball passed through at least two leafs, while entities in the demos did
not). The double free was because SCR_NewScene (indirectly) freed all
the efrags without removing them from entities, and then the client code
deleting the entities caused the visibility components to get deleted
and thus the efrags freed a second time. Using ECS_RemoveEntities on the
visibility component ensures the entities don't have a visibility
component to remove when they are later deleted.
It's currently used only by the vulkan renderer, as it's the only
renderer that can make good use of it for alias models, but now vulkan
show shirt/pants colors (finally).
This cuts down on the memory requirements for skins by 25%, and
simplifies the shader a bit more, too. While at it, I made alias skins
nominally compatible with bsp textures: layer 0 is color, 1 is emissive,
and 2 is the color map (emissive was on 3).
As the RGB curves for many of the color rows are not linearly related,
my idea of scaling the brightest color in the row just didn't work.
Using a masked palette lookup works much better as it allows any curves.
Also, because the palette is uploaded as a grid and the coordinates are
calculated on the CPU, the system is extendable beyond 8-bit palettes.
This isn't quite complete as the top and bottom colors are still in
separate layers but their indices and masks can fit in just one, but
this requires reworking the texture setup (for another commit).
For whatever reason, I had added an extra 4 bytes to the fragment
shader's push-constants. It took me a while to figure out why renderdoc
wouldn't stop complaining about me not writing enough data.
It turns out my approach to alias skin coloring just doesn't work for
the quake data due to the color curves not having a linear relationship,
especially the bottom colors.
It works on only one layer and one mip, and assumes the provided texture
data is compatible with the image, but does support sub-image updates
(x, y location as parameters, width and height in the texture data).
The bright end of the color map is actually twice the palette value, but
I didn't understand this when I came up with the shirt/pants color
scheme for vulkan. However, the skin texture can store only 0..1, so the
mapping to 0..2 needs to be done in the shader. It looks like it works
at least better: the gold key at the end of demo1 doesn't look as bleh,
though I do get some weird colors still on ogres etc.
Currently only for gl/glsl/vulkan. However, rather than futzing with
con_width and con_height (and trying to guess good values), con_scale
(currently an integer) gives consistent pixel scaling regardless of
window size.
Well, sort of: it's still really in the renderer, but now calling
R_AddEfrags automatically updates the visibility structure as necessary,
and deleting an entity cleans up the efrags automatically. I wanted this
over twenty years ago.
I had forgotten that the cl structs in nq and qw were different layouts,
which resulted in qw's sbar/hud being quite broken. Rather than messing
with the structs, I decided it would be far better in the long run to
clean up sbar's access to the cl struct and the few other nq/qw specific
globals it used. There are still plenty of bugs to fix, but now almost
everything is in the one place.
I'm not sure when things broke on my laptop (I thought I got warp and
fisheye working on my laptop), but it turns out things weren't quite
right, thus warp (and presumably fisheye) weren't working properly due
to GLSL errors that I only just noticed. This fixes water warp (and
probably fisheye).
Much of the nq/qw HUD system is quite broken, but the basic status bar
seems to be working nicely. As is the console (both client and server).
Possibly the biggest benefit is separating the rendering of HUD elements
from the updating of them, and much less traversing of invisible views
whose only purpose is to control the positioning of the visible views.
The view flow tests are currently disabled until I adapt the flow code
to ECS.
There seems to be a problem with view resizing in that some gravities
don't follow resizing correctly.
The resource functions assume the requested layers is correct (really,
the lighting code assumes that the resource functions assume such), but
QFV_CreateImage multiplies the layer count by 6 for cube maps (really,
the issue is in QFV_CreateImage, but I want to move away from it
anyway).
The check for the entity being the view model was checking only the
view model id, which is not sufficient when the view model is invalid by
never being set to other than 0s. A better system for dealing with the
view model is needed.
Another step towards moving all resource creation into the one place.
The motivation for doing the change was getting my test scene to work
with only ambient lights or no lights at all.
It seems this isn't needed any more (not sure why) as both glsl and
vulkan are happy without it. Also unsure why moving to ECS made gl and
sw change behavior regarding rendering the test models in my scene.
While the libraries are probably getting a little out of hand, the
separation into its own directory is probably a good thing as an ECS
should not be tied to scenes. This should make the ECS more generally
useful.
This fixes the segfault due to the world entity not actually existing,
without adding a world entity. It takes advantage of the ECS in that the
edge renderer needs only the world matrix, brush model pointer, and the
animation frame number (which is just 0/1 for brush models), thus the
inherent SOA of ECS helps out, though benchmarking is needed to see if
it made any real difference.
With this, all 4 renderers are working again.
Since entity_t has a pointer to the registry owning the entity, there's
no need to access a global to get at the registry. Also move component
getting closer to where it's used.
It was being set to -1 unconditionally due to forgetting to use id.
However, I decided I didn't like reusing the id var and did some
renaming while I was at it.
This puts the hierarchy (transform) reference, animation, visibility,
renderer, active, and old_origin data in separate components. There are
a few bugs (crashes on grenade explosions in gl/glsl/vulkan, immediately
in sw, reasons known, missing brush models in vulkan).
While quake doesn't really need an ECS, the direction I want to take QF
does, and it does seem to have improved memory bandwidth a little
(uncertain). However, there's a lot more work to go (especially fixing
the above bugs), but this seems to be a good start.
Not that anything is actually rendered yet, but the validation layers
don't like the null render pass. Came up now because ctf1 seems to make
the first light an ambient light.
It didn't really add anything of value as the glyph bitmap rects and the
bearings were never used together, and the rest of the fields were
entirely redundant. A small step towards using a component system for
text.
The inconsistencies in clang's handling of casts was bad enough, and its
silliness with certain extensions, but discovering that it doesn't
support raw strings was just too much. Yes, it gives a 3s boost to qfvis
on gmsp3v2.bsp, but it's not worth the hassle.
While the base of a memory object was aligned when calculating the
memory block size, the top end was not, which could result in the memory
block not getting enough bytes allocated to satisfy alignment
requirements (eg, for flushing).
While fixing that, I noticed the offsets of objects were not being
aligned when binding, so that is fixed as well.
Fixes Mr Fixit on my VersaPro.
With the improved atlas allocation, 2x is no longer needed and 1.2x
seems to be sufficient. Most importantly, it reduced the texture for
amiri-regular.ttf at 72 pix height from 8x to 4x (the staging buffer
isn't big enough for 8k textures).
Currently, only 16 fonts can be loaded (I need to sort out descriptor
set pools), but glyphs are grouped into batches of the same font. While
not quite optimal as it can result in bouncing between descriptor sets a
lot, it's still reasonably efficient.
Line rendering now has its own pipeline (removing the texture issue).
Glyph rendering (for fonts) has been reworked to use instanced quad
rendering, with the geometry (position and texture coords) in a static
buffer (uniform texture buffer), and each instance has a glyph index,
color, and 2d base position.
Multiple fonts can be loaded, but aren't used yet: still just the one
(work needs to be done on the queues to support multiple
textures/fonts).
Quads haven't changed much, but buffer creation and destruction has been
cleaned up to use the resource functions.
Its value on input is ignored. QFV_CreateResource writes the resource
object's offset relative to the beginning of the shared memory block.
Needed for the Draw overhaul.
I got tired of writing the same 13 or so lines of code over and over (it
actually put me off experimenting with Vulkan). Thus...
QFV_PacketCopyBuffer does the work of handling barriers and a (full
packet) copy from the staging buffer to a GPU buffer.
QFV_PacketCopyImage does a similar job, but for images. However, it
still needs a lot of work, but it does make getting a basic texture onto
the GPU much less of a hassle.
Both functions should make staging data much less error-prone.
This moves the qfv_resobj_t image initialization code from the IQM
loader into the resource management code. This will allow me to reuse
the code for setting up glyph data. As a bonus, it cleans up the IQM
code nicely.
UVs being 0 meant that lines were picking up the upper left pixel of
char 0 of conchars. With quake data, this meant a transparent pixel.
Fixes invisible debug lines :P.
It turns out that using the swapchain image for the size requirements is
unreliable: when running under renderdoc, vkGetImageMemoryRequirements
sets the memory requirements fields to 0, leading eventually to a null
memory object being passed to vkMapMemory, which does not end well.
I had missed that vkCmdCopyImage requires the source and destination
images to have exactly the same size, and I guess assumed that the
swapchain images would always be the size they said they were, but this
is not the case for tiled-optimal images. However,
vkCmdCopyImageToBuffer does the right thing regardless of the source
image size.
This fixes the skewed screenshots when the window size is not a multiple
of 8 (for me, might differ for others).
There's a problem with screenshot capture in that the image is sheared
after window resize, but the screen view looks good, and vulkan is happy
with the state changes.
As gbuf_base derives from the base pipeline, it inherits base's dynamic
setting, and thus doesn't need its own. I had a FIXME there as I wasn't
sure why I had a redundant setting, but I really can't see why I'd want
it different from any of the other main renderpass pipelines.
I've found and mostly isolated the parts of the code that will be
affected by window resizing, minus pipelines but they use dynamic
viewport and scissor settings and thus shouldn't be affected so long as
the swapchain format doesn't change (how does that happen?)
Finally, the model_funcs and render_funcs struts use designated
initializers. Not only are they good for ensuring correct
initialization, they're great for the programmer finding the right
initializer.
With the addition of dependencies on freetype and harfbuzz, it became
clear that the renderer plugins need to be explicitly linked against
external dependencies (and that I need to do more installed testing,
rather than just my static local builds). This fixes the unresolved
symbols when attempting to load any of the plugins.
qwaq doesn't supply a backtile pic, so Draw_TileClear in the gl and glsl
renderers would segfault when qwaq's window width changed due to some
back-tile being drawn.
As of a recent nvidia driver update, it became necessary to enable the
feature. I guess older drivers (or vulkan validation layers?) were a bit
slack in their checking (or perhaps I didn't actually get those lighting
changes working at the time despite having committed them).
This did involve changing some field names and a little bit of cleanup,
but I've got a better handle on what's going on (I think I was in one of
those coding trances where I quickly forget how things work).
This makes bsp traversal even more re-entrant (needed for shadows).
Everything needed for a "pass" is taken from bsp_pass_t (or indirectly
through bspctx_t (though that too might need some revising)).
Ambient lights are represented by a point at infinity and a zero
direction vector (spherical lights have a non-zero direction vector but
the cone angle is 360 degrees). This fixes what appeared to be mangled
light renderers (it was actually just an ambient light being treated as
a directional light (point at infinity, but non-zero direction vector).
There are some issues with the light renderers getting mangled, and only
the first light is even touched (just begin and end of render pass), but
this gets a lot of the framework into place.
Sounds odd, but it's part of the problem with calling two different
things with essentially the same name. The "high level" render pass in
question may be a compute pass, or a complex series of (Vulkan) render
passes and so won't create a Vulkan render pass for the "high level"
render pass (I do need to come up with a better name for it).
I really don't remember why I made it separate, though it may have been
to do with r_ent_queue. However, putting it together with the rest is
needed for the "render pass" rework.
It now lives in vulkan_renderpass.c and takes most of its parameters
from plist configs (just the name (which is used to find the config),
output spec, and draw function from C). Even the debug colors and names
are taken from the config.
QFV_CreateRenderPass is no longer used, and QFV_CreateFramebuffer hasn't
been used for a long time. The C file is still there for now but is
basically empty.
The real reason for the delay in implementing support for pNext is I
didn't know how to approach it at the time, but with the experience I've
gained using and modifying vkparse, the solution turned out to be fairly
simple. This allows for the use of various extensions (eg, multiview,
which was used for testing, though none of the hookup is in this
commit). No checking is done on the struct type being valid other than
it must be of a chainable type (ie, have its own pNext).
The software renderer uses Bresenham's line slice algorithm as presented
by Michael Abrash in his Graphics Programming Black Book Special Edition
with the serial numbers filed off (as such, more just so *I* can read
the code easily), along with the Chen-Sutherland line clipping
algorithm. The other renderers were more or less trivial in comparison.
Enabled by 'developer lighting'. It was good for confirming that the
lights in ad_e1m1 (Doom Hangar 16) were actually being output (over 600
of them sometimes, ouch). Turned out to be the color scale ambiguity.
Surfaces marked with SURF_DRAWALPHA but not SURF_DRAWTURB are put in a
separate queue for the water shader and run with a turb scale of 0.
Also, entities with colormod alpha < 1 are marked to go in the same
queue as SURF_DRAWALPHA surfaces (ie, no SURF_DRAWTURB unless the
model's texture indicated such).
A listener is used instead of (really, as well as) ie_app_window events
because systems that need to know about windows sizes may not have
anything to do with input and the event system.
This breaks console scaling for now (con_width and con_height are gone),
but is a major step towards window resize support as console stuff
should never have been in viddef_t in the first place.
The client screen init code now sets up a screen view (actually the
renderer's scr_view) that is passed to the client console so it can know
the size of the screen. The same view is used by the status bar code.
Also, the ram/cache/paused icon drawing is moved into the client screen
update code. A bit of duplication, but I do plan on merging that
eventually.
More tuning is needed on the actual splits as it falls over when the
lower rect gets too low for the subrects being allocated. However, the
scrap allocator itself will prefer exact width/height fits with larger
cutoff over inexact cuts with smaller cutoff. Many thanks to tdb for the
suggestions.
Fixes the fps dropping from ~3700fps down to ~450fps (cumulative due to
loss of POT rounding and very poor splitting layout), with a bonus boost
to about 4900fps (all speeds at 800x450). The 2d sprites were mostly ok,
but the lightmaps forming a capital gamma shape in a 4k texture really
hurt. Now the lightmaps are a nice dense bar at the top of the texture,
and 2d sprites are pretty good (slight improvement coming next).
It handles basic cursor motion respecting \r \n \f and \t (might be a
problem for id chars), wraps at the right edge, and automatically
scrolls when the cursor tries to pass the bottom of the screen.
Clearing the buffer resets its cursor to the upper left.
QFS_LoadFile closes its file argument (this is a design error resulting
from changing QFS_LoadFile to take a file instead of a path and not
completing the update), resulting in the call to Qfilesize accessing
freed memory.
This is intended for the built-in 8x8 bitmap characters and quake's
"conchars", but could potentially be used for any simple (non-composed
characters) mono-spaced font. Currently, the buffers can be created,
destroyed, cleared, scrolled vertically in either direction, and
rendered to the screen in a single blast.
One of the reasons for creating the buffer is to make it so scaling can
be supported in the sw renderer.
While this does pull the grovelling for the subpic out to the callers,
the real problem is the excessive use of qpic_t in the internal code:
qpic_t is really just the image format in wad files, and shouldn't be
used as a generic image handle.
Cleans up more of the icky code in the font drawing functions.
This makes working with quads, implied alpha quads, and lines much
cleaner (and gets rid of the bulk of the "eww" fixme), and will probably
make it easier to support multiple scraps and fonts, and potentially
more flexible ordering between pipelines.
This means that QF should support more exotic fonts without any issue
(once the rest of the text handling system is up to snuff) as HarfBuzz
does all the hard work of handling OpenType, Graphite, etc text shaping,
including kerning (when enabled).
Also, font loading now loads all the glyphs into the atlas (preload is
gone).
While the results are a little surprising (tends to alternate between
left side and top for allocations), there is much less wasted space in
the partially allocated regions, and the main free region seems to
always be quite big.
It is currently an ugly hack for dealing with the separate quad queue,
and the pipeline handling code needs a lot of cleanup, but it works
quite well, though I do plan on moving to HarfBuzz for text shaping. One
nice development is I got updating of descriptor sets working (just need
to ensure the set is no longer in use by the command queue, which the
multiple frames in flight makes easy).
It's implemented only in the Vulkan renderer, partly because there's a
lot of experimenting going on with it, but the glyphs do get transferred
to the GPU (checked in render doc). No rendering is done yet: still
thinking about whether to do a quick-and-dirty test, or to add HarfBuzz
immediately, and the design surrounding that.
R8G8B8A8 was hard-coded by accident when creating Vulkan_LoadTexArray
(or probably even the original Vulkan_LoadTex). This wasn't a problem
while everything was loaded in that format, but attempting to load an R8
texture didn't go so well. The same format as the image itself is used
now (correctly so).
I have recently learned that pre-multiplied alpha is the correct way to
do compositing, which is pretty much what the 2d pass does (actually,
all passes, but...). This required ensuring the color factor passed to
the fragment shader is pre-multiplied (a little silly for cshifts as
they used to be pre-multiplied but were un-pre-multiplied early in QF's
history and I don't feel like fixing that right now as it affects all
renderers), and also pre-multiplying alpha when converting from 8-bit
palette to rgba as the palette entry for transparent has that funky pink
(which is used in full-brights).
I will need to do more work to improve the 2d allocation, but rounding
up the requested sizes to the next power of two proved to be excessively
wasteful: I was able to allocate spots for only half of the sub-pics I
needed (though I did still need to double the number of pixels in the
end).
The software renderer uses Bresenham's line slice algorithm as presented
by Michael Abrash in his Graphics Programming Black Book Special Edition
with the serial numbers filed off (as such, more just so *I* can read
the code easily), along with the Chen-Sutherland line clipping
algorithm. The other renderers were more or less trivial in comparison.
Most were pretty easy and fairly logical, but gib's regex was a bit of a
pain until I figured out the real problem was the conditional
assignments.
However, libs/gamecode/test/test-conv4 fails when optimizing due to gcc
using vcvttps2dq (which is nice, actually) for vector forms, but not the
single equivalent other times. I haven't decided what to do with the
test (I might abandon it as it does seem to be UD).
This does mean that the gl and sw renderers can no longer call
S_ExtraUpdate, but really, they shouldn't be anyway. And I seem to
remember it not really helping (been way too long since quake ran that
slowly for me).
The texture animation data is compacted into a small struct for each
texture, resulting in much less data access when animating the texture.
More importantly, no looping over the list of frames. I plan on
migrating this to at least the other hardware renderers.
I found a test map with no texture data. Even after fixing the bsp
loader, vulkan didn't like it. Now vulkan is happy. The "Missing" text
is full-bright magenta on a dim grey background so it should be visible
in any lighting conditions.
Conflagrant Rodent has a sub-model with 0 faces (double bit error?)
causing simply counting faces to get out of sync with actual model
starts thus breaking *all* brush models that come after it (including
other maps). Thus be a little less lazy in figuring out model start
faces.
The models are broken up into N sub-(sub-)models, one for each texture,
but all faces using the same texture are drawn as an instance, making
for both reduced draw calls and reduced index buffer use (and thus,
hopefully, reduced bandwidth). While texture animations are broken, this
does mark a significant milestone towards implementing shadows as it
should now be possible to use multiple threads (with multiple index and
entid buffers) to render the depth buffers for all the lights.
This allows the use of an entity id to index into the entity data and
fetch the transform and colormod data in the vertex shader, thus making
instanced rendering possible. Non-world brush entities are still not
rendered, but the world entity is using both the entity data buffer and
entid buffer.
Sub-models and instance models need an instance data buffer, but this
gets the basics working (and the proof of concept). Using arrays like
this actually simplified a lot of the code, and will make it easy to get
transparency without turbulence (just another queue).
The gl water warp ones have been useless since very early on due to not
doing water warp in gl (vertex warping just didn't work well), and the
recent water warp implementation doesn't need those hacks. The rest of
the removed flags just aren't needed for anything. SURF_DRAWNOALPHA
might get renamed, but should be useful for translucent bsp surfaces
(eg, vines in ad_tears).
One more step towards BSP thread-safety. This one brought with it a very
noticeable speed boost (ie, not lost in the noise) thanks to the face
visframes being in tightly packed groups instead of 128 bytes apart,
though the sw render's boost is lost in the noise (but it's very
fill-rate limited).
This is next critical step to making BSP rendering thread-safe.
visframe was replaced with cluster (not used yet) in anticipation of BSP
cluster reconstruction (which will be necessary for dealing with large
maps like ad_tears).
The main goal was to get visframe out of mnode_t to make it thread-safe
(each thread can have its own visframe array), but moving the plane info
into mnode_t made for better data access patters when traversing the bsp
tree as the plane is right there with the child indices. Nicely, the
size of mnode_t is the same as before (64 bytes due to alignment), with
4 bytes wasted.
Performance-wise, there seems to be very little difference. Maybe
slightly slower.
The unfortunate thing about the change is the plane distance is negated,
possibly leading to some confusion, particularly since the box and
sphere culling functions were affected. However, this is so point-plane
distance calculations can be done with a single 4d dot product.
The map uses 41% of a 4k light map scrap, and 512 texture descriptors
wasn't enough for vulkan. Ouch. I do need to get cvars on these things,
but this will do for now (decades later...)
This was one of the biggest reasons I had trouble understanding the bsp
display list code, but it turns out it was for dealing with GLES's
16-bit limit on vertex indices. Since vulkan uses 32-bit indices,
there's no need for the extra layer of indirection. I'm pretty sure it
was that lack of understanding that prevented me from removing it when I
first converted the glsl bsp code to vulkan (ie, that 16-bit indices
were the only reason for elements_t).
It's hard to tell whether the change makes much difference to
performance, though it seems it might (noisy stats even over 50 timedemo
loops) and the better data localization indicate it should at least be
just as good if not better. However, the reason for the change is
simplifying the data structures so I can make bsp rendering thread-safe
in preparation for rendering shadow maps.
And maybe a nano-optimization. Switching from (~side + 1) to (-side)
seems to give glsl a very tiny speed boost, but certainly doesn't hurt.
Looking at some assembly output for the three cases, the two hacks seem
to generate the same code as each other, but 3 instructions vs 6 for ?:.
While ?: is more generically robust, the hacks are tuned for the
knowledge side is either 0 or 1. The later xor might alter things, but
at least I now know that the hack (either version) is worthwhile.
With experience, I have found that trying to continue after a validation
error tends to result in a segfault or some other nastiness, and
Sys_Shutdown (and the full shutdown sequence) is triggered for any error
signal (segfault, abort, etc) so just exit(1).
Some very much needed comments :P Still, nicely, I now have a much
better understanding of how the display lists are created (10 years
is a long time to remember how intricate code works (I do remember
fighting to get it working back then))
This makes it much easier to see just what is being done to build a
polygon to be passed to the GPU, and it served as a test for the
lightmap st changes since Vulkan currently never used them.
Many modders use negative lights for interesting effects, but vulkan
doesn't like the result of a negative int treated as unsigned when it
comes to texture sizes.
However, this time it doesn't modify the light array when it sorts the
lights by size since the lights are now located before the renderer gets
to see them, and having the fix up the light leafs array would be too
painful (and probably the completely wrong thing to do anyway: the light
array should be treated as constant by the renderer). 1.6GB of memory
for gmsp3v2's lights (a little better than marcher: more smaller lights?).
For reference:
gmsp3v2: shadow maps: 8330 layers in 29 images: 1647706112
marcher: shadow maps: 2440 layers in 11 images: 2358575104
For now, at least (I have some ideas to possibly reduce the numbers and
also to avoid the need for actual limits). I've seen gmsp3v2 use over
500 lights at once (it has over 1300), and I spent too long figuring out
that weird light behavior was due to the limit being hit and lights
getting dropped (and even longer figuring out that more weird behavior
was due to the lack of shadows and the world being too bright in the
first place).
Since the staging buffer allocates the command buffers it uses, it
needs to free them when it is freed. I think I was confused by the
validation layers not complaining about unfreed buffers when shutting
down, but that's because destroying the pool (during program shutdown,
when the validation layers would complain) frees all the buffers. Thus,
due to staging buffers being created and destroyed during the level load
process, (rather large) command buffers were piling up like imps in a
Doom level.
In the process, it was necessary to rearrange some of the shutdown code
because vulkan_vid_render_shutdown destroys the shared command pool, but
the pool is required for freeing the command buffers, but there was a
minor mess of long-lived staging buffers being freed afterwards. That
didn't end particularly well.
This is meant for a "permanent" tear-down before freeing the memory
holding the VM state or at program shutdown. As a consequence, builtin
sub-systems registering resources are now required to pass a "destroy"
function pointer that will be called just before the memory holding
those resources is freed by the VM resource manager (ie, the manager
owns the resource memory block, but each subsystem is responsible for
cleaning up any resources held within that block).
This even enhances thread-safety in rua_obj (there are some problems
with cmd, cvar, and gib).
This gives a rather significant speed boost to timedemo demo1: from
about 2300-2360fps up to 2520-2600fps, at least when using
multi-texture.
Since it was necessary for testing the scrap, gl got the ability to set
the console background texture, too.
While it takes one extra step to grab the marksurface pointer,
R_MarkLeaves and R_MarkLights (the two actual users) seem to be either
the same speed or fractionally faster (by a few microseconds). I imagine
the loss gone to the extra fetch is made up for by better bandwidth
while traversing the leafs array (mleaf_t now fits in a single cache
line, so leafs are cache-aligned since hunk allocations are aligned).
Unfortunately, the animations are pre-baked (by the loader) blocking
run-time determined animations (IK etc). However, this at least gets
everything working so the basics can be verified (the shader posed some
issue resulting in horror movies ;).
Brush models looked a little too tricky due to the very different style
of command queue, so that's left for now, but alias, iqm and sprite
entities are now labeled. The labels are made up of the lower 5 hex
digits of the entity address, the position, and colored by the
normalized position vector. Not sure that's the best choice as it does
mean the color changes as the entity moves, and can be quite subtle
between nearby entities, but it still helps identify the entities in the
command buffer.
And, as I suspected, I've got multiple draw calls for the one ogre. Now
to find out why.
The bones aren't animated yet (and I realized I made the mistake of
thinking the bone buffer was per-model when it's really per-instance (I
think this mistake is in the rest of QF, too)), skin rendering is a
mess, need to default vertex attributes that aren't in the model...
Still, it's quite satisfying seeing Mr Fixit on screen again :)
I wound up moving the pipeline spec in with the rest of the pipelines as
the system isn't really ready for separating them.
The plists can now be accessed by name and the forward render pass
config is available (but not used, or tested beyond syntax). I was going
to have the IQM pipeline spec separate but ran into limitations in the
system (which needs a lot of polish, really).
That @inherit is pretty useful :) This makes it much easier to see how
different pipelines differ or how they are the similar. It also makes it
much clearer which sub-pass they're for.
I was wondering why scaled-down quake-guy was dimmer than full-size
quake-guy. And the per-fragment normalization gives the illusion of
smoothness if you don't look at his legs (and even then...).
I'm not sure what's up with the weird lighting that results from dynamic
lights being directional (sunlight works nicely in marcher, but it has a
unit vector for position).
The parsing of light data from maps is now in the client library, and
basic light management is in scene. Putting the light loading code into
the Vulkan renderer was a mistake I've wanted to correct for a while.
The client code still needs a bit of cleanup, but the basics are working
nicely.
This replaces *_NewMap with *_NewScene and adds SCR_NewScene to handle
loading a new map (for quake) in the renderer, and will eventually be
how any new scene is loaded.
This leaves only the one conditional in the shader code, that being the
distance check. It doesn't seem to make any noticeable difference to
performance, but other than explosion sprites being blue, lighting
quality seems to have improved. However, I really need to get shadows
working: marcher is just silly-bright without them, and light levels
changing as I move around is a bit disconcerting (but reasonable as
those lights' leaf nodes go in and out of visibility).
Id Software had pretty much nothing to do with the vulkan renderer (they
still get credit for code that's heavily based on the original quake
code, of course).
It's not used yet, and thus may have some incorrect settings, but I
decided that I will probably want it at some stage for qwaq. It's
essentially was was in the original spec, but updated for some of the
niceties added to parsing since I removed it back then. It's also in its
own file.
Despite the base IQM specification not supporting blend-shapes, I think
IQM will become the basis for QF's generic model representation (at
least for the more advanced renderers). After my experience with .mu
models (KSP) and unity mesh objects (both normal and skinned), and
reviewing the IQM spec, it looks like with the addition of support for
blend-shapes, IQM is actually pretty good.
This is just the preliminary work to get standard IQM models loading in
vulkan (seems to work, along with unloading), and they very basics into
the renderer (most likely not working: not tested yet). The rest of the
renderer seems to be unaffected, though, which is good.
The resource subsystem creates buffers, images, buffer views and image
views in a single batch operation, using a single memory object to back
all the buffers and images. I had been doing this by hand for a while,
but got tired of jumping through all those vulkan hoops. While it's
still a little tedious to set up the arrays for QFV_CreateResource (and
they need to be kept around for QFV_DestroyResource), it really eases
calculation of memory object size and sub-resource offsets. And
destroying all the objects is just one call to QFV_DestroyResource.
Vulkan doesn't appreciate the empty buffers that result from the model
not having any textures or surfaces that can be rendered (rightfully so,
for such a bare-metal api).
I doubt the calls were ever actually made in a normal map due to the
node actually being a node when breaking out of the loop, but when I
experimented with an empty world model (no nodes, one infinite empty
leaf) I found that visit_leaf was getting called twice instead of once.
Since it is updated every frame, it needs to be as fast as possible for
the cpu code. This seems to make a difference of about 10us (~130 ->
~120) when testing in marcher. Not a huge change, but the timing
calculation was wrapped around the entire base world pass, so there was
a fair bit of overhead from bsp traversal etc.
It makes a significant difference to level load times (approximately
halves them for demo1 and demo2). Nicely, it turns out I had implemented
the rest of the staging buffer code (in particular, flushing) correctly
in that it seems there's no corruption any of the data.
The misinterpretations were due to either the cvar not being accessed
directly by the engine, but via only the callback, or the cvars were
accesssed only by progs (in which case, they should be float). The
remainder are a potential enum (hud gravity) and a "too hard basket"
(rcon password: need to figure out how I want to handle secret strings).
This is an extremely extensive patch as it hits every cvar, and every
usage of the cvars. Cvars no longer store the value they control,
instead, they use a cexpr value object to reference the value and
specify the value's type (currently, a null type is used for strings).
Non-string cvars are passed through cexpr, allowing expressions in the
cvars' settings. Also, cvars have returned to an enhanced version of the
original (id quake) registration scheme.
As a minor benefit, relevant code having direct access to the
cvar-controlled variables is probably a slight optimization as it
removed a pointer dereference, and the variables can be located for data
locality.
The static cvar descriptors are made private as an additional safety
layer, though there's nothing stopping external modification via
Cvar_FindVar (which is needed for adding listeners).
While not used yet (partly due to working out the design), cvars can
have a validation function.
Registering a cvar allows a primary listener (and its data) to be
specified: it will always be called first when the cvar is modified. The
combination of proper listeners and direct access to the controlled
variable greatly simplifies the more complex cvar interactions as much
less null checking is required, and there's no need for one cvar's
callback to call another's.
nq-x11 is known to work at least well enough for the demos. More testing
will come.
This allows for easy (and safe) printing of cexpr values where the type
supports it. Types that don't support printing would be due to being too
complex or possibly write-only (eg, password strings, when strings are
supported directly).
Surprisingly, only two, but they were caught by the different value
fields being used, thus the cvar was checked in multiple places. I
imagine that's not really all that common, so there may be some
inconsistencies between default value and use.
This allows a single render pass description to be used for both
on-screen and off-screen targets. While Vulkan does allow a VkRenderPass
to be used with any compatible frame buffer, and vkparse caches a
VkRenderPass created from the same description, this allows the same
description to be used for a compatible off-screen target without any
dependence on the swapchain. However, there is a problem in the caching
when it comes to targeting outputs with different formats.
As I had suspected, it's due to a synchronization problem between the
scrap and drawing. There's actually a double problem in that data
uploaded to the scrap isn't flushed until the first frame is rendered
causing a quick init-shutdown sequence to take at least five seconds due
to the staging buffer waiting (and timing out) on a stuck fence.
Rendering just one frame "fixes" the problem (draw was one of the
earliest subsystems to get going in vulkan).
Surprisingly, only two, but they were caught by the different value
fields being used, thus the cvar was checked in multiple places. I
imagine that's not really all that common, so there may be some
inconsistencies between default value and use.
Since it is updated every frame, it needs to be as fast as possible for
the cpu code. This seems to make a difference of about 10us (~130 ->
~120) when testing in marcher. Not a huge change, but the timing
calculation was wrapped around the entire base world pass, so there was
a fair bit of overhead from bsp traversal etc.
Really, this won't make all that much difference because alias models
with more than one skin are quite rare, and those with animated skin
groups are even rarer. However, for those models that do have more than
one skin, it will allow for reduced allocation overheads, and when
supported (glsl, vulkan, maybe gl), loading all the skins into an array
texture (since all skins are the same size, though external skins may
vary), but that's not implemented yet, this just wraps the old one skin
at a time code.
While looking at the deferred attachment images with using a template in
mind, I noticed that the opaque attachment was using 8-bit color. The
problem is, it's meant to be HDRI with the compose pass crunching it
down to LDRI. Switching to 16-bit float does seem to have made a subtle
difference (hey, it's still quake data, not much HDRI in there).
That certainly makes it nicer to work with large sets, and shows one way
to be careful with allocated resources: don't allocate them in the
inherited data and use a template that needs a few things filled in to
be valid. Also, it seems that overriding values in sub-structures "just
works" :)
It simply parses the referenced plist dictionary (via @inherit =
plist.path;) into the current data block, then allows the data to be
overwritten by the current plist dictionary. This may be a bit iffy for
any allocated resources, so some care must be taken, but it seems to
work nicely.
This allows a single render pass description to be used for both
on-screen and off-screen targets. While Vulkan does allow a VkRenderPass
to be used with any compatible frame buffer, and vkparse caches a
VkRenderPass created from the same description, this allows the same
description to be used for a compatible off-screen target without any
dependence on the swapchain. However, there is a problem in the caching
when it comes to targeting outputs with different formats.
This makes much more sense as they are intimately tied to the frame
buffer on which a render pass is working. Now, just the window width
and height are stored in vulkan_ctx_t. As a side benefit,
QFV_CreateSwapchain no long references viddef (now just palette and
conview in vulkan_draw.c to go).
While I have trouble imagining it making that much performance
difference going from 4 verts to 3 for a whopping 2 polygons, or even
from 2 triangles to 1 for each poly, using only indices for the vertices
does remove a lot of code, and better yet, some memory and buffer
allocations... always a good thing.
That said, I guess freeing up a GPU thread for something else could make
a difference.
I think I had gotten lucky with captures not being corrupt due to them
being much bigger than all but the L3 cache (and then they're over 1/2
the size), so the memory was being automatically invalidated by other
activity. Don't want to trust such luck, though.
It makes a significant difference to level load times (approximately
halves them for demo1 and demo2). Nicely, it turns out I had implemented
the rest of the staging buffer code (in particular, flushing) correctly
in that it seems there's no corruption any of the data.
This means that a tex_t object is passed in instead of just raw bytes
and width and height, but it means the texture can specify whether it's
flipped or uses BGR instead of RGB. This fixes the upside down
screenshots for vulkan.
This fixes (*ahem*) the vulkan renderer segfaulting when attempting to
take a screenshot. However, the image is upside down. Also, remote
snapshots and demo capture are broken for the moment.
QFS_NextFilename was renamed to QFS_NextFile to reflect the fact it now
returns a QFile pointer for the newly created file (as well as the
name). This necessitated updating WritePNG to take a file pointer
instead of a file name, with the advantage that WritePNGqfs is no longer
necessary and callers have much more control over the creation of the
file.
This makes QFS_NextFile much more secure against file system race
conditions and attacks (at least in theory). If nothing else, it will
make it more robust in a multi-threaded environment.
It's not there yet as it promptly closes the file and returns only the
filename (and then only the portion within the user's directory tree).
However, this worked nicely as a test for Sys_UniqueFile.
It seems clang defaults to unsigned for enums. Interestingly, gcc was ok
with the checks being either way. I guess gcc treats enums that *can* be
unsigned as DWIM.
Still work with gcc, of course, and I still need to fix them properly,
but now they're actually slightly easier to find as they all have vec_t
and FIXME on the same line.
Viewport and FOV updates are now separate so updating one doesn't cause
recalculations of the other. Also, perspective setup is now done
directly from the tangents of the half angles for fov_x and fov_y making
the renderers independent of fov/aspect mode. I imagine things are a bit
of a mess with view size changes, and especially screen size changes
(not supported yet anyway), and vulkan winds up updating its projection
matrices every frame, but everything that's expected to work does
(vulkan errors out for fisheye or warp due to frame buffer creation not
being supported yet).
If the entity didn't have a known model type, R_StoreEfrags would get
stuck in an infinite loop (fortunately, never actually happened. The
result of making it not call Sys_Error for unknown models)).
I meant to do this a while ago but forgot about it. Things are a bit of
a mess in that the renderer knows too much about entities, but
eventually the renderer will know about only things to render (meshes,
particles, etc).
The quake-specific enums are now in the client header, and the particle
system now has a gravity field rather than getting it from
vid_render_data (which I hope to eventually get rid of entirely).
r_refdef is really meant for holding the various screen "constants" for
the software renderer rather than the more generic scene stuff. All the
fields referenced by the low level rendering code (especially assembly)
have been moved to the beginning of the struct (and nicely fit within 64
bytes). The other fields should be moved elsewhere, but not this commit.
On top of that, R_ViewChanged is much easier to read, and there are
fewer static globals.
Now GL perspective matrix setup matches that of GLSL and Vulkan, and
GL's z_up matrix matches GLSL's (as it should, since they're really
going through the same API). GL also needs the depth adjustmet matrix
now. Other than having to google the docs for glFrustum, there's nothing
wrong with the function itself, but it's nice to have direct control
over the matrices.
In the process, I discovered how horribly confused I've been at times
with respect to the handedness of GL and Quake: GL is right-handed
(y-up, z-out, x-right), as is Quake itself (but z-up, y-left, x-in), but
as the perspective matrix used in the three renderers expects z-in,
having x-right and y-up makes the matrix effectively left-handed (not
for Vulkan though, because there it's y-down, x-right, z-up, so
right-handed again).
Of course, it's not as correct as glsl or sw due to using polygons and
uvs rather than a fragment shader (not that such is out of the question
since GL 3.0 is requested, but I don't feel like getting shaders going
just for a couple of post-processing effects in an obsolete renderer).
While it's not where I want it to be, it at least now no longer messes
with frame buffer binding or the view ports. This involved switching
around buffers in D_WarpScreen so that the main buffer could be bound
before post-processing.
The cvar setup for particles is a bit wonky in that the arrays get
initialized using the default max particle count but never updated.
Though things could be improved some more, this solution works (and has
been more or less copied to gl, but I couldn't reproduce the crash
there, or even the valgrind error).
The code dealing with state is a bit of a mess, but everything is
working nicely. Get around 400fps when all 6 faces need to be rendered
(no surprise: it should be about 1/6 of that for normal rendering). The
messy state handling code did not come as a surprise as I suspected
there were various mistakes in my scene rendering "recipe", and fisheye
highlighted them nicely (I'm sure getting this stuff working in Vulkan
will highlight even more issues).
Finally, after a decade :P Looks pretty good, too, and is (almost)
properly scaled to the resolution (almost because the effect is a little
squashed, but I think the sw renderer does the same).
The GLSL compiler requires any #version lines to be the first (real)
line of the program, even #line causes an error, so if the first line of
the chunk starts with #version, insert the #line directive as the second
line.
Again, gl/vulkan not working yet (on the assumption that sw would be
trickier).
Fisheye overrides water warp because updating the projection map every
frame is far too expensive.
I've added a post-process pass to the interface in order to hide the
implementation details, but I'm not sure I'm happy about how the
multi-pass rendering for cube maps is handled (or having the frame
buffers as exposed as they are), but mainly because Vulkan will make
implementation interesting.
For now, OpenGL and Vulkan renderers are broken as I focused on getting
the software renderer working (which was quite tricky to get right).
This fixes a couple of issues: the segfault when warping the screen (due
to the scene rendering move invalidating the warp buffer), and warp
always having 320x200 resolution. There's still the problem of the
effect being too subtle at high resolution, but that's just a matter of
updating the tables and tweaking the code in D_WarpScreen.
Another issue is the Draw functions should probably write directly to
the main frame buffer or even one passed in as a parameter. This would
remove the need for binding the main buffer at the beginning and end of
the frame.
This used to be handled by R_RenderView (encompassing all of the
rendering) before the scene rendering was moved out to r_screen. This
fixes the stuck time in 32-bit nq-win.
Its guts have been moved to D_Init temporarily while I work on the
frame buffer design. This is actually a big part of that work as it
moves most of the frame buffer creation into the one place, making it
easier to ensure I get all the sub-buffers and caches created.
With what I have planned for frame buffers etc, GL 3.0 will be needed
even for the fixed-function GL renderer, and then I might even take the
GLSL renderer to 4.6 (dunno yet). This means that wgl will need to be
updated too, and I've found the info I need for that, but it's a bit
much to take on just yet.
I think the widespread use of recalc_refdef (and force_fullscreen) was
the result of a rushed merge of the renderer and video code (I do seem
to remember sprinkling them around). This cleans the two out of the
client code.
This avoids the possibility of a singularity (and thus the temptation to
use Sys_Error). While the rendering is rubbish, 0 degrees is allowed
because values less than 1 should be allowed, but where does one stop?
170 is the maximum in order to avoid any issues with (near) parallel or
inverted frustum planes (or other fun things) in the low level code.
Other than the view model (undecided on the approach) this has
R_RenderView pretty much pulled out of the low level renderers. With
this, I'll be able to focus on scene handling for a bit then getting
shadows and fisheye working (again for fisheye).
r_screen isn't really the right place, but it gets the scene rendering
out of the low-level renderers and will make it easier to sort out
later, and hopefully easier to figure out a good design for vulkan.
gl_overbright_f shouldn't need to run through any entity queues to
update the light maps as only the world model has light maps, and
hitting the world model should hit all its sub-models.
The change to using separate per-model-type entity queues resulted in
the lighting vector used for alias and iqm models being in an ephemeral
location (in the shared setup_lighting function's stack frame). This
resulted in the model rendering code getting a garbage vector due to it
being overwritten by another stack frame. What I don't get is why the
garbage varied from run to run for the same demo (demo2, the first scrag
behind the start door showed the bad lighting nicely), which made
tracking down the offending commit (and thus the code) rather
troublesome, though once I found it, it was a bit of a face-palm moment.
Move r_pcurrentvertbase into the sw renderer, cleaning up gl's use of
(not really needed there). Not ready to move r_bsp into the main bin yet
as there are linking issues since only the low-level code references any
of its symbols.
The code is really part of scene (not a typo wrt r_screen: that is
misnamed as such, or at least SCR_UpdateScreen needs to be split into
screen (2d overlay, really) and scene updates).
This breaks fisheye rendering as the fisheye code calls the actual scene
render code multiple times, but the fisheye code is called by said scene
render code via a diversion. The fisheye needs to be moved out to the
high level scene render, but that will takes some extra work for frame
buffer setup.
The two aren't compatible (but warping might be doable in the fisheye
code). The whole frame setup code needs a rework, and really, even the
buffer handling.
It being on the stack was a bad idea as R_RenderWorld returns before the
scans are rendered and thus the entity pointer winds up pointing to
abandoned stack space.
While the scheme of using our own allocated did work just fine, fisheye
rendering uses glGenTextures which caused a texture id clash and thus
invalid operations (the cube map texture happened to be the same as the
console background texture). Sure, I could have just "fixed" the fisheye
init code, but this brings gl closer in line with glsl (which makes
extensive use of glGenTextures and glDeleteTextures). This doesn't fix
any texture leaks gl has (plenty, I imagine), but it's a step in the
right direction.
Only for gl and sw at the moment (want to merge things further before I
do anything for glsl or vulkan). However, with with I've learned getting
gl and sw to work, glsl and vulkan will be trivial.
R_RecursiveLightUpdate has been obsolete for a very long time, and
R_Mirror is just wrong (needs envmaps etc, wonder if it can be done in
the fixed function code using skyclip?)
Finally. I never liked it (felt bad adding it in the first place), and
it has caused confusion with function and global variable names, but it
did let me get the render plugins working.
This moves the common camera setup code out of the individual drivers,
and completely removes vup/vright/vpn from the non-software renderers.
This has highlighted the craziness around AngleVectors with it putting
+X forward, -Y right and +Z up. The main issue with this is it requires
a 90 degree pre-rotation about the Z axis to get the camera pointing in
the right direction, and that's for the native sw renderer (vulkan needs
a 90 degree pre-rotation about X, and gl and glsl need to invert an
axis, too), though at least it's just a matrix swizzle and vector
negation. However, it does mean the camera matrices can't be used
directly.
Also rename vpn to vfwd (still abbreviated, but fwd is much clearer in
meaning (to me, at least) than pn (plane normal, I guess, but which
way?)).
I'd been considering it for a while, but in the end, all the issues it
presented made me decide it wasn't worth merging and was never really
worth keeping: it was a neat proof of concept but of little actual use,
especially now everyone either has an OK GPU or would want to stick to
8-bit rendering anyway (sorry L-Havoc).
However, both it and my merge work are preserved in git history :)
16 and 32 bit rendering are disabled at the moment because there's a
weird segfault I need to fix, but the 8-bit dynamic lights are doing
weird things (for x11, too) when updating the light maps.
I got tired of having to maintain two separate software renderers, but
didn't want to just nuke sw32, so its core changes are merged into sw.
Alias model rendering is broken, but I know exactly what's wrong and how
to fix it, just need to take care due to asm.