Now they're in a much more consistent arrangement, in particular with
the comparison opcodes if the conditional branch instructions are
considered to be fast comparisons with zero (ifnot -> ifeq, if -> ifne,
etc). Unconditional jump and call fill in the gaps. The goal was to get
them all in an arrangement that would work as a small enum for qfcc: it
can use the enum directly for the ruamoko IS, and a small map array for
v6p (except for call).
Both pr_type_size and pr_type_name. I want to macroize the enum, but
need to sort out the clutter of headers first, just need to decide on
naming. This at least sorts out the missed values for now.
The bug (alignment issues with AVX on windows) seems to have in gcc from
the 4.x days, and is still present in 11.2: it does not ensure stack
parameters that need 32 byte alignment are aligned. Telling gcc to use
the sysv abi (safe on a static function) lets gcc do what it does for
linux (usually pass the parameters in registers, which it seems to have
done).
And partial implementations in qfcc (most places will generate an
internal error (not implemented) or segfault, but some low-hanging fruit
has already been implemented).
As I expect to be tweaking things for a while, it's part of the build
process. This will make it a lot easier to adjust mnemonics and argument
formats (tweaking the old table was a pain when conventions changed).
It's not quite done as it still needs arg widths and types.
While working on the new opcode table, I decided a lot of the names were
not to my liking. Part of the problem was the earlier clash with the
v6p opcode names, but that has been resolved via the v6p tag.
Use the new "1" versions of loadvec3 to get a 1 in w to avoid
divide-by-zero errors, and use the correct type for longs (forgot to
change i to l on the vector types).
It turned out I had no way of using a pointer or field as the value to
load, so all 4 modes are duplicated with loads from where operand b
points, but the loaded value interpreted the same way. Also, fixed an
error in the calculation of op-b offsets.
Statements can be bounds checked in the one place (jump calculation),
but memory accesses cannot as they can be used in lea instructions which
should never cause an exception (unless one of lea's operands is OOB).
* / % %% + -
As a bonus, includes partial tests for a few extra operators. Several
things are broken at this stage, but uncommitted code is already
working.
Float bit-ops as well.
Also, add q*v4 and v4*q instructions. There are currently 48 free
opcodes, and I might remove the scale instructions, but they could be
useful as expanding a single float to a vector would take 3 instructions
(copy to temp, swizzle-expand temp, multiply, vs just scale).
The swizzle instruction is very powerful in that in can do any of the
256 permutations of xyzw, optionally negate any combination of the
resulting components, and zero any combination of the result components
(even all). This means the one instruction can take care of any actual
swizzles, conjugation for complex and quaternion values, zeroing vectors
(not that it's the only way), and probably other weird things.
The python file was used to generate the jump table and actual swizzle
code.
They even found a bug in the addressing mode functions :) (I'd forgotten
that I wanted signed offsets from the pointer and thus forgot to cast
st->b to short in order to get the sign extension)
This allows the VM to select the right execution loop and qfcc currently
still produces only the old IS (it doesn't know how to deal with the new
IS yet)
When it's finalized (most of the conversion operations will go, probably
the float bit ops, maybe (very undecided) the 3-component vector ops,
and likely the CALLN ops), this will be the actual instruction set for
Ruamoko.
Main features:
- Significant reduction in redundant instructions: no more multiple
opcodes to move the one operand size.
- load, store, push, and pop share unified addressing mode encoding
(with the exception of mode 0 for load as that is redundant with mode
0 for store, thus load mode 0 gives quick access to entity.field).
- Full support for both 32 and 64 bit signed integer, unsigned integer,
and floating point values.
- SIMD for 1, 2, (currently) 3, and 4 components. Transfers support up
to 128-bit wide operations (need two operations to transfer a full
4-component double/long vector), but all math operations support both
128-bit (32-bit components) and 256-bit (64-bit components) vectors.
- "Interpreted" operations for the various vector sizes: complex dot
and multiplication, 3d vector dot and cross product, quaternion dot
and multiplication, along with qv and vq shortcuts.
- 4-component swizzles for both sizes (not yet implemented, but the
instructions are allocated), with the option to zero or negate (thus
conjugates for complex and quaternion values) individual components.
- "Based offsets": all relevant instructions include base register
indices for all three operands allowing for direct access to any of
four areas (eg, current entity, current stack frame, Objective-QC
self, ...) instructions to set a register and push/pop the four
registers to/from the stack.
Remaining work:
- Implement swizzle operations and a few other stragglers.
= Make a decision about conversion operations (if any instructions
remain, they'll be just single-component (at 14 meaningful pairs,
that's a lot of instructions to waste on SIMD versions).
- Decide whether to keep CALL1-CALL8: probably little point in
supporting two different calling conventions, and it would free up
another eight instructions.
- Unit tests for the instructions.
- Teach qfcc to generate code for the new instruction set (hah, biggest
job, I'm sure, though hopefully not as crazy as the rewrite eleven
years ago).
I wish I'd done it this way years ago (but maybe gcc 2.95 couldn't hack
the casts, I do know there were aliasing problems in the past). Anyway,
this makes operand access much more consistent for variable sized
operands (eg float vs double vs vec4), and is a big part of the new
instruction set implementation.
There is no reasonable way (due to hardware-enforced alignment issues)
to simply convert old bytecode to new (probably best done with an
off-line tool, preferably just recompiling when I get qfcc up to the
job), so both loops will need to be present. This just moves the
original loop into its own function in order to make it easy to bring in
the new (and iron out integration issues).
And add a unary op macro. Having VectorCompOp makes it easy to write
macros that work for multiple data widths, which is why it and its users
now use (dst, ...) instead of (..., dst) as in the past. I'll sort out
the other macros later now that I know the compiler handily gives
messages about the switched order (uninitialized vars etc).
For int, long, float and double. I've been meaning to add them for a
while, and they're part of the new Ruamoko instructions set (which is
progressing nicely).
The opcode table is a nightmare to maintain, but this does clean it up
and speed up opcode lookups since they can now be indexed. Of course, it
turns out I had missed adding several instructions, so had to fix that,
and qfcc needed a bit of a re-jigger to get the opcode out of the table.
The list of all allocated dispatch tables is used to free all the tables
when the progs are reloaded. Not clearing the list meant that the next
instance (second map change) corrupted the list.
Forgetting to unhook the functions (Sys_Printf and the client console's
input event handler) was not a problem for static builds because the
functions were always present, but in builds with dynamic plugins, the
client console's code got ripped away and thus Sys_Printf and the event
hander were being sent into invalid memory. Too much work, not enough
play (with a fully installed client).
The switch from using pr_functions (dfunction_t) to function_table
(bfunction_t) for keeping track of the current function (and thus
profiling data) broke PR_Profile as it never saw anything but 0.
Even NUM_FOR_BAD_EDICT will have a bad day if the edict pointer is
invalid, so make sure that the entity pointer is valid (within the edict
area AND a multiple of edict size).
PR_LoadDebug now does only the initial version and crc checks, and the
byte-swapping of the loaded symbols file. PR_DebugSetSym sets up all the
pointers.
The homogeneous coord was not being initialized and thus was picking up
rubbish from the stack. This is why the test would succeed in some
circumstances but fail in others.
Forgetting to invoke [super dealloc] in a derived class's -dealloc
method has caused me to waste far too much time chasing down the
resulting memory leaks and crashes. This is actually the main focus of
issue #24, but I want to take care of multiple paths before I consider
the issue to be done.
However, as a bonus, four cases were found :)
Fixes axis inputs being half what they should be. Can't quite get +1,
though (need to figure something out for the positive axis range being
slightly smaller than the negative range).
With some hacks that are not included (plan on handling events and
contexts properly), button inputs, including using listeners, are
working nicely: my little game is working again. While the trampoline
code was a bit repetitive (and I do want to clean that up), connecting
button listeners directly to Ruamoko instance methods proved to be quite
nice.
mtwist_rand_0_1 produces numbers in the range [0, 1) and
mtwist_rand_m1_1 produces numbers in the range (-1, 1). The numbers will
not be denormal, so the distribution should be fairly uniform (as much
as Mersenne Twister itself is), but this needs proper testing.
0 is included for the mtwist_rand_0_1 as it seems useful, but -1 is not
included in mtwist_rand_m1_1 in order to keep the extremes of the
distribution balanced around 0.
And create rua_game to coordinate other game builtins.
Menus are broken for key handling, but have been since the input rewrite
anyway. rua_input adds the ability to create buttons and axes (but not
destroy them). More work needs to be done to flesh things out, though.
This takes care of the global variables to a point (there is still the
global struct shared between the non-vulkan renderers), but it also
takes care of glsl's points-only rendering.
After yesterday's crazy marathon editing all the particles files, and
starting to do another big change to them today, I realized that I
really do need to merge them down. All the actual spawning is now in the
client library (though particle insertion will need to be moved). GLSL
particle rendering is semi-broken in that it now does only points (until
I come up with a way to select between points and quads (probably a
context object, which I need anyway for Vulkan)).
This may seem a little contradictory, but it's due to the difference
between a high level (engine) render pass and a Vulkan render pass
object (and quite likely a poor choice in names for the high level
object). This is necessary for supporting compute shader dispatches as
they cannot be submitted inside a Vulkan render pass.
This has the advantage of getting entity_t out of the particle system,
and much easier to read math. Also, it served as a nice test for my
particle physics shaders (implemented the ideas in C). There's a lot of
code that needs merging down: all but the actual drawing can be merged.
There's some weirdness with color ramps, but I'll look into that later.
They should increment by one for each pic, not 4 (I think some fluff
remaining from copying glsl's draw code).
I noticed the problem when I saw large gaps of 0s in the vertex data in
renderdoc.
This gets the crosshair working in Vulkan (next commit) and fixes issues
with changing the palette (though I've never seen a different palette
for quate, there's still the change from "all black" to an actual
palette).
This was needed to get crosshaircolor working correctly, but is likely
another step towards resizable windows (the listener set types are
generic for any viddef event, not just palette changes).
Holding onto the pointer is not a good idea, and it is read-only as
direct manipulation of the world matrix is not supported. However, this
is useful for passing the matrix to the GPU.
This means color, emission, and translucent. Fixes the HOM issues on my
VersaPro (but halves the frame-rate... definitely need to bring back the
forward renderer as an option).
This gets the pipelines loaded (and unloaded on shutdown). Probably the
easy part :P. Still need to sort out the command buffers,
synchronization, and particle generation (and probably a bunch else
that's not coming to mind).
This needed changing Vulkan_CreatePipeline to
Vulkan_CreateGraphicsPipeline for consistency (and parsing the
difference from a plist seemed... not worth thinking about).
It turned out the bindless approach wouldn't work too well for my design
of the sprite objects, but I don't think that's a big issue at this
stage (and it seems bindless is causing problems for brush/alias
rendering via renderdoc and on my versa pro). However, I have figured
out how to make effective use of descriptor sets (finally :P).
The actual normal still needs checking, but the sprites are currently
unlit so not an issue at this stage.
I'm not at all sure what I was thinking when I designed it, but I
certainly designed it wrong (to the point of being fairly useless). It
turns out memory requirements are already aligned in size (so just
multiplying is fine), and what I really wanted was to get the next
offset aligned to the given requirements.
This adds the shaders and the pipeline specs. I'm not sure that the
deferred rendering side of the render pass is appropriate, but I thought
I'd give it a go, since quake sprites are really cutoff rather than
translucent.
With the switch to multi-layer textures for brush models, the bsp and
alias texture descriptor sets became identical and thus the definitions
shareable. However, due to complications I don't want to address yet,
they're still separately identified, but I should be able to use the
texture set for most, if not all, pipelines.
The vertices and frame images are loaded into the one memory object,
with the vertices first followed by the images.
The vertices are 2D xy+uv sets meant to be applied to the model
transform frame, and are pre-computed for the sprite size (this part
does support sprites with varying frame image sizes).
The frame images are loaded into one image with each frame on its own
layer. This will cause some problems if any sprites with varying frame
image sizes are found, but the three sprites in quake are all uniform
size.
As much as it can be since the texture data is interleaved with the
model data in the files (I guess not that bad a design for 25 years ago
with the tight memory constraints), but this paves the way for
supporting sprites in Vulkan.
The cache system pointers are now indices into an array of
cache_system_t blocks, allowing them to be 32 bits instead of 64, thus
allowing cache_system_t to fit into a single CPU cache line. This still
gives and effective 38 bits (256GB) of addressing for cache/hunk. This
does mean that the cache functions cannot work with more than 256GB, but
should that become a problem, cache and working hunking hunk can be
separate, and it should be possible to have multiple cache systems.
There's no point in zeroing out memory that is only going to be
overwritten by the loaded file (excess bytes beyond the end of a
massaged text file shouldn't be accessed anyway, and the terminating
null is still written).
This is needed for cleaning up excess memsets when loading files because
Hunk_RawAllocName has nonnull on its hunk pointer (as the rest of the
hunk functions really should, but not just yet).
In trying to reduce unnecessary memsets when loading files, I found that
Hunk_RawAllocName already had nonnull on it, so quakefs needed to know
the hunk it was to use. It seemed much better to to go this way (first
step in what is likely to be a lengthy process) than backtracking a
little and removing the nonnull attribute.
As the sw renderer's implementation was the closest to id's, it was used
as the model (thus a fair bit of cleanup is still needed). This fixes
some incorrect implementations in glsl and gl.
I'd forgotten (when doing the original brush texture loader) that
turbulent surfaces were unlit and thus always full-bright, then never
wrote the turb shader to take care of it. The best solution seems to be
to just mix the two colors in the shader as it will allow turb surfaces
to be lit in the future (probably with severely limited light counts due
to being a forward renderer).
This gets the alias pipeline in line with the bsp pipeline, and thus
everything is about as functional as it was before the rework (minus
dealing with large texture sets).
I guess it's not quite bindless as the texture index is a push constant,
but it seems to work well (and I may have fixed some full-bright issues
by accident, though I suspect that's just my imagination, but they do
look good).
This should fix the horrid frame rate dependent behavior of the view
model.
They are also in their own descriptor set so they can be easily shared
between pipelines. This has been verified to work for Draw.
BSP textures are now two-layered with the albedo and emission in the two
layers rather than two separate images. While this does increase memory
usage for the textures themselves (most do not have fullbright pixels),
it cuts down on image and image view handles (and shader resources).
Smashing everything in the process :P (need to work on the C side).
However, while bindless is supposedly good for performance, the biggest
gain this will bring is portability: the texture counts are
automatically limited to what the hardware can handle, and the reliance
on push descriptors is removed (though they were nice and did help get
things up and running).
I had forgotten that the parameters are in reverse order, and even if I
had remembered, I forgot to reset offset before the second loop.
Pre-decrementing offset takes care of both issues at once.
My VersaPro doesn't support more than 32 per-stage samplers (lavapipe).
This is a small part of getting Vulkan to run on lavapipe and even in
itself is rather incomplete.
This allows using references in expressions, eg:
$frames.size * size_t($properties.limits.maxSamplers)
As references remain property list items until actually evaluated.
Fixes the warning about parse_fixed_array not being used (oops, the
problem with partial commits), but more importantly, gives access to
things like maxDescriptorSetSamplers.
This will make property list expressions easier to work with. The
library is rather limited right now (trig, dot, min/max/bound) but even
just min adds a lot of functionality.
For now, just dot product, trig, and min/max/bound, but it works well as
a proof of concept. The main goal was actually min. Only the list of
symbols is provided, it is the user's responsibility to set up the
symbol table and context.
cexpr's symbol tables currently aren't readily extended, and dynamic
scoping is usually a good thing anyway. The chain of contexts is walked
when a symbol is not found in the current context's symtab, but minor
efforts are made to avoid checking the same symtab twice (usually cased
by cloning a context but not updating the symtab).
I want to support reading VkPhysicalDeviceLimits but it has some arrays.
While I don't need to parse them (VkPhysicalDeviceLimits should be
treated as read-only), I do need to be able to access them in property
list expressions, and vkgen generates the cexpr type descriptors too.
However, I will probably want to parse arrays some time in the future.
This ensures that unused parser blocks do not get emitted. In the
testing of the upcoming support for fixed arrays, the blend color
constants were being double emitted (both as custom and normal parser)
due to being an array. gcc did not like that (what with all those
warning flags).
Multiple render passes are needed for supporting shadow mapping, and
this is a huge step towards breaking the Vulkan render free of Quake,
and hopefully will lead the way for breaking the GL renderers free as
well.
This is actually a better solution to the renderer directly accessing
client code than provided by 7e078c7f9c.
Essentially, V_RenderView should not have been calling R_RenderView, and
CL_UpdateScreen should have been calling V_RenderView directly. The
issue was that the renderers expected the world entity model to be valid
at all times. Now, R_RenderView checks the world entity model's validity
and immediately bails if it is not, and R_ClearState (which is called
whenever the client disconnects and thus no longer has a world to
render) clears the world entity model. Thus R_RenderView can (and is)
now called unconditionally from within the renderer, simplifying
renderer-specific variants.
The generated short names for a lot of Vulkan enums start with a number
(eg VK_IMAGE_TYPE_2D -> 2d). Having to prefix the short name with ` is a
tiny cost for the convenience.
While using binary data objects for specialization data works for bools
(as they can be 0 or -1), they don't work so well for numeric values due
to having to get the byte order correct and thus are not portable, and
difficult to get right.
Binary data is still supported, but the data can be written as a string
with an array(...) "constructor" expression taking any number of
parameters, with each parameter itself being an expression (though
values are limited at this stage).
Due to the plist format, quotes are required around the expression
("array(...)")
While there may be better solutions, I needed a varargs function for
building Vulkan specialization data. Like progs functions, negative
parameter counts indicate ellipsis with the number of fixed parameters
being equal to -param_count - 1.
Sets never shrink, so assigning a dynamically created set to a
statically created set after the working size has reduced (going from
demo2 to demo3) causes the set code to attempt to resize the statically
created set, which leads to libc having a bad time.
Why nvidia's drivers accepted double-destroyed framebuffers is beyond
me, but this fixes the Intel drivers complaining about such (and the
subsequent segfault).
When I changed the matrices from an array of floats to an array of
vec4f_t, I forgot to update the flush offsets. Yay for having a
Vulkan-capable Intel device with its different alignment requirements.
When allocating memory for multiple objects that have alignment
requirements, it gets tedious keeping track of the offset and the
alignment. This is a simple function for walking the offset respecting
size and alignment requirements, and doubles as a size calculator.
IN_ButtonAction treats id 0 as not pressed in its internal processing,
and the previous input implementation treated 0 as "no key", so this is
both the simplest and most correct fix.
Fixes mouse left button not working every second time the game is run
(due to keyboard and mouse bindings swapping places in the config file
(separate issue, if it really is one)).
I'm not sure what I was thinking when I made PL_RemoveObjectForKey take
a const plitem. One of those times where C could do with being a little
more strict.
While using barriers is a zillion times better than actually grabbing
the mouse and keyboard, they're still a pain when debugging as qf is not
able to respond to the barrier-hit events. All the other logic is still
there so even when "grabbing", the mouse will not be blocked if the
window doesn't have focus.
The stack is arbitrary strings that the validation layer debug callback
prints in reverse order after each message. This makes it easy to work
out what nodes in a pipeline/render pass plist are causing validation
errors. Still have to narrow down the actual line, but the messages seem
to help with that.
Putting qfvPushDebug/qfvPopDebug around other calls to vulkan should
help out a lot, tool.
As a bonus, the stack is printed before debug_breakpoint is called, so
it's immediately visible in gdb.
Rather than just 0/1, it now acts as flags to control what messages are
printed. In addition to the Vulkan enum names (long and short), none and
all are supported (as well as raw numbers, but they're not checked for
validity). This makes vulkan_use_validation a bit easier to use and less
verbose by default.
Now, if only it was easier to remember the name :P
Id's binding of escape to togglemenu interfered with the hard-coding
(want escape to togglemenu (or console as a fallback) no matter what).
This idea was part of mercury's original design, too.
I'm not at all happy with con_message and con_menu, but fixing them
properly will take a rework of the menus (planned, though). Also, the
Menu_ console command implementations are a bit iffy and could also do
with a rewrite (probably part of the rest of the menu rework) or just
nuking (they were part of Johnny on Flame's work, so I suspect had
something to do with joystick bindings).
It seems X11 does not like creating barriers entirely off the screen,
though the error seems to be a little unreliable (however, off the left
edge was definitely bad).
An imt switcher automatically changes the context's active imt based on
a user specified list of binary inputs. The inputs may be either buttons
(indicated as +button) or cvars (bare name). For buttons, the
pressed/not pressed state is used, and cvars are interpreted as ints
being 0 or not 0. The order of the inputs determines the bit number of
the input, with the first input being bit 0, second bit 1, third bit 2
etc. A default imt is given so large switchers do not need to be fully
configured (the default imt is written to all states).
A context can have any number of switchers attached. The switchers can
wind up fighting over the active imt, but this seems to be something for
the "user" (eg, configuration system) to sort out rather than the
switcher code enforcing anything.
As a result of the inputs being treated as bits, a switcher with N
inputs will have 2**N states, thus there's a maximum of 16 inputs for
now as 65536 states is a lot of configuration.
Using a switcher, setting up a standard strafe/mouse look configuration
is fairly easy.
imt_create key_game imt_mod
imt_create key_game imt_mod_strafe imt_mod
imt_create key_game imt_mod_freelook
imt_create key_game imt_mod_lookstrafe imt_mod_freelook
imt_switcher_create mouse key_game imt_mod_strafe +strafe lookstrafe +mlook freelook
imt_switcher 0 imt_mod 2 imt_mod 4 imt_mod_freelook 8 imt_mod_freelook 12 imt_mod_freelook
imt_switcher 6 imt_mod_lookstrafe 10 imt_mod_lookstrafe 14 imt_mod_lookstrafe
in_bind imt_mod mouse axis 0 move.yaw
in_bind imt_mod mouse axis 1 move.forward
in_bind imt_mod_strafe mouse axis 0 move.side
in_bind imt_mod_lookstrafe mouse axis 0 move.side
in_bind imt_mod_freelook mouse axis 1 move.pitch
This takes advantage of imt chaining and the default imt for the
switcher (there are 8 states that use imt_mod_strafe).
The switcher name must be unique across all contexts, and every imt used
in a switcher must be in the switcher's context.
The listener is invoked when the axis value changes due to IN_UpdateAxis
or IN_ClampAxis updating the axis. This does mean the listener
invocation make be somewhat delayed. I am a tad uncertain about this
design thus it being a separate commit.
Listeners are separate to the main callback as listeners have only
read-only access to the objects, but the main callback is free to modify
the cvar and thus can act as a parser and validator. The listeners are
invoked after the main callback if the cvar is modified. There does not
need to be a main callback for the listeners to be invoked.
This allows id1/qw config files, and to a certain extent scripts, to
work with the new binding system. It does highlight just how limited the
original system was (many keys could not bound).
Mouse axis input does not work yet as that needs a little more work to
support +strafe and +mlook.