Call's operand c is used to specify where the return value of the
function is to be stored. This gets both the correct function being
called, and the value being returned correctly. Test still fails due to
the stack restoration issue.
It currently fails for two reasons:
- call's mode selection is incorrect (never updated from when there was
only the one call instruction and the mode was encoded in operand c)
- return should automatically restore the stack pointer to the value it
had on entry to the function, thus allowing local values stored on
the stack to be safely returned.
I don't know why they were ever signed (oversight at id and just
propagated?). Anyway, this resulted in "unsigned" spreading a bit, but
all to reasonable places.
This has been a long-held wishlist item, really, and I thought I might
as well take the opportunity to add the instructions. The double
versions of STATE require both the nextthink field and time global to be
double (but they're not resolved properly yet: marked with
"FIXME double time" comments).
Also, the frame number for double time state is integer rather than
float.
In the end, I decided any/all/none should be separate from the other
horizontal ops, if I even do them (can be implemented by first
converting to bool, then using the appropriate horizontal operation (& |
etc).
ANY/ALL/NONE have been temporarily removed until I implement the HOPS
(horizontal operations) sub-instructions, which will all both 32-bit and
64-bit operands and several other operations (eg, horizontal add).
All the fancy addressing modes for the conditional branch instructions
have been permanently removed: I decided the gain was too little for the
cost (24 instructions vs 6). JUMP and CALL retain their addressing
modes, though.
Other instructions have been shuffled around a little to fill most of
the holes in the upper block of 256 instructions: just a single small
7-instruction hole.
Rearrangements in the actual engine are mostly just to keep the code
organized. The only real changes were the various IF statements and
dealing with the resulting changes in their addressing.
When creating the tests for lea, I noticed that B was yet another simple
assign, so I decided it was best to drop it and move E into its place
(freeing up another instruction).
Most useful for 64-bit values as only one instruction is needed to move
the data around rather than two, but could be slightly faster for 32-bit
as the addressing is simpler (needs profiling).
The compare/ne operator returns "random" -ve, 0, +ve values (really,
just the numerical difference between the chars of the strings), but all
the rest return -1 for true and 0 for false, as with the rest of the
comparison operators.
Does not include string concatenation because I don't feel like messing
with zone init, but all the other operators are tested (currently
failing due to bool convention)
It calculating only the size of the array (which was often 4 or 8
globals per element) proved to be a pain when I forgot to alter the size
for the new scale tests. Fixing the size calculation even found a bug in
the shiftop tests.
It seems casting from float/double to [unsigned] int/long when the value
doesn't fit is undefined (which would explain the inconsistent results).
Mentioning the possibility seems like a good idea should the results for
such casts change and cause the tests to fail.
Bools turned out to be a problem to due to me wanting any non-zero value
to be treated as true thus had to expand them out as well as the
floating point <-> integral conversions.
They currently fail because for vector values, gcc casts the view, not
the value, so vec4 cast to ivec4 simply views the bits as int rather
than doing the actual conversion.
Rather than specifying that the conversion should be skipped, it now
specifies the mode of the conversions (with 0 being no conversion). This
is in preparation for boolean conversion.
I realized that being able to do bit-wise operations with 64-bit values
(and 256-bit vectors) is far more important than some convenient boolean
logic operators. The logic ops can be handled via the bit-wise ops so
long as the values are all properly boolean, and I plan on adding some
boolean conversion ope, so no real loss.
Both float 2,3,4 vectors and double 2,3,4 vectors (1 would be just a
copy of the mul instructions).
This completes the currently planned instructions. Now for testing.
Not all possibilities are supported because converting between int and
uint, and long and ulong is essentially a no-op. However, thanks to
Deek's suggestion, not only are all reasonable conversions available,
conversions for all widths are available, so vector conversions are
supported.
The code for the conversions is generated.
Thanks to Deek for the suggestion: the mode (ie, src and dst types) are
encoded in st->b. Actual code not written yet, but this frees up 13
instructions: now have 74 available for really interesting stuff :)
The call1-8 instructions have been removed as they are really not needed
(they were put in when I had plans of simple translation of v6p progs to
ruamoko, but they joined the dinosaurs).
The call instruction lost mode A (that is now return) and its mode B is
just the regular function access. The important thing is op_c (with
support for with-bases) specifies the location of the return def.
The return instruction packs both its addressing mode and return value
size into st->c as a 3.5 value: 3 bits for the mode (it supports all
five addressing modes with entity.field being mode 4) and 5 for the
size, limiting return sizes to 32 words, which is enough for one 4x4
double matrix.
This, especially with the following convert patch, frees up a lot of
instructions.
Now they're in a much more consistent arrangement, in particular with
the comparison opcodes if the conditional branch instructions are
considered to be fast comparisons with zero (ifnot -> ifeq, if -> ifne,
etc). Unconditional jump and call fill in the gaps. The goal was to get
them all in an arrangement that would work as a small enum for qfcc: it
can use the enum directly for the ruamoko IS, and a small map array for
v6p (except for call).
Both pr_type_size and pr_type_name. I want to macroize the enum, but
need to sort out the clutter of headers first, just need to decide on
naming. This at least sorts out the missed values for now.
The bug (alignment issues with AVX on windows) seems to have in gcc from
the 4.x days, and is still present in 11.2: it does not ensure stack
parameters that need 32 byte alignment are aligned. Telling gcc to use
the sysv abi (safe on a static function) lets gcc do what it does for
linux (usually pass the parameters in registers, which it seems to have
done).
And partial implementations in qfcc (most places will generate an
internal error (not implemented) or segfault, but some low-hanging fruit
has already been implemented).
As I expect to be tweaking things for a while, it's part of the build
process. This will make it a lot easier to adjust mnemonics and argument
formats (tweaking the old table was a pain when conventions changed).
It's not quite done as it still needs arg widths and types.
While working on the new opcode table, I decided a lot of the names were
not to my liking. Part of the problem was the earlier clash with the
v6p opcode names, but that has been resolved via the v6p tag.
Use the new "1" versions of loadvec3 to get a 1 in w to avoid
divide-by-zero errors, and use the correct type for longs (forgot to
change i to l on the vector types).
It turned out I had no way of using a pointer or field as the value to
load, so all 4 modes are duplicated with loads from where operand b
points, but the loaded value interpreted the same way. Also, fixed an
error in the calculation of op-b offsets.
Statements can be bounds checked in the one place (jump calculation),
but memory accesses cannot as they can be used in lea instructions which
should never cause an exception (unless one of lea's operands is OOB).
* / % %% + -
As a bonus, includes partial tests for a few extra operators. Several
things are broken at this stage, but uncommitted code is already
working.
Float bit-ops as well.
Also, add q*v4 and v4*q instructions. There are currently 48 free
opcodes, and I might remove the scale instructions, but they could be
useful as expanding a single float to a vector would take 3 instructions
(copy to temp, swizzle-expand temp, multiply, vs just scale).
The swizzle instruction is very powerful in that in can do any of the
256 permutations of xyzw, optionally negate any combination of the
resulting components, and zero any combination of the result components
(even all). This means the one instruction can take care of any actual
swizzles, conjugation for complex and quaternion values, zeroing vectors
(not that it's the only way), and probably other weird things.
The python file was used to generate the jump table and actual swizzle
code.
They even found a bug in the addressing mode functions :) (I'd forgotten
that I wanted signed offsets from the pointer and thus forgot to cast
st->b to short in order to get the sign extension)
This allows the VM to select the right execution loop and qfcc currently
still produces only the old IS (it doesn't know how to deal with the new
IS yet)
When it's finalized (most of the conversion operations will go, probably
the float bit ops, maybe (very undecided) the 3-component vector ops,
and likely the CALLN ops), this will be the actual instruction set for
Ruamoko.
Main features:
- Significant reduction in redundant instructions: no more multiple
opcodes to move the one operand size.
- load, store, push, and pop share unified addressing mode encoding
(with the exception of mode 0 for load as that is redundant with mode
0 for store, thus load mode 0 gives quick access to entity.field).
- Full support for both 32 and 64 bit signed integer, unsigned integer,
and floating point values.
- SIMD for 1, 2, (currently) 3, and 4 components. Transfers support up
to 128-bit wide operations (need two operations to transfer a full
4-component double/long vector), but all math operations support both
128-bit (32-bit components) and 256-bit (64-bit components) vectors.
- "Interpreted" operations for the various vector sizes: complex dot
and multiplication, 3d vector dot and cross product, quaternion dot
and multiplication, along with qv and vq shortcuts.
- 4-component swizzles for both sizes (not yet implemented, but the
instructions are allocated), with the option to zero or negate (thus
conjugates for complex and quaternion values) individual components.
- "Based offsets": all relevant instructions include base register
indices for all three operands allowing for direct access to any of
four areas (eg, current entity, current stack frame, Objective-QC
self, ...) instructions to set a register and push/pop the four
registers to/from the stack.
Remaining work:
- Implement swizzle operations and a few other stragglers.
= Make a decision about conversion operations (if any instructions
remain, they'll be just single-component (at 14 meaningful pairs,
that's a lot of instructions to waste on SIMD versions).
- Decide whether to keep CALL1-CALL8: probably little point in
supporting two different calling conventions, and it would free up
another eight instructions.
- Unit tests for the instructions.
- Teach qfcc to generate code for the new instruction set (hah, biggest
job, I'm sure, though hopefully not as crazy as the rewrite eleven
years ago).
I wish I'd done it this way years ago (but maybe gcc 2.95 couldn't hack
the casts, I do know there were aliasing problems in the past). Anyway,
this makes operand access much more consistent for variable sized
operands (eg float vs double vs vec4), and is a big part of the new
instruction set implementation.
There is no reasonable way (due to hardware-enforced alignment issues)
to simply convert old bytecode to new (probably best done with an
off-line tool, preferably just recompiling when I get qfcc up to the
job), so both loops will need to be present. This just moves the
original loop into its own function in order to make it easy to bring in
the new (and iron out integration issues).
And add a unary op macro. Having VectorCompOp makes it easy to write
macros that work for multiple data widths, which is why it and its users
now use (dst, ...) instead of (..., dst) as in the past. I'll sort out
the other macros later now that I know the compiler handily gives
messages about the switched order (uninitialized vars etc).
For int, long, float and double. I've been meaning to add them for a
while, and they're part of the new Ruamoko instructions set (which is
progressing nicely).
The opcode table is a nightmare to maintain, but this does clean it up
and speed up opcode lookups since they can now be indexed. Of course, it
turns out I had missed adding several instructions, so had to fix that,
and qfcc needed a bit of a re-jigger to get the opcode out of the table.
The list of all allocated dispatch tables is used to free all the tables
when the progs are reloaded. Not clearing the list meant that the next
instance (second map change) corrupted the list.
Forgetting to unhook the functions (Sys_Printf and the client console's
input event handler) was not a problem for static builds because the
functions were always present, but in builds with dynamic plugins, the
client console's code got ripped away and thus Sys_Printf and the event
hander were being sent into invalid memory. Too much work, not enough
play (with a fully installed client).
The switch from using pr_functions (dfunction_t) to function_table
(bfunction_t) for keeping track of the current function (and thus
profiling data) broke PR_Profile as it never saw anything but 0.
Even NUM_FOR_BAD_EDICT will have a bad day if the edict pointer is
invalid, so make sure that the entity pointer is valid (within the edict
area AND a multiple of edict size).
PR_LoadDebug now does only the initial version and crc checks, and the
byte-swapping of the loaded symbols file. PR_DebugSetSym sets up all the
pointers.
The homogeneous coord was not being initialized and thus was picking up
rubbish from the stack. This is why the test would succeed in some
circumstances but fail in others.
Forgetting to invoke [super dealloc] in a derived class's -dealloc
method has caused me to waste far too much time chasing down the
resulting memory leaks and crashes. This is actually the main focus of
issue #24, but I want to take care of multiple paths before I consider
the issue to be done.
However, as a bonus, four cases were found :)
Fixes axis inputs being half what they should be. Can't quite get +1,
though (need to figure something out for the positive axis range being
slightly smaller than the negative range).
With some hacks that are not included (plan on handling events and
contexts properly), button inputs, including using listeners, are
working nicely: my little game is working again. While the trampoline
code was a bit repetitive (and I do want to clean that up), connecting
button listeners directly to Ruamoko instance methods proved to be quite
nice.
mtwist_rand_0_1 produces numbers in the range [0, 1) and
mtwist_rand_m1_1 produces numbers in the range (-1, 1). The numbers will
not be denormal, so the distribution should be fairly uniform (as much
as Mersenne Twister itself is), but this needs proper testing.
0 is included for the mtwist_rand_0_1 as it seems useful, but -1 is not
included in mtwist_rand_m1_1 in order to keep the extremes of the
distribution balanced around 0.
And create rua_game to coordinate other game builtins.
Menus are broken for key handling, but have been since the input rewrite
anyway. rua_input adds the ability to create buttons and axes (but not
destroy them). More work needs to be done to flesh things out, though.
This takes care of the global variables to a point (there is still the
global struct shared between the non-vulkan renderers), but it also
takes care of glsl's points-only rendering.
After yesterday's crazy marathon editing all the particles files, and
starting to do another big change to them today, I realized that I
really do need to merge them down. All the actual spawning is now in the
client library (though particle insertion will need to be moved). GLSL
particle rendering is semi-broken in that it now does only points (until
I come up with a way to select between points and quads (probably a
context object, which I need anyway for Vulkan)).
This may seem a little contradictory, but it's due to the difference
between a high level (engine) render pass and a Vulkan render pass
object (and quite likely a poor choice in names for the high level
object). This is necessary for supporting compute shader dispatches as
they cannot be submitted inside a Vulkan render pass.
This has the advantage of getting entity_t out of the particle system,
and much easier to read math. Also, it served as a nice test for my
particle physics shaders (implemented the ideas in C). There's a lot of
code that needs merging down: all but the actual drawing can be merged.
There's some weirdness with color ramps, but I'll look into that later.
They should increment by one for each pic, not 4 (I think some fluff
remaining from copying glsl's draw code).
I noticed the problem when I saw large gaps of 0s in the vertex data in
renderdoc.
This gets the crosshair working in Vulkan (next commit) and fixes issues
with changing the palette (though I've never seen a different palette
for quate, there's still the change from "all black" to an actual
palette).
This was needed to get crosshaircolor working correctly, but is likely
another step towards resizable windows (the listener set types are
generic for any viddef event, not just palette changes).
Holding onto the pointer is not a good idea, and it is read-only as
direct manipulation of the world matrix is not supported. However, this
is useful for passing the matrix to the GPU.