The bug (alignment issues with AVX on windows) seems to have been in gcc
since the 4.x days, and is still present in 11.2: it does not ensure stack
parameters that need 32 byte alignment are aligned. Telling gcc to use
the sysv abi (safe on a static function) lets gcc do what it does for
linux (usually pass the parameters in registers, which it seems to have
done).
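
A minimal sketch of the workaround, assuming a gcc-compatible compiler.
The names (vec4d, VEC_ABI, scale_add, scale_add_x) are made up for
illustration, and the attribute is applied only on 64-bit windows builds
so the same source still compiles elsewhere:

```c
/* 4-component double vector: 32 bytes, wants 32-byte alignment
 * (gcc vector extension). */
typedef double vec4d __attribute__ ((vector_size (32)));

#if defined (_WIN32) && defined (__x86_64__)
/* windows x86-64 only: use the System V convention for this function */
# define VEC_ABI __attribute__ ((sysv_abi))
#else
# define VEC_ABI
#endif

/* static, so no external caller depends on the platform's default ABI */
VEC_ABI static vec4d
scale_add (vec4d a, vec4d b, double s)
{
	/* gcc broadcasts the scalar s across the vector */
	return a * s + b;
}

/* Plain-double wrapper so callers never pass a vec4d themselves. */
double
scale_add_x (double ax, double bx, double s)
{
	vec4d       a = { ax, 0, 0, 0 };
	vec4d       b = { bx, 0, 0, 0 };
	vec4d       r = scale_add (a, b, s);
	return r[0];
}
```

Keeping the function static is what makes the ABI switch safe: every
caller is in the same file and sees the same attribute.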
While working on the new opcode table, I decided a lot of the names were
not to my liking. Part of the problem was the earlier clash with the
v6p opcode names, but that has been resolved via the v6p tag.
Use the new "1" versions of loadvec3 to get a 1 in w to avoid
divide-by-zero errors, and use the correct type for longs (forgot to
change i to l on the vector types).
It turned out I had no way of using a pointer or field as the value to
load, so all 4 modes are duplicated with loads from where operand b
points, but the loaded value interpreted the same way. Also, fixed an
error in the calculation of op-b offsets.
Statements can be bounds checked in the one place (jump calculation),
but memory accesses cannot as they can be used in lea instructions which
should never cause an exception (unless one of lea's operands is OOB).
Float bit-ops as well.
Also, add q*v4 and v4*q instructions. There are currently 48 free
opcodes, and I might remove the scale instructions, but they could be
useful as expanding a single float to a vector would take 3 instructions
(copy to temp, swizzle-expand temp, multiply, vs just scale).
The swizzle instruction is very powerful in that it can do any of the
256 permutations of xyzw, optionally negate any combination of the
resulting components, and zero any combination of the result components
(even all). This means the one instruction can take care of any actual
swizzles, conjugation for complex and quaternion values, zeroing vectors
(not that it's the only way), and probably other weird things.
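
As a rough illustration of what the one instruction covers, here is a
loop-based sketch. The control word layout (low 8 bits select the source
component, two bits per result component, then 4 negate bits, then 4 zero
bits) and all the names are assumptions for the example, and a simple
loop stands in for the generated jump table:

```c
#include <stdint.h>

typedef struct { float v[4]; } vec4f_sketch;

/* Assumed layout of ctl:
 *   bits 0-7   source index for result x,y,z,w (2 bits each)
 *   bits 8-11  negate mask for result x,y,z,w
 *   bits 12-15 zero mask for result x,y,z,w
 */
vec4f_sketch
swizzle_sketch (vec4f_sketch src, uint16_t ctl)
{
	vec4f_sketch dst;

	for (int i = 0; i < 4; i++) {
		unsigned    sel = (ctl >> (2 * i)) & 3;
		float       val = src.v[sel];

		if (ctl & (0x0100u << i)) {
			val = -val;
		}
		if (ctl & (0x1000u << i)) {
			val = 0.0f;
		}
		dst.v[i] = val;
	}
	return dst;
}
```

With this layout, 0x00e4 is the identity (xyzw), 0x001b reverses the
components (wzyx), 0x07e4 is a quaternion conjugate (negate x, y and z),
and 0xf0e4 zeroes the whole vector.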
The python file was used to generate the jump table and actual swizzle
code.
They even found a bug in the addressing mode functions :) (I'd forgotten
that I wanted signed offsets from the pointer and thus forgot to cast
st->b to short in order to get the sign extension)
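
The difference the cast makes, as a small sketch. The statement layout
and function names here are hypothetical, only the unsigned 16-bit b
operand matters:

```c
#include <stdint.h>

/* Hypothetical statement layout: all operands are unsigned 16-bit. */
typedef struct {
	uint16_t    op, a, b, c;
} statement_sketch;

/* Without the cast, 0xfffe is promoted to +65534. */
int32_t
offset_unsigned_sketch (const statement_sketch *st)
{
	return st->b;
}

/* With the cast, 0xfffe is sign extended to -2 (the conversion to a
 * signed 16-bit type is implementation-defined in C, but this is what
 * two's complement targets do). */
int32_t
offset_signed_sketch (const statement_sketch *st)
{
	return (int16_t) st->b;
}
```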
This allows the VM to select the right execution loop; qfcc currently
still produces only the old IS (it doesn't know how to deal with the new
IS yet).
When it's finalized (most of the conversion operations will go, probably
the float bit ops, maybe (very undecided) the 3-component vector ops,
and likely the CALLN ops), this will be the actual instruction set for
Ruamoko.
Main features:
- Significant reduction in redundant instructions: no more multiple
opcodes to move the one operand size.
- load, store, push, and pop share unified addressing mode encoding
(with the exception of mode 0 for load as that is redundant with mode
0 for store, thus load mode 0 gives quick access to entity.field).
- Full support for both 32 and 64 bit signed integer, unsigned integer,
and floating point values.
- SIMD for 1, 2, (currently) 3, and 4 components. Transfers support up
to 128-bit wide operations (need two operations to transfer a full
4-component double/long vector), but all math operations support both
128-bit (32-bit components) and 256-bit (64-bit components) vectors.
- "Interpreted" operations for the various vector sizes: complex dot
and multiplication, 3d vector dot and cross product, quaternion dot
and multiplication, along with qv and vq shortcuts.
- 4-component swizzles for both sizes (not yet implemented, but the
instructions are allocated), with the option to zero or negate (thus
conjugates for complex and quaternion values) individual components.
- "Based offsets": all relevant instructions include base register
indices for all three operands allowing for direct access to any of
four areas (eg, current entity, current stack frame, Objective-QC
self, ...), with instructions to set a register and push/pop the four
registers to/from the stack.
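
The "based offsets" idea boils down to something like the following
sketch. The struct, the function name and the two-bit base index per
operand are assumptions for illustration, not the actual encoding:

```c
#include <stdint.h>

/* Four base registers, each holding a word address into VM memory. */
typedef struct {
	uint32_t    base[4];
} vm_bases_sketch;

/* Resolve one operand: pick a base register by index and add the
 * operand's 16-bit offset from the statement. */
uint32_t
operand_addr_sketch (const vm_bases_sketch *vm, unsigned base_ind,
					 uint16_t offset)
{
	return vm->base[base_ind & 3] + offset;
}
```

Each of the three operands gets its own base index, so a single
instruction can, for example, read from the current entity and write to
the current stack frame.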
Remaining work:
- Implement swizzle operations and a few other stragglers.
- Make a decision about conversion operations (if any instructions
remain, they'll be just single-component (at 14 meaningful pairs,
that's a lot of instructions to waste on SIMD versions)).
- Decide whether to keep CALL1-CALL8: probably little point in
supporting two different calling conventions, and it would free up
another eight instructions.
- Unit tests for the instructions.
- Teach qfcc to generate code for the new instruction set (hah, biggest
job, I'm sure, though hopefully not as crazy as the rewrite eleven
years ago).
I wish I'd done it this way years ago (but maybe gcc 2.95 couldn't hack
the casts; I do know there were aliasing problems in the past). Anyway,
this makes operand access much more consistent for variable sized
operands (eg float vs double vs vec4), and is a big part of the new
instruction set implementation.
There is no reasonable way (due to hardware-enforced alignment issues)
to simply convert old bytecode to new (probably best done with an
off-line tool, preferably just recompiling when I get qfcc up to the
job), so both loops will need to be present. This just moves the
original loop into its own function in order to make it easy to bring in
the new (and iron out integration issues).
The server edict arrays are now stored outside of progs memory, only the
entity data itself (ie data accessible to progs via ent.fld) is stored in
progs memory. Many of the changes were due to code accessing edicts and
entity fields directly rather than through the provided macros.
And rename prd_exit to prd_terminate (the idea is the host will
terminate the VM). This makes it possible for the debugger to pause the
VM before any code, even a builtin function, is executed. Breaks the
debugger source window, but only because it's not updating on file
change (I think).
I decided I want events for VM enter/exit but enter needs to somehow
pass the function which will be executed (even if a builtin). A generic
void * param seemed the best idea, which meant the error string could be
passed via the param instead of a "global" string in the progs struct.
While there was a breakpoint hook, it was only for breakpoints and more
was needed. Now there's a generic hook that is called for tracing,
breakpoints, watch points, runtime errors and VM errors, with the
"event" type passed as the first parameter and a data pointer in the
second.
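
The shape of such a hook, as a sketch. The enum, type and function names
are invented for the example and are not the engine's actual
identifiers:

```c
#include <stddef.h>

/* One hook for everything: the event says what happened. */
typedef enum {
	vm_ev_trace,
	vm_ev_breakpoint,
	vm_ev_watchpoint,
	vm_ev_runerror,
	vm_ev_vmerror,
	vm_ev_count
} vm_event_sketch;

/* Event type first, generic data pointer second. */
typedef void (*vm_debug_hook_sketch) (vm_event_sketch event, void *param);

typedef struct {
	vm_debug_hook_sketch hook;
} vm_sketch;

static int  event_counts[vm_ev_count];
static const char *last_error;

/* Example handler: count events, and for errors treat param as the
 * error string. */
void
counting_hook_sketch (vm_event_sketch event, void *param)
{
	event_counts[event]++;
	if (event == vm_ev_runerror || event == vm_ev_vmerror) {
		last_error = param;
	}
}

/* Called by the VM wherever something interesting happens. */
void
vm_raise_sketch (vm_sketch *vm, vm_event_sketch event, void *param)
{
	if (vm->hook) {
		vm->hook (event, param);
	}
}
```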
The memset instructions now match the move* instructions other than the
first operand (always int). Probably breaks much, but fixed in next few
commits.
If a temp string is found in the return slot, PR_FreeTempStrings won't
delete the string. However, PR_PopFrame was blindly stomping on the
possibly surviving temp string with the push strings, which would cause
a leak.
This "pushes" a temp string onto the callee's stack frame after removing
it from the caller's stack frame. This is so builtins can pass
auto-freed memory to called progs code. No checking is done, but mayhem
is likely to ensue if a string is pushed that was allocated in an
earlier frame.
The progs execution code will call a breakpoint handler just before
executing an instruction with the flag set. This means there's no need
for the breakpoint handler to mess with execution state or even the
instruction in order to continue past the breakpoint.
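
A toy execution loop showing the idea. The flag bit (0x8000), the two
opcodes and all the names are assumptions made up for this sketch:

```c
#include <stdint.h>

#define OP_BREAK_SKETCH 0x8000u		/* assumed flag bit in the opcode */

typedef struct {
	uint16_t    op;
	int16_t     a;
} insn_sketch;

typedef void (*break_fn_sketch) (const insn_sketch *st, void *data);

/* Toy VM: op 0 = halt and return the accumulator, op 1 = add a. */
int
run_sketch (const insn_sketch *code, break_fn_sketch on_break, void *data)
{
	int         acc = 0;

	for (const insn_sketch *st = code; ; st++) {
		/* The handler runs just before the flagged instruction... */
		if ((st->op & OP_BREAK_SKETCH) && on_break) {
			on_break (st, data);
		}
		/* ...and the instruction then executes with the flag masked
		 * off, so the handler never has to patch anything. */
		switch (st->op & ~OP_BREAK_SKETCH) {
			case 0:
				return acc;
			case 1:
				acc += st->a;
				break;
			default:
				return -1;
		}
	}
}

/* Example handler: data points at an int hit counter. */
void
count_break_sketch (const insn_sketch *st, void *data)
{
	(void) st;
	++*(int *) data;
}
```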
The flag being set in a progs file is invalid.
It is now set to 0 when progs are loaded and every time
PR_ExecuteProgram() returns. This takes care of the default case, but
when setting parameters, pr_argc needs to be set correctly in case a
vararg function is called.
PR_SaveParams() is required for implementing the +initialize diversion
used by Objective-QuakeC because builtins do not have local def spaces
(of course, a normal stack calling convention would help). However, it
is entirely possible for a call to +initialize to trigger another call
to +initialize, thus the need for stacking parameter stashes. As a
bonus, this implementation cleans up some fields in progs_t.
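
The stacking can be pictured as a simple linked stack of stashes. This
is only a sketch: the types, the fixed parameter count and the function
names are hypothetical, and plain ints stand in for the real parameter
data:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_PARAMS_SKETCH 8

typedef struct stash_sketch_s {
	struct stash_sketch_s *next;
	int         argc;
	int         params[MAX_PARAMS_SKETCH];
} stash_sketch_t;

typedef struct {
	int         argc;
	int         params[MAX_PARAMS_SKETCH];
	stash_sketch_t *stashes;		/* most recent stash first */
} vm_params_sketch_t;

/* Push a copy of the current parameters. Returns 0 on allocation
 * failure. */
int
save_params_sketch (vm_params_sketch_t *vm)
{
	stash_sketch_t *s = malloc (sizeof (*s));

	if (!s) {
		return 0;
	}
	s->argc = vm->argc;
	memcpy (s->params, vm->params, sizeof (s->params));
	s->next = vm->stashes;
	vm->stashes = s;
	return 1;
}

/* Pop the most recent stash back into the live parameters. Returns 0
 * if nothing was saved. */
int
restore_params_sketch (vm_params_sketch_t *vm)
{
	stash_sketch_t *s = vm->stashes;

	if (!s) {
		return 0;
	}
	vm->stashes = s->next;
	vm->argc = s->argc;
	memcpy (vm->params, s->params, sizeof (vm->params));
	free (s);
	return 1;
}
```

Because each save pushes a new stash, a nested +initialize restores its
own parameters first and leaves the outer call's stash untouched.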
The engine now requires non-v6 progs to store the log2 alignment for the
param struct in .param_alignment.
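
For reference, this is how a log2 alignment is typically applied. The
helper name is made up and this is not necessarily how the engine uses
the value:

```c
#include <stdint.h>

/* Round offset up to the next multiple of (1 << log2_align).
 * log2_align must be less than 32. */
uint32_t
align_up_sketch (uint32_t offset, unsigned log2_align)
{
	uint32_t    mask = (UINT32_C (1) << log2_align) - 1;

	return (offset + mask) & ~mask;
}
```

Storing the log2 rather than the alignment itself guarantees a power of
two, so the mask trick always works.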
PR_EnterFunction is clearer and possibly more efficient.