They are both gone, and pr_pointer_t is now pr_ptr_t (pointer may be a
little clearer than ptr, but ptr is consistent with things like intptr,
and keeps the type name short).
This required delaying the setting of the return pointer by call until
after the current pointer had been saved, and thus passing the desired
pointer into PR_CallFunction. This does have some advantages for C
functions calling progs functions, but some dangers too: a 128 byte (32
word) buffer should be ensured when calling untrusted code (which is
any, really).
This fixes the issue of the data stack not being restored properly
because the returning function needs to return a value from its local
variables (stored on the stack) and accessing stack data below the stack
pointer is a bad idea (sure, no interrupts yet, but who knows...).
Call's operand c is used to specify where the return value of the
function is to be stored. This gets both the correct function being
called, and the value being returned correctly. Test still fails due to
the stack restoration issue.
It currently fails for two reasons:
- call's mode selection is incorrect (never updated from when there was
only the one call instruction and the mode was encoded in operand c)
- return should automatically restore the stack pointer to the value it
had on entry to the function, thus allowing local values stored on
the stack to be safely returned.
I don't know why they were ever signed (oversight at id and just
propagated?). Anyway, this resulted in "unsigned" spreading a bit, but
all to reasonable places.
This has been a long-held wishlist item, really, and I thought I might
as well take the opportunity to add the instructions. The double
versions of STATE require both the nextthink field and time global to be
double (but they're not resolved properly yet: marked with
"FIXME double time" comments).
Also, the frame number for double time state is integer rather than
float.
In the end, I decided any/all/none should be separate from the other
horizontal ops, if I even do them (they can be implemented by first
converting to bool, then using the appropriate horizontal operation: &,
|, etc).
ANY/ALL/NONE have been temporarily removed until I implement the HOPS
(horizontal operations) sub-instructions, which will support both 32-bit
and 64-bit operands and several other operations (eg, horizontal add).
All the fancy addressing modes for the conditional branch instructions
have been permanently removed: I decided the gain was too little for the
cost (24 instructions vs 6). JUMP and CALL retain their addressing
modes, though.
Other instructions have been shuffled around a little to fill most of
the holes in the upper block of 256 instructions, leaving just a single
small 7-instruction hole.
Rearrangements in the actual engine are mostly just to keep the code
organized. The only real changes were the various IF statements and
dealing with the resulting changes in their addressing.
When creating the tests for lea, I noticed that B was yet another simple
assign, so I decided it was best to drop it and move E into its place
(freeing up another instruction).
Most useful for 64-bit values as only one instruction is needed to move
the data around rather than two, but could be slightly faster for 32-bit
as the addressing is simpler (needs profiling).
The compare/ne operator returns "random" -ve, 0, +ve values (really,
just the numerical difference between the chars of the strings), but all
the rest return -1 for true and 0 for false, as with the rest of the
comparison operators.
Does not include string concatenation because I don't feel like messing
with zone init, but all the other operators are tested (currently
failing due to bool convention)
Calculating only the size of the array (which was often 4 or 8 globals
per element) proved to be a pain when I forgot to alter the size for the
new scale tests. Fixing the size calculation even found a bug in the
shiftop tests.
It seems casting from float/double to [unsigned] int/long when the value
doesn't fit is undefined (which would explain the inconsistent results).
Mentioning the possibility seems like a good idea should the results for
such casts change and cause the tests to fail.
Bools turned out to be a problem due to my wanting any non-zero value
to be treated as true, thus they had to be expanded out as well as the
floating point <-> integral conversions.
They currently fail because for vector values, gcc casts the view, not
the value, so vec4 cast to ivec4 simply views the bits as int rather
than doing the actual conversion.
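For illustration, using gcc's vector extensions directly (the typedefs
are local to the example, not the test code):

    typedef float vec4  __attribute__ ((vector_size (16)));
    typedef int   ivec4 __attribute__ ((vector_size (16)));

    vec4  v = { 1.5, 2.5, 3.5, 4.5 };
    ivec4 a = (ivec4) v;                           // reinterprets the bits
    ivec4 b = __builtin_convertvector (v, ivec4);  // converts the values: 1 2 3 4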
Rather than specifying that the conversion should be skipped, it now
specifies the mode of the conversions (with 0 being no conversion). This
is in preparation for boolean conversion.
I realized that being able to do bit-wise operations with 64-bit values
(and 256-bit vectors) is far more important than some convenient boolean
logic operators. The logic ops can be handled via the bit-wise ops so
long as the values are all properly boolean, and I plan on adding some
boolean conversion ops, so no real loss.
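A trivial illustration of why the bit-wise ops suffice once everything
really is a -1/0 boolean:

    int t = -1, f = 0;   // proper booleans: -1 is true, 0 is false
    int a = t & f;       //  0, same result as a logical AND
    int o = t | f;       // -1, same result as a logical OR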
Both float 2,3,4 vectors and double 2,3,4 vectors (1 would be just a
copy of the mul instructions).
This completes the currently planned instructions. Now for testing.
Not all possibilities are supported because converting between int and
uint, and long and ulong is essentially a no-op. However, thanks to
Deek's suggestion, not only are all reasonable conversions available,
conversions for all widths are available, so vector conversions are
supported.
The code for the conversions is generated.
Thanks to Deek for the suggestion: the mode (ie, src and dst types) is
encoded in st->b. Actual code not written yet, but this frees up 13
instructions: now have 74 available for really interesting stuff :)
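One way such an encoding could look (the nibble split here is only an
assumption for illustration, not the actual layout):

    static void
    decode_conv (unsigned st_b, unsigned *src_type, unsigned *dst_type)
    {
        *src_type = st_b & 0x0f;         // low nibble: source type (assumed)
        *dst_type = (st_b >> 4) & 0x0f;  // high nibble: destination type (assumed)
    }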
The call1-8 instructions have been removed as they are really not needed
(they were put in when I had plans of simple translation of v6p progs to
ruamoko, but they joined the dinosaurs).
The call instruction lost mode A (that is now return) and its mode B is
just the regular function access. The important thing is op_c (with
support for with-bases) specifies the location of the return def.
The return instruction packs both its addressing mode and return value
size into st->c as a 3.5 value: 3 bits for the mode (it supports all
five addressing modes with entity.field being mode 4) and 5 for the
size, limiting return sizes to 32 words, which is enough for one 4x4
double matrix.
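Decoding that field might look roughly like this (which bits hold which
part, and whether the size is stored biased by one, are assumptions
here):

    static void
    decode_return_c (unsigned st_c, unsigned *mode, unsigned *size)
    {
        *mode = (st_c >> 5) & 0x07;   // 3 bits: addressing mode (position assumed)
        *size = (st_c & 0x1f) + 1;    // 5 bits: 1..32 words (assuming size - 1 is stored)
    }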
This, especially with the following convert patch, frees up a lot of
instructions.
Now they're in a much more consistent arrangement, in particular with
the comparison opcodes if the conditional branch instructions are
considered to be fast comparisons with zero (ifnot -> ifeq, if -> ifne,
etc). Unconditional jump and call fill in the gaps. The goal was to get
them all in an arrangement that would work as a small enum for qfcc: it
can use the enum directly for the ruamoko IS, and a small map array for
v6p (except for call).
Both pr_type_size and pr_type_name. I want to macroize the enum, but
need to sort out the clutter of headers first, just need to decide on
naming. This at least sorts out the missed values for now.
The bug (alignment issues with AVX on windows) seems to have been in
gcc since the 4.x days, and is still present in 11.2: it does not ensure
stack
parameters that need 32 byte alignment are aligned. Telling gcc to use
the sysv abi (safe on a static function) lets gcc do what it does for
linux (usually pass the parameters in registers, which it seems to have
done).
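The workaround boils down to something like this (the typedef and
function are illustrative, not the actual code):

    typedef double dvec4 __attribute__ ((vector_size (32)));  // needs 32 byte alignment

    static dvec4 __attribute__ ((sysv_abi))
    scale (dvec4 v, double s)
    {
        return v * s;   // sysv abi: gcc can pass v in a register instead of on the stack
    }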
And partial implementations in qfcc (most places will generate an
internal error (not implemented) or segfault, but some low-hanging fruit
has already been implemented).
As I expect to be tweaking things for a while, it's part of the build
process. This will make it a lot easier to adjust mnemonics and argument
formats (tweaking the old table was a pain when conventions changed).
It's not quite done as it still needs arg widths and types.
While working on the new opcode table, I decided a lot of the names were
not to my liking. Part of the problem was the earlier clash with the
v6p opcode names, but that has been resolved via the v6p tag.
Use the new "1" versions of loadvec3 to get a 1 in w to avoid
divide-by-zero errors, and use the correct type for longs (forgot to
change i to l on the vector types).
It turned out I had no way of using a pointer or field as the value to
load, so all 4 modes are duplicated with loads from where operand b
points, but with the loaded value interpreted the same way. Also, fixed
an error in the calculation of op-b offsets.
Statements can be bounds checked in the one place (jump calculation),
but memory accesses cannot as they can be used in lea instructions which
should never cause an exception (unless one of lea's operands is OOB).
* / % %% + -
As a bonus, includes partial tests for a few extra operators. Several
things are broken at this stage, but uncommitted code is already
working.
Float bit-ops as well.
Also, add q*v4 and v4*q instructions. There are currently 48 free
opcodes, and I might remove the scale instructions, but they could be
useful as expanding a single float to a vector would take 3 instructions
(copy to temp, swizzle-expand temp, multiply, vs just scale).
The swizzle instruction is very powerful in that it can do any of the
256 permutations of xyzw, optionally negate any combination of the
resulting components, and zero any combination of the result components
(even all). This means the one instruction can take care of any actual
swizzles, conjugation for complex and quaternion values, zeroing vectors
(not that it's the only way), and probably other weird things.
The python file was used to generate the jump table and actual swizzle
code.
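A scalar sketch of the semantics (the control-word layout here is an
assumption purely for illustration; the real encoding may differ):

    static void
    swizzle_f (const float src[4], float dst[4], unsigned ctrl)
    {
        for (int i = 0; i < 4; i++) {
            float v = src[(ctrl >> (2 * i)) & 3];  // select source component (256 permutations)
            if (ctrl & (0x100 << i))               // negate this component?
                v = -v;
            if (ctrl & (0x1000 << i))              // zero this component?
                v = 0;
            dst[i] = v;
        }
    }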
They even found a bug in the addressing mode functions :) (I'd forgotten
that I wanted signed offsets from the pointer and thus forgot to cast
st->b to short in order to get the sign extension)
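That is, something along these lines (st->b shown as a plain unsigned
short for illustration):

    unsigned short b = 0xfffe;  // 16-bit operand meant to be a signed offset
    int wrong = b;              // 65534: no sign extension
    int right = (short) b;      // -2: cast to short first, then widen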
This allows the VM to select the right execution loop and qfcc currently
still produces only the old IS (it doesn't know how to deal with the new
IS yet)
When it's finalized (most of the conversion operations will go, probably
the float bit ops, maybe (very undecided) the 3-component vector ops,
and likely the CALLN ops), this will be the actual instruction set for
Ruamoko.
Main features:
- Significant reduction in redundant instructions: no more multiple
opcodes to move the one operand size.
- load, store, push, and pop share unified addressing mode encoding
(with the exception of mode 0 for load as that is redundant with mode
0 for store, thus load mode 0 gives quick access to entity.field).
- Full support for both 32 and 64 bit signed integer, unsigned integer,
and floating point values.
- SIMD for 1, 2, (currently) 3, and 4 components. Transfers support up
to 128-bit wide operations (need two operations to transfer a full
4-component double/long vector), but all math operations support both
128-bit (32-bit components) and 256-bit (64-bit components) vectors.
- "Interpreted" operations for the various vector sizes: complex dot
and multiplication, 3d vector dot and cross product, quaternion dot
and multiplication, along with qv and vq shortcuts.
- 4-component swizzles for both sizes (not yet implemented, but the
instructions are allocated), with the option to zero or negate (thus
conjugates for complex and quaternion values) individual components.
- "Based offsets": all relevant instructions include base register
indices for all three operands allowing for direct access to any of
four areas (eg, current entity, current stack frame, Objective-QC
self, ...) instructions to set a register and push/pop the four
registers to/from the stack.
Remaining work:
- Implement swizzle operations and a few other stragglers.
- Make a decision about conversion operations (if any instructions
remain, they'll be just single-component: at 14 meaningful pairs,
that's a lot of instructions to waste on SIMD versions).
- Decide whether to keep CALL1-CALL8: probably little point in
supporting two different calling conventions, and it would free up
another eight instructions.
- Unit tests for the instructions.
- Teach qfcc to generate code for the new instruction set (hah, biggest
job, I'm sure, though hopefully not as crazy as the rewrite eleven
years ago).
I wish I'd done it this way years ago (but maybe gcc 2.95 couldn't hack
the casts, I do know there were aliasing problems in the past). Anyway,
this makes operand access much more consistent for variable sized
operands (eg float vs double vs vec4), and is a big part of the new
instruction set implementation.
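A minimal sketch of the cast-based access (the names mirror the
interpreter's style, but the definitions here are illustrative):

    typedef float  pr_float_t;
    typedef double pr_double_t;

    #define OPA(type) (*((pr_##type##_t *) op_a))  // view operand a at the needed width
    #define OPB(type) (*((pr_##type##_t *) op_b))
    #define OPC(type) (*((pr_##type##_t *) op_c))

    static void
    add_f (void *op_a, void *op_b, void *op_c)
    {
        OPC (float) = OPA (float) + OPB (float);      // 32-bit operands
    }

    static void
    add_d (void *op_a, void *op_b, void *op_c)
    {
        OPC (double) = OPA (double) + OPB (double);   // 64-bit operands, same pattern
    }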
There is no reasonable way (due to hardware-enforced alignment issues)
to simply convert old bytecode to new (probably best done with an
off-line tool, preferably just recompiling when I get qfcc up to the
job), so both loops will need to be present. This just moves the
original loop into its own function in order to make it easy to bring in
the new (and iron out integration issues).
The opcode table is a nightmare to maintain, but this does clean it up
and speed up opcode lookups since they can now be indexed. Of course, it
turns out I had missed adding several instructions, so had to fix that,
and qfcc needed a bit of a re-jigger to get the opcode out of the table.
The switch from using pr_functions (dfunction_t) to function_table
(bfunction_t) for keeping track of the current function (and thus
profiling data) broke PR_Profile as it never saw anything but 0.