When calling a builtin, normally the return pointer needs to be
restored, but if the builtin changes the call depth (usually by
effecting "return foo()" as in support for objects, but possibly
setjmp/longjmp when they are implemented), then the return pointer must
not be restored. This gets vkgen past object allocation, but it dies
when trying to send messages to super. This appears to be a compiler
bug.
Since the operand types sort out the difference between asr and shr, no
need to give them different opnames. Means qfcc doesn't need to worry
about which one it's searching for.
Yet another redundant addressing mode (since ptr + 0 can be used), so
replace it with a variable-indexed array (same as in v6p). Was forced
into noticing the problem when trying to compile Machine.r.
I abandoned the reason for doing it (adding a pile of vector types), but
I liked the cleanup. All the implementations are hand-written still, but
at least the boilerplate stuff is automated.
Of course, only in Ruamoko progs, but it works quite nicely.
global_string is now passed the absolute address of the referenced
operand. With a little groveling through the progs stack, it should be
possible to resolve pointers to locals in functions further up the
stack.
This fixes Ruamoko's return format string. It looks like it's producing
the correct address (but doesn't show all the information it should),
but the rest of the debug code needs work for locals.
It turned out I need locals count and params_start for debugging, so use
the progs version instead to bail early from PR_EnterFunction and
PR_LeaveFunction (which I had forgotten anyway, oops).
They now include base register index and effective address of the
operands (though it may be wrong for instructions that don't use a base
register for that operand).
This cleans up dprograms_t, making it easier to read and see what chunks
are in it (I was surprised to see only 6, the explicit pairs made it
seem to have more).
Intel hardware requires 32-byte alignment for lvec4 and dvec4.
Unfortunately, it turns out that my attempts to align progs data in qfcc
went awry due to the order in which block sizes are calculated when
writing the progs.
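A minimal sketch of why the order matters when laying out aligned blocks.
The helper names are hypothetical (this is not qfcc's progs writer); the
point is only that the running size must be updated after the padding for
each block has been applied, otherwise later offsets lose their alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Round offset up to the next multiple of align, which must be a
   non-zero power of two. */
static uint32_t
ex_align_up (uint32_t offset, uint32_t align)
{
    assert (align && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Place a block after everything laid out so far and return its offset.
   The running total is only updated once the padding is known, so the
   next block sees the padded layout rather than the raw sizes. */
static uint32_t
ex_place_block (uint32_t *total, uint32_t size, uint32_t align)
{
    uint32_t    offset = ex_align_up (*total, align);

    *total = offset + size;
    return offset;
}
```

With byte units, a 20-byte block followed by a block that needs 32-byte
alignment puts the second block at offset 32, not 20.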
This makes return consistent with load, store, etc, though its
addressing mode is encoded in bits 5 and 6 of c rather than the opcode.
It turns out I had no tests for any of return's addressing modes other
than basic def references, so no tests needed changing.
The parameter defs are allocated from the parameter space using a
minimum alignment of 4, and varargs functions get a va_list struct in
place of the ...
An "args" expression is unconditionally injected into the call arguments
list at the place where ... is in the list, with arguments passed
through ... coming after the ...
Arguments get through to functions now, but there are problems with taking
the address of local variables: currently done using constant pointer
defs, which can't work for the base register addressing used in Ruamoko
progs.
With the update to test-bi's printf (and a hack to qfcc for lea),
triangle.r actually works, printing the expected results (but -1 instead
of 1 for equality, though that too is actually expected). qfcc will take
a bit longer because it seems there are some design issues in address
expressions (ambiguity, and a few other things) that have pretty much
always been there.
PR_SetupParams is new and sets up the parameter pointers so older code
that expects only up to 8 parameters will work with both v6p and Ruamoko
progs without having to check what progs are running. PR_SetupParams is
useful even when Ruamoko progs are expected as it reserves the required
space (respecting alignment) on the stack and returns a pointer to the
top (bottom? confusing) of the stack. PR_PushFrame and PR_PopFrame
need to be used around PR_SetupParams, regardless of using temp strings,
to avoid a stack leak (need to do an audit).
This is part of the work for #26 (Record resource pointer with builtin
function data). Currently, the data pointer gets as far as the
per-instance VM function table (I don't feel like tackling the job of
converting all the builtin functions tonight). All the builtin modules
that register a resources data block pass that block on to
PR_RegisterBuiltins.
The builtin and progs function data is overlaid so the extra data
doesn't cause too much memory to be used (it's actually 8 bytes smaller
now). The plan is to pre-compute the offsets based on the parameter
size and alignment data.
This will make it possible for the engine to set up their parameter
pointers when running Ruamoko progs. At this stage, it doesn't matter
*too* much, except for varargs functions, because no builtin yet takes
anything larger than a float quaternion, but it will be critical when
double or long vec3 and vec4 values are passed.
Just 32-bit rounding to next higher power of two, and base 2 logarithm.
Most importantly, they are suitable for use in initializers as they are
constant in, constant out.
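A sketch of what such constant-in, constant-out macros can look like. The
EX_ names are made up for illustration and are not the names used in the
tree. Two assumptions: a value that is already a power of two is left
unchanged, and EX_LOG2 is only meaningful for exact powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Smear the highest set bit into every lower bit position. */
#define EX_RUP1(x)  ((x) | ((x) >> 1))
#define EX_RUP2(x)  (EX_RUP1 (x) | (EX_RUP1 (x) >> 2))
#define EX_RUP4(x)  (EX_RUP2 (x) | (EX_RUP2 (x) >> 4))
#define EX_RUP8(x)  (EX_RUP4 (x) | (EX_RUP4 (x) >> 8))
#define EX_RUP16(x) (EX_RUP8 (x) | (EX_RUP8 (x) >> 16))

/* 32-bit round up to a power of two (exact powers are unchanged). */
#define EX_RUP_POW2(x) (EX_RUP16 ((uint32_t) (x) - 1) + 1)

/* Base 2 logarithm of a 32-bit power of two: each mask contributes one
   bit of the index of the single set bit. */
#define EX_LOG2(x) \
    (  ((((uint32_t) (x) & 0xaaaaaaaau) != 0) << 0) \
     | ((((uint32_t) (x) & 0xccccccccu) != 0) << 1) \
     | ((((uint32_t) (x) & 0xf0f0f0f0u) != 0) << 2) \
     | ((((uint32_t) (x) & 0xff00ff00u) != 0) << 3) \
     | ((((uint32_t) (x) & 0xffff0000u) != 0) << 4))

/* Both are integer constant expressions, so static initializers work. */
static const uint32_t ex_table_size = EX_RUP_POW2 (100);
static const uint32_t ex_table_bits = EX_LOG2 (EX_RUP_POW2 (100));
```

The macros evaluate their argument many times, so they are only suitable
for side-effect free (ideally constant) arguments.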
As even the simplest v6p functions that take parameters but have no
local or temporary variables still have locals for the local copy of the
parameters, this is both a good check for the Ruamoko ISA as its
functions never have locals (everything's on the progs data stack), and
an optimization for v6p functions that have no params or locals (simple
getters (very rare?), most .ctor, etc).
And fix an incorrect definition for RETURN_QUAT.
Prefixed MAX_STACK_DEPTH and LOCALSTACK_SIZE (and LOCALSTACK_SIZE got an
extra _).
The rest is just edits to documentation comments.
ldconst isn't implemented yet but the plan is to load various constants
(eg, 0, 1, 2, pi, e, ...).
Stack adjust is useful for adding an offset to the stack pointer without
having to worry about finding it (and it checks for alignment).
nop is just that :)
Due to how OP_RETURN works, a destination is required for any function
returning data, but the caller may not have allocated any space for the
value. Thus the VM maintains a buffer into which the data can be put and
ignored. It also makes a good place for return values when the engine
calls Ruamoko code as trusting progs code with return sizes seems like a
recipe for disaster, especially if the return location is on the C
stack.
It turned out that address mode B was redundant as C with 0 offset
(immediate) was the same (except for the underlying C code of course,
but adding st->b is very cheap). This allowed B to be used for
entity.field for all transfer operations. Thus instructions 0-3 are now
free as load E became load B, and other than the specifics of format
codes for statement printing, transfers+lea are unified.
This makes the v6p instruction table consistent with the ruamoko
instruction table, and clears up some of the ugliness with the load,
store, and assign instructions (. .= and = are now spelled out). I think
I'd still prefer an enum code (faster) but at least this is more
readable.
long is ignored for double, and v6p progs are stuck with 32 bits for
longs (don't feel like extending v6p any further), but the basics are
there for Ruamoko.
short is ignored for ints because the minimum size is 32, and signed is
just noise for ints anyway (and no chars, so...).
unsigned, however, is finally implemented properly (or at least seems to
be working correctly: tests pass after getting things compiling again,
and lt.u is used where it should be :)
And provide a table for such for qfcc and the like. With this, using
pr_double_t (for example) in C will cause the double value to always be
8-byte aligned and thus structures shared between gcc and qfcc will be
consistent (with a little fuss to take care of the warts).
And other related fields so integer is now int (and uinteger is uint). I
really don't know why I went with integer in the first place, but this
will make using macros easier for dealing with types.
They are both gone, and pr_pointer_t is now pr_ptr_t (pointer may be a
little clearer than ptr, but ptr is consistent with things like intptr,
and keeps the type name short).
This required delaying the setting of the return pointer by call until
after the current pointer had been saved, and thus passing the desired
pointer into PR_CallFunction (which does have some advantages for C
functions calling progs functions, but some dangers too (should ensure a
128 byte (32 word) buffer when calling untrusted code (which is any,
really))).
This fixes the issue of the data stack not being restored properly
because the returning function needs to return a value from its local
variables (stored on the stack) and accessing stack data below the stack
pointer is a bad idea (sure, no interrupts yet, but who knows...).
Call's operand c is used to specify where the return value of the
function is to be stored. This gets both the correct function being
called, and the value being returned correctly. Test still fails due to
the stack restoration issue.
It currently fails for two reasons:
- call's mode selection is incorrect (never updated from when there was
only the one call instruction and the mode was encoded in operand c)
- return should automatically restore the stack pointer to the value it
had on entry to the function, thus allowing local values stored on
the stack to be safely returned.
I don't know why they were ever signed (oversight at id and just
propagated?). Anyway, this resulted in "unsigned" spreading a bit, but
all to reasonable places.
This has been a long-held wishlist item, really, and I thought I might
as well take the opportunity to add the instructions. The double
versions of STATE require both the nextthink field and time global to be
double (but they're not resolved properly yet: marked with
"FIXME double time" comments).
Also, the frame number for double time state is integer rather than
float.
In the end, I decided any/all/none should be separate from the other
horizontal ops, if I even do them (can be implemented by first
converting to bool, then using the appropriate horizontal operation (& |
etc)).
ANY/ALL/NONE have been temporarily removed until I implement the HOPS
(horizontal operations) sub-instructions, which will support both 32-bit and
64-bit operands and several other operations (eg, horizontal add).
All the fancy addressing modes for the conditional branch instructions
have been permanently removed: I decided the gain was too little for the
cost (24 instructions vs 6). JUMP and CALL retain their addressing
modes, though.
Other instructions have been shuffled around a little to fill most of
the holes in the upper block of 256 instructions: just a single small
7-instruction hole.
Rearrangements in the actual engine are mostly just to keep the code
organized. The only real changes were the various IF statements and
dealing with the resulting changes in their addressing.
When creating the tests for lea, I noticed that B was yet another simple
assign, so I decided it was best to drop it and move E into its place
(freeing up another instruction).
Most useful for 64-bit values as only one instruction is needed to move
the data around rather than two, but could be slightly faster for 32-bit
as the addressing is simpler (needs profiling).
The compare/ne operator returns "random" -ve, 0, +ve values (really,
just the numerical difference between the chars of the strings), but all
the rest return -1 for true and 0 for false, as with the rest of the
comparison operators.
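A small sketch of the -1/0 convention in plain C, using hypothetical
helper names rather than the VM's actual opcode implementations. It also
shows why the convention is convenient: results can be combined with
bit-wise operators and still be proper booleans.

```c
#include <assert.h>
#include <stdint.h>

/* Comparisons yield all bits set (-1) for true and 0 for false. */
static int32_t
ex_lt (int32_t a, int32_t b)
{
    return -(int32_t) (a < b);
}

/* String compare: the difference between the first differing chars, so
   any negative, zero, or positive value can come back. */
static int32_t
ex_str_cmp (const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char) *a - (unsigned char) *b;
}

/* The other string comparisons collapse that difference to -1 or 0. */
static int32_t
ex_str_lt (const char *a, const char *b)
{
    return -(int32_t) (ex_str_cmp (a, b) < 0);
}
```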
Does not include string concatenation because I don't feel like messing
with zone init, but all the other operators are tested (currently
failing due to bool convention)
It calculating only the size of the array (which was often 4 or 8
globals per element) proved to be a pain when I forgot to alter the size
for the new scale tests. Fixing the size calculation even found a bug in
the shiftop tests.
It seems casting from float/double to [unsigned] int/long when the value
doesn't fit is undefined (which would explain the inconsistent results).
Mentioning the possibility seems like a good idea should the results for
such casts change and cause the tests to fail.
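For reference, a sketch of how such a cast can be guarded in C so the
out-of-range case never reaches the undefined conversion. The helper is
hypothetical and is not something the tests or the VM currently do.

```c
#include <assert.h>
#include <stdint.h>

/* Convert f to int32_t only when the truncated value fits. Returns 1 on
   success and 0 otherwise. NaN fails both comparisons, so it is rejected
   as well. Both bounds are exactly representable as float. */
static int
ex_float_to_int32 (float f, int32_t *out)
{
    if (!(f >= -2147483648.0f && f < 2147483648.0f)) {
        return 0;
    }
    *out = (int32_t) f;
    return 1;
}
```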
Bools turned out to be a problem due to me wanting any non-zero value
to be treated as true, thus I had to expand them out as well as the
floating point <-> integral conversions.
They currently fail because for vector values, gcc casts the view, not
the value, so vec4 cast to ivec4 simply views the bits as int rather
than doing the actual conversion.
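The difference between casting the view and casting the value, shown
with scalars to keep the sketch portable (it assumes float is IEEE 754
binary32, and the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* "Casting the view": the bits of the float are reinterpreted. */
static uint32_t
ex_view_bits (float f)
{
    uint32_t    u;

    memcpy (&u, &f, sizeof (u));
    return u;
}

/* "Casting the value": the number itself is converted (truncated). */
static int32_t
ex_convert_value (float f)
{
    return (int32_t) f;
}
```

For 1.0f the view is 0x3f800000 while the converted value is 1, which is
the kind of mismatch the vector tests run into.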
Rather than specifying that the conversion should be skipped, it now
specifies the mode of the conversions (with 0 being no conversion). This
is in preparation for boolean conversion.
I realized that being able to do bit-wise operations with 64-bit values
(and 256-bit vectors) is far more important than some convenient boolean
logic operators. The logic ops can be handled via the bit-wise ops so
long as the values are all properly boolean, and I plan on adding some
boolean conversion ops, so no real loss.
Both float 2,3,4 vectors and double 2,3,4 vectors (1 would be just a
copy of the mul instructions).
This completes the currently planned instructions. Now for testing.
Not all possibilities are supported because converting between int and
uint, and long and ulong is essentially a no-op. However, thanks to
Deek's suggestion, not only are all reasonable conversions available,
conversions for all widths are available, so vector conversions are
supported.
The code for the conversions is generated.
Thanks to Deek for the suggestion: the mode (ie, src and dst types) is
encoded in st->b. Actual code not written yet, but this frees up 13
instructions: now have 74 available for really interesting stuff :)
The call1-8 instructions have been removed as they are really not needed
(they were put in when I had plans of simple translation of v6p progs to
ruamoko, but they joined the dinosaurs).
The call instruction lost mode A (that is now return) and its mode B is
just the regular function access. The important thing is op_c (with
support for with-bases) specifies the location of the return def.
The return instruction packs both its addressing mode and return value
size into st->c as a 3.5 value: 3 bits for the mode (it supports all
five addressing modes with entity.field being mode 4) and 5 for the
size, limiting return sizes to 32 words, which is enough for one 4x4
double matrix.
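A sketch of the 3.5 packing. The message above does not spell out the
exact bit layout, so two things here are assumptions: the mode sits in
the three bits above the size, and the size is stored as size - 1 so that
5 bits can reach 32 words. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack addressing mode (0-7) and return size in words (1-32). */
static uint16_t
ex_pack_return (unsigned mode, unsigned size)
{
    assert (mode < 8);
    assert (size >= 1 && size <= 32);
    return (uint16_t) ((mode << 5) | (size - 1));
}

static unsigned
ex_return_mode (uint16_t c)
{
    return (c >> 5) & 7;
}

static unsigned
ex_return_size (uint16_t c)
{
    return (c & 0x1f) + 1;
}
```

A 4x4 double matrix is 16 doubles at 2 words each, which is exactly the
32-word limit.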
This, especially with the following convert patch, frees up a lot of
instructions.
Now they're in a much more consistent arrangement, in particular with
the comparison opcodes if the conditional branch instructions are
considered to be fast comparisons with zero (ifnot -> ifeq, if -> ifne,
etc). Unconditional jump and call fill in the gaps. The goal was to get
them all in an arrangement that would work as a small enum for qfcc: it
can use the enum directly for the ruamoko IS, and a small map array for
v6p (except for call).
Both pr_type_size and pr_type_name. I want to macroize the enum, but
need to sort out the clutter of headers first, just need to decide on
naming. This at least sorts out the missed values for now.
The bug (alignment issues with AVX on windows) seems to have been in gcc
since the 4.x days, and is still present in 11.2: it does not ensure stack
parameters that need 32 byte alignment are aligned. Telling gcc to use
the sysv abi (safe on a static function) lets gcc do what it does for
linux (usually pass the parameters in registers, which it seems to have
done).
And partial implementations in qfcc (most places will generate an
internal error (not implemented) or segfault, but some low-hanging fruit
has already been implemented).
As I expect to be tweaking things for a while, it's part of the build
process. This will make it a lot easier to adjust mnemonics and argument
formats (tweaking the old table was a pain when conventions changed).
It's not quite done as it still needs arg widths and types.
While working on the new opcode table, I decided a lot of the names were
not to my liking. Part of the problem was the earlier clash with the
v6p opcode names, but that has been resolved via the v6p tag.
Use the new "1" versions of loadvec3 to get a 1 in w to avoid
divide-by-zero errors, and use the correct type for longs (forgot to
change i to l on the vector types).
It turned out I had no way of using a pointer or field as the value to
load, so all 4 modes are duplicated with loads from where operand b
points, but the loaded value interpreted the same way. Also, fixed an
error in the calculation of op-b offsets.
Statements can be bounds checked in the one place (jump calculation),
but memory accesses cannot as they can be used in lea instructions which
should never cause an exception (unless one of lea's operands is OOB).
* / % %% + -
As a bonus, includes partial tests for a few extra operators. Several
things are broken at this stage, but uncommitted code is already
working.
Float bit-ops as well.
Also, add q*v4 and v4*q instructions. There are currently 48 free
opcodes, and I might remove the scale instructions, but they could be
useful as expanding a single float to a vector would take 3 instructions
(copy to temp, swizzle-expand temp, multiply, vs just scale).
The swizzle instruction is very powerful in that it can do any of the
256 permutations of xyzw, optionally negate any combination of the
resulting components, and zero any combination of the result components
(even all). This means the one instruction can take care of any actual
swizzles, conjugation for complex and quaternion values, zeroing vectors
(not that it's the only way), and probably other weird things.
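A loop-based sketch of what one swizzle does. The 16-bit layout used
here (8 bits of selectors, 4 negate bits, 4 zero bits) is an assumption
for illustration, and ex_swizzle4f is a hypothetical helper rather than
the generated jump-table code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed swizzle code layout:
     bits 0-7    four 2-bit source selectors (x=0 y=1 z=2 w=3), the
                 selector for destination x in the lowest two bits
     bits 8-11   negate mask, one bit per destination component
     bits 12-15  zero mask, one bit per destination component */
static void
ex_swizzle4f (float dst[4], const float src[4], uint16_t code)
{
    float       tmp[4];     /* allows dst and src to be the same array */

    for (int i = 0; i < 4; i++) {
        float       v = src[(code >> (2 * i)) & 3];

        if ((code >> (8 + i)) & 1) {
            v = -v;
        }
        if ((code >> (12 + i)) & 1) {
            v = 0;
        }
        tmp[i] = v;
    }
    for (int i = 0; i < 4; i++) {
        dst[i] = tmp[i];
    }
}
```

Under this layout 0x07e4 is a quaternion conjugate (identity selection
with x, y and z negated), 0x001b reverses the components, and 0xf0e4
zeroes the whole vector.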
The python file was used to generate the jump table and actual swizzle
code.
They even found a bug in the addressing mode functions :) (I'd forgotten
that I wanted signed offsets from the pointer and thus forgot to cast
st->b to short in order to get the sign extension)
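The bug in miniature, with hypothetical function names: a 16-bit operand
has to go through a signed 16-bit type before it is added to the pointer,
otherwise negative offsets turn into large positive ones.

```c
#include <assert.h>
#include <stdint.h>

/* Correct: sign-extend the 16-bit operand before adding. (The
   conversion to int16_t of values above 32767 is implementation-defined
   before C23, but wraps on the usual two's complement targets.) */
static uint32_t
ex_ptr_offset (uint32_t ptr, uint16_t operand_b)
{
    return ptr + (uint32_t) (int32_t) (int16_t) operand_b;
}

/* Buggy: the operand is zero-extended, so 0xfffc means +65532, not -4. */
static uint32_t
ex_ptr_offset_buggy (uint32_t ptr, uint16_t operand_b)
{
    return ptr + operand_b;
}
```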
This allows the VM to select the right execution loop and qfcc currently
still produces only the old IS (it doesn't know how to deal with the new
IS yet)
When it's finalized (most of the conversion operations will go, probably
the float bit ops, maybe (very undecided) the 3-component vector ops,
and likely the CALLN ops), this will be the actual instruction set for
Ruamoko.
Main features:
- Significant reduction in redundant instructions: no more multiple
opcodes to move the one operand size.
- load, store, push, and pop share unified addressing mode encoding
(with the exception of mode 0 for load as that is redundant with mode
0 for store, thus load mode 0 gives quick access to entity.field).
- Full support for both 32 and 64 bit signed integer, unsigned integer,
and floating point values.
- SIMD for 1, 2, (currently) 3, and 4 components. Transfers support up
to 128-bit wide operations (need two operations to transfer a full
4-component double/long vector), but all math operations support both
128-bit (32-bit components) and 256-bit (64-bit components) vectors.
- "Interpreted" operations for the various vector sizes: complex dot
and multiplication, 3d vector dot and cross product, quaternion dot
and multiplication, along with qv and vq shortcuts.
- 4-component swizzles for both sizes (not yet implemented, but the
instructions are allocated), with the option to zero or negate (thus
conjugates for complex and quaternion values) individual components.
- "Based offsets": all relevant instructions include base register
indices for all three operands allowing for direct access to any of
four areas (eg, current entity, current stack frame, Objective-QC
self, ...), along with instructions to set a register and push/pop the
four registers to/from the stack.
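A rough sketch of the based-offset idea from the feature list above. The
struct and function names are hypothetical, and treating the operand as
an unsigned 16-bit offset is an assumption; the real encoding of the base
indices in the instruction is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Four base registers, each pointing at some area of progs memory
   (globals, current entity, stack frame, ...). */
typedef struct {
    uint32_t    base[4];
} ex_vm_t;

/* Effective address of an operand: selected base register plus offset. */
static uint32_t
ex_operand_address (const ex_vm_t *vm, unsigned base_index, uint16_t offset)
{
    return vm->base[base_index & 3] + offset;
}

/* Point one base register at a new area. */
static void
ex_set_base (ex_vm_t *vm, unsigned base_index, uint32_t address)
{
    vm->base[base_index & 3] = address;
}
```

The same offset then reaches different areas depending only on which
base index the instruction carries for that operand.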
Remaining work:
- Implement swizzle operations and a few other stragglers.
- Make a decision about conversion operations (if any instructions
remain, they'll be just single-component (at 14 meaningful pairs,
that's a lot of instructions to waste on SIMD versions)).
- Decide whether to keep CALL1-CALL8: probably little point in
supporting two different calling conventions, and it would free up
another eight instructions.
- Unit tests for the instructions.
- Teach qfcc to generate code for the new instruction set (hah, biggest
job, I'm sure, though hopefully not as crazy as the rewrite eleven
years ago).
I wish I'd done it this way years ago (but maybe gcc 2.95 couldn't hack
the casts, I do know there were aliasing problems in the past). Anyway,
this makes operand access much more consistent for variable sized
operands (eg float vs double vs vec4), and is a big part of the new
instruction set implementation.
There is no reasonable way (due to hardware-enforced alignment issues)
to simply convert old bytecode to new (probably best done with an
off-line tool, preferably just recompiling when I get qfcc up to the
job), so both loops will need to be present. This just moves the
original loop into its own function in order to make it easy to bring in
the new (and iron out integration issues).
And add a unary op macro. Having VectorCompOp makes it easy to write
macros that work for multiple data widths, which is why it and its users
now use (dst, ...) instead of (..., dst) as in the past. I'll sort out
the other macros later now that I know the compiler handily gives
messages about the switched order (uninitialized vars etc).
For int, long, float and double. I've been meaning to add them for a
while, and they're part of the new Ruamoko instruction set (which is
progressing nicely).
The opcode table is a nightmare to maintain, but this does clean it up
and speed up opcode lookups since they can now be indexed. Of course, it
turns out I had missed adding several instructions, so had to fix that,
and qfcc needed a bit of a re-jigger to get the opcode out of the table.
The list of all allocated dispatch tables is used to free all the tables
when the progs are reloaded. Not clearing the list meant that the next
instance (second map change) corrupted the list.
Forgetting to unhook the functions (Sys_Printf and the client console's
input event handler) was not a problem for static builds because the
functions were always present, but in builds with dynamic plugins, the
client console's code got ripped away and thus Sys_Printf and the event
handler were being sent into invalid memory. Too much work, not enough
play (with a fully installed client).
The switch from using pr_functions (dfunction_t) to function_table
(bfunction_t) for keeping track of the current function (and thus
profiling data) broke PR_Profile as it never saw anything but 0.
Even NUM_FOR_BAD_EDICT will have a bad day if the edict pointer is
invalid, so make sure that the entity pointer is valid (within the edict
area AND a multiple of edict size).
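The two-part check as a sketch. The function and parameter names are
hypothetical, and the entity pointer is taken to be an offset from the
start of the edict area.

```c
#include <assert.h>
#include <stdint.h>

/* An entity pointer is valid only if it lies within the edict area and
   lands exactly on the start of an edict. */
static int
ex_entity_pointer_valid (uint32_t ent_ptr, uint32_t edict_size,
                         uint32_t max_edicts)
{
    if (!edict_size) {
        return 0;
    }
    /* 64-bit product so a huge area size cannot wrap around */
    if (ent_ptr >= (uint64_t) edict_size * max_edicts) {
        return 0;
    }
    return ent_ptr % edict_size == 0;
}
```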
PR_LoadDebug now does only the initial version and crc checks, and the
byte-swapping of the loaded symbols file. PR_DebugSetSym sets up all the
pointers.
The homogeneous coord was not being initialized and thus was picking up
rubbish from the stack. This is why the test would succeed in some
circumstances but fail in others.
Forgetting to invoke [super dealloc] in a derived class's -dealloc
method has caused me to waste far too much time chasing down the
resulting memory leaks and crashes. This is actually the main focus of
issue #24, but I want to take care of multiple paths before I consider
the issue to be done.
However, as a bonus, four cases were found :)
Fixes axis inputs being half what they should be. Can't quite get +1,
though (need to figure something out for the positive axis range being
slightly smaller than the negative range).
With some hacks that are not included (plan on handling events and
contexts properly), button inputs, including using listeners, are
working nicely: my little game is working again. While the trampoline
code was a bit repetitive (and I do want to clean that up), connecting
button listeners directly to Ruamoko instance methods proved to be quite
nice.
mtwist_rand_0_1 produces numbers in the range [0, 1) and
mtwist_rand_m1_1 produces numbers in the range (-1, 1). The numbers will
not be denormal, so the distribution should be fairly uniform (as much
as Mersenne Twister itself is), but this needs proper testing.
0 is included for the mtwist_rand_0_1 as it seems useful, but -1 is not
included in mtwist_rand_m1_1 in order to keep the extremes of the
distribution balanced around 0.
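One way to get those ranges from raw 32-bit random words, as a sketch
only: it assumes IEEE 754 binary32 floats, uses made-up names, and is not
necessarily how the mtwist functions are written. The trick is to drop 23
random bits into the mantissa under a fixed exponent and then subtract.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static float
ex_bits_to_float (uint32_t u)
{
    float       f;

    memcpy (&f, &u, sizeof (f));
    return f;
}

/* [0, 1): the mantissa under the exponent for [1, 2), minus 1. */
static float
ex_rand_0_1 (uint32_t bits)
{
    return ex_bits_to_float (0x3f800000u | (bits & 0x007fffffu)) - 1.0f;
}

/* (-1, 1): the mantissa under the exponent for [2, 4), minus 3. A zero
   mantissa would give exactly -1, so it is rejected: the function
   returns 0 and the caller is expected to draw another word. */
static int
ex_rand_m1_1 (uint32_t bits, float *out)
{
    uint32_t    mantissa = bits & 0x007fffffu;

    if (!mantissa) {
        return 0;
    }
    *out = ex_bits_to_float (0x40000000u | mantissa) - 3.0f;
    return 1;
}
```

Every result is a multiple of 2^-23 (or 2^-22 for the second function),
so nothing denormal is produced, and the extremes of the second range are
mirror images of each other.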
And create rua_game to coordinate other game builtins.
Menus are broken for key handling, but have been since the input rewrite
anyway. rua_input adds the ability to create buttons and axes (but not
destroy them). More work needs to be done to flesh things out, though.
This takes care of the global variables to a point (there is still the
global struct shared between the non-vulkan renderers), but it also
takes care of glsl's points-only rendering.