Just 32-bit rounding to next higher power of two, and base 2 logarithm.
Most importantly, they are suitable for use in initializers as they are
constant in, constant out.
And add a unary op macro. Having VectorCompOp makes it easy to write
macros that work for multiple data widths, which is why it and its users
now use (dst, ...) instead of (..., dst) as in the past. I'll sort out
the other macros later now that I know the compiler handily gives
messages about the switched order (uninitialized vars etc).
For int, long, float and double. I've been meaning to add them for a
while, and they're part of the new Ruamoko instructions set (which is
progressing nicely).
The homogeneous coord was not being initialized and thus was picking up
rubbish from the stack. This is why the test would succeed in some
circumstances but fail in others.
For now, just dot product, trig, and min/max/bound, but it works well as
a proof of concept. The main goal was actually min. Only the list of
symbols is provided, it is the user's responsibility to set up the
symbol table and context.
cexpr's symbol tables currently aren't readily extended, and dynamic
scoping is usually a good thing anyway. The chain of contexts is walked
when a symbol is not found in the current context's symtab, but minor
efforts are made to avoid checking the same symtab twice (usually cased
by cloning a context but not updating the symtab).
I decided cvars and input buttons/axes need listeners so any changes to
them can be propagated. This will make using cvars in bindings feasible
and I have an idea for automatic imt switching that would benefit from
listeners attached to buttons and cvars.
At the low level, only unions can cause a set to grow. Of course, things
get interesting at the higher level when infinite (inverted) sets are
mixed in.
Instead of printing every representable member of an infinite set (ie,
up to element 63 in a set that can hold 64 elements), only those
elements up to one after the last non-member are listed. For example,
{...} - {2 3} -> {0 1 4 ...}
This makes reading (and testing!) infinite sets much easier.
Attempting to vis ad_tears drags a few lurking bugs out of
SmallestEnclosingBall_vf: poor calculation of 2-point affine space, poor
handling of duplicate points and dropped support points, poor
calculation of the new center (related to duplicate points), and
insufficient iterations for large point sets. qfvis (modified for
cluster spheres) now loads ad_tears.
Scaling the checks by 1e-6 was a little too tight for very small
triangles, but 1e-5 seems to work well. This fixes SEB getting stuck for
a ridiculously small (for quake) triangle in ad_tears (probably resulted
from some bad math in qfbsp when generating the portal file from the
bsp).
I knew counting bits individually was slow, but it never really mattered
until now. However, I didn't expect such a dramatic boost just by going
to mapping bytes to bit counts. 16-bit words would be faster still, but
the 64kB lookup table would probably start hurting cache performance,
and 32-bit words (4GB table) definitely would ruin the cache. The
universe isn't big enough for 64-bits :)
Having set_expand exposed is useful for loading data into a set.
However, it turns out there was a bug in its size calculation in that
when the requested set size was a multiple of SET_BITS (and greater than
the current set size), the new set size one be SET_BITS larger than
requested. There's now some tests for this :)
This reduces the overhead needed to manage the memory blocks as the
blocks are guaranteed to be page-aligned. Also, the superblock is now
alllocated from within one of the memory blocks it manages. While this
does slightly reduce the available cachelines within the first block (by
one or two depending on 32 vs 64 bit pointers), it removes the need for
an extra memory allocation (probably via malloc) for the superblock.
This failed with errors such as:
from ./include/QF/simd/vec4d.h:32,
from libs/util/simd.c:37:
./include/QF/simd/vec4d.h: In function ‘qmuld’:
/usr/lib/gcc/x86_64-pc-linux-gnu/10.3.0/include/avx2intrin.h:1049:1: error: inlining failed in call to ‘always_inline’ ‘_mm256_permute4x64_pd’: target specific option mismatch
1049 | _mm256_permute4x64_pd (__m256d __X, const int __M)
Fuzzy bsearch is useful for finding an entry in a prefix sum array
(value is >= ele[0], < ele[1]), and the reentrant version is good when
data needs to be passed to the compare function. Adapted from the code
used in pr_resolve.
A bit of a mess for optimized vs unoptimized, but the tests acknowledge
the differences in precision while checking that the code produces the
right results allowing for that precision.
It seems that i686 code generation is all over the place reguarding sse2
vs fp, with the resulting differences in carried precision. I'm not sure
I'm happy with the situation, but at least it's being tested to a
certain extent. Not sure if this broke basic (no sse) i686 tests.
GCC does a fairly nice job of producing code for vector types when the
hardware doesn't support SIMD, but it seems to break certain math
optimization rules due to excess precision (?). Still, it works well
enough for the core engine, but may not be well suited to the tools.
However, so far, only qfvis uses vector types (and it's not tested yet),
and tools should probably be used on suitable machines anyway (not
forces, of course).
I don't know that the cache line size is 64 bytes on 32 bit systems, but
it should be ok to assume that 64-byte alignment behaves well on systems
with smaller cache lines so long as they are powers of two. This does
mean there is some waste on 32-bit systems, but it should be fairly
minimal (32 bytes per memblock, which manages page sized regions).
This seems to be pretty close to as fast as it gets (might be able to do
better with some shuffles of the negation constants instead of loading
separate constants).
Care needs to be taken to ensure the right function is used with the
right arguments, but with these, the need to use qconj(d|f) for a
one-off inverse rotation is removed.
I think the sub-line allocator falling over is the final source of
qfvis's leaks. It certainly causes a mess of the sub-lines. But having
some tests to get working sure beats scratching my head over qfvis :)
The idea is to not search through blocks for an available allocation.
While the goal was to speed up allocation of cache lines of varying
cluster sizes, it's not enough due to fragmentation.