Because of the way the plane normal is used (front/on/back checks, and
midpoint calculation), other than possible precision, there is no need
to normalize the normal. Removing the square root and division resulted
in a huge boost: from 34s to 14 seconds. The average clusters visible
hasn't change much, and a quick check in-game didn't show any issues.
At least modern gcc produces nice code for ?: (cmov), and a SIMD
cross-product uses several fewer instructions. The cross-product shaved
off 0.5-1s, but the modulo -> ?: shaved off about 3-4s, for a total of
about 10% speedup (1.09 insn/cyc vs 1.01 insn/cyc, so even perf agrees).
For int, long, float and double. I've been meaning to add them for a
while, and they're part of the new Ruamoko instructions set (which is
progressing nicely).
The portal flow stack nodes contain a simd vector, which requires
16-byte alignment. However, on 32-bit Windows, malloc returns 8-byte
aligned memory, leading to eventual segfaults. Since pstack_t is 48
bytes on 32-bit systems, it fits nicely into a 64-byte aligned cache
line (or two on 64-bit systems due to being 80 bytes).
This removes the last of the arbitrary limits from qfvis. The goal is
not so much supporting crazy maps, but more about better data usage
(cluster_t is now 24 (or 16) bytes instead of 1048 (or 528). And
passages isn't used (yet?)...
It turns out cmem is not so good for many large allocations (probably a
bug in handling the blocks), but was really meant for lots of little
churning allocations anyway. After an analysis of winding lifetimes, it
became clear that the hunk allocator would work very well. The base
windings are allocated from a global hunk (currently 1GB, plenty for
even ad_tears), and ephemeral windings are allocated from a per-thread
hunk of 1MB (seems to be way more than enough: gmsp3v2 uses a maximum of
only 56064 bytes, and ad_tears got through 30% before I gave up on it).
Any speed difference (for gmsp3v2) seems to be lost in the noise: still
completing in 38.4s on my machine.
While whether it's any faster is debatable (it's slightly slower, but
many more portals are being tested due to different rounding in the base
vis stage), it's certainly easier to read.
While the main bulk of the improvement (36s down from 42s for
gmsp3v2.bsp on my i7-6850K) comes from using a high-tide allocator for
the windings (which necessitated using a fixed size), it is ever so
slightly faster than using malloc as the back-end.
There's still some cleanup to do, but everything seems to be working
nicely: `make -j` works, `make distcheck` passes. There is probably
plenty of bitrot in the package directories (RPM, debian), though.
The vc project files have been removed since those versions are way out
of date and quakeforge is pretty much dependent on gcc now anyway.
Most of the old Makefile.am files are now Makemodule.am. This should
allow for new Makefile.am files that allow local building (to be added
on an as-needed bases). The current remaining Makefile.am files are for
standalone sub-projects.a
The installable bins are currently built in the top-level build
directory. This may change if the clutter gets to be too much.
While this does make a noticeable difference in build times, the main
reason for the switch was to take care of the growing dependency issues:
now it's possible to build tools for code generation (eg, using qfcc and
ruamoko programs for code-gen).
For the most part, it's just refactoring the code so the plane creation and
testing are in separate functions, but there is one important difference:
the plane test now checks only the two points on either side of the point
used to create the plane.
Because the portal winding is guaranteed to be convex and planar, if both
points are on the plane, all points are, and if neither point is behind the
plane, no points are.a
This shaved about 5 seconds off the level 4 run using 4 threads (~198s to
~193s) and about 12s from the single threaded run (~682s to ~670s (hmm,
gained some time in recent changes)).
This reverts commit 1ea79e8626.
Conflicts:
tools/qfvis/include/vis.h
tools/qfvis/source/flow.c
I've decided to do reentrant versions of the set allocators and I didn't
particularly like the invasiveness of allocating sets this way.
The old variable names were confusing ("target" winding comes from
"portal"?), and the comments were from when I really didn't understand
concepts like separating planes. While they weren't wrong, they were quite
inadequate and I want to write new ones.
This bypasses set_new, but completely removes the use of the global lock
from within RecursiveClusterFlow. This seems to give a small speedup: 203
seconds threaded.
This was testing an idea I had to remove the plane flips. It seems to have
been good for the initial plane orientation, but was a slight slowdown for
the pass-portal test. However, this makes the code a little easier to work
with for my idea on improving the algorithm itself.
Since the stack structure in the thread data is a linked list, move the
stack blocks off the program stack and into malloced memory. More
importantly, when the stack block is allocated, the mightsee working set is
allocated too, and as neither are freed, this greatly reduces contention
for the lock. Also, because the memory is kept, single threaded time for
gmsp3v2 dropped from 695s to 670s. Threaded is now about 207s (down from
350).
While using set operators was clearer, it was rather expensive (about 25s
for gmsp3v2). qfvis now completes the map in about 695s (single threaded).
About 15s faster than tyr for the same conditions (1 thread, level 4).
This is the second part of the separator search optimization from tyrutils
vis. With this, qfvis is getting close to tyrutils vis when
running single threaded (qfvis is suffering some nasty thread contention
and thus can't get below about 350 seconds with 4 threads). 808s vs 707s.
Interesting, it makes very little (maybe faster) difference to find all the
separators for levels 3 and 4. This might be due to the higher levels using
most of the planes to fully clip source away. Anyway, it makes the code a
little clearer (one function, one task).
I think the reason I didn't think of that when I tried to improve qfvis's
performance many years ago is I just simply did not understand
ClipToSeparators. However, the difference caching the separators makes is
phenomenal. Before the change, single threaded qfvis would get stuck on one
particular portal for at least a day (I gave up waiting), but now even a
debug build will complete gmsp3v2.bsp in less than 12 minutes (4 threads on
my quad-core). And that's at level 2! Getting stuck for a day was at level
0.
While noticeably slower than the previous expanded set manipulation code,
this is much easier to read. I can worry about optimizing the set code when
I get qfvis behaving better.