Most of the set ops were always endian-agnostic since they were simply
operating on multiple bits in parallel, but individual element
add/remove/test was very endian-dependent. For the most part, this
didn't matter, but it does matter very much when loading external data
into a set or writing the data out (eg, for PVS).
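A minimal illustration of the difference (illustrative code, not the actual set implementation): with a word-backed set, which byte holds a given bit depends on host byte order, while byte indexing gives the same layout everywhere, which is what matters when the bits come straight from (or go to) on-disk PVS data.

    #include <stdint.h>

    /* word-backed: bit n lands in a byte that depends on host endianness */
    static inline void
    word_bit_add (uint32_t *words, unsigned n)
    {
        words[n >> 5] |= 1u << (n & 31);
    }

    /* byte-indexed: same byte layout on any host */
    static inline void
    byte_bit_add (uint8_t *bytes, unsigned n)
    {
        bytes[n >> 3] |= 1u << (n & 7);
    }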
Now that only 3852 clusters need to be checked for each cluster, fat-pvs
construction for ad_tears completes in about 0.7s, most of which seems
to be loading, conversion, compression and writing. O(N^3) cuts both
ways (hurts like crazy when N increases, does wonders when N decreases,
especially by a factor of 25). And then throw in improved cache
performance...
I suspect having an off-line compiler is still useful, but even if
qfvis's implementation never actually gets used, if cluster
reconstruction is put in the engine, large maps will be feasible even
for quakeworld. The reduced memory requirements alone will be a huge
benefit (~3GB down to 1.8MB).
This is only the first half (vertical) in that the vis bits are still
for the leafs rather than the clusters, but ad_tears goes from 500s to
7s for calculating the fat pvs (3852 clusters).
While this doesn't give as much of a boost as does basic sphere culling
(since it's just culling sphere tests), it took ad_tears' base vis from
1000s to 720s on my machine.
Attempting to vis ad_tears drags a few lurking bugs out of
SmallestEnclosingBall_vf: poor calculation of 2-point affine space, poor
handling of duplicate points and dropped support points, poor
calculation of the new center (related to duplicate points), and
insufficient iterations for large point sets. qfvis (modified for
cluster spheres) now loads ad_tears.
This removes the last of the arbitrary limits from qfvis. The goal is
not so much supporting crazy maps, but more about better data usage
(cluster_t is now 24 (or 16) bytes instead of 1048 (or 528)). And
passages isn't used (yet?)...
As per usual, fp math finds a way to confound any epsilon test. So
rather than relying entirely on test_support_points, check the distance
from the sphere center to the affine point and break out of the loop if
the distance is small enough (< 1% of the current radius). This allows
qfvis to load ad_tears without hacks.
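A rough sketch of that check (illustrative names, not the actual SmallestEnclosingBall_vf code):

    #include <math.h>

    /* break out of the SEB loop once the affine point is within 1% of the
       current radius of the sphere center, rather than relying entirely on
       test_support_points */
    static int
    seb_converged (const double center[3], const double affine[3], double radius)
    {
        double dx = affine[0] - center[0];
        double dy = affine[1] - center[1];
        double dz = affine[2] - center[2];
        return sqrt (dx * dx + dy * dy + dz * dz) < 0.01 * radius;
    }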
Scaling the checks by 1e-6 was a little too tight for very small
triangles, but 1e-5 seems to work well. This fixes SEB getting stuck for
a ridiculously small (for quake) triangle in ad_tears (which probably
resulted from some bad math in qfbsp when generating the portal file
from the bsp).
It turns out cmem is not so good for many large allocations (probably a
bug in handling the blocks), but was really meant for lots of little
churning allocations anyway. After an analysis of winding lifetimes, it
became clear that the hunk allocator would work very well. The base
windings are allocated from a global hunk (currently 1GB, plenty for
even ad_tears), and ephemeral windings are allocated from a per-thread
hunk of 1MB (seems to be way more than enough: gmsp3v2 uses a maximum of
only 56064 bytes, and ad_tears got through 30% before I gave up on it).
Any speed difference (for gmsp3v2) seems to be lost in the noise: still
completing in 38.4s on my machine.
For now, the functions check for a null hunk pointer and use the global
hunk (initialized via Memory_Init) if necessary. However, Hunk_Init is
available (and used by Memory_Init) to create a hunk from any arbitrary
memory block. So long as that block is 64-byte aligned, allocations
within the hunk will remain 64-byte aligned.
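A hedged sketch of that pattern; the prototypes below are my assumptions about the interface, not a verbatim copy of the actual API.

    #include <stdlib.h>
    #include <stddef.h>

    typedef struct memhunk_s memhunk_t;              /* opaque here */
    memhunk_t *Hunk_Init (void *buf, size_t size);   /* assumed signature */
    void *Hunk_Alloc (memhunk_t *hunk, size_t size); /* assumed: null -> global hunk */

    #define THREAD_HUNK_SIZE (1024 * 1024)

    static void
    per_thread_hunk_example (void)
    {
        /* 64-byte aligned backing block so allocations within the hunk
           remain 64-byte aligned */
        void       *buf = aligned_alloc (64, THREAD_HUNK_SIZE);
        memhunk_t  *hunk = Hunk_Init (buf, THREAD_HUNK_SIZE);

        void       *ephemeral = Hunk_Alloc (hunk, 256); /* per-thread hunk */
        void       *base = Hunk_Alloc (0, 256);         /* null hunk -> global hunk */
        (void) ephemeral; (void) base;
    }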
I need to write some automated tests for this (and for reading, of
course), but one- and two-byte outputs look correct. Kind of sad it took sixteen
years to get around to attempting to use the code :(
The output fat-pvs data is the *difference* between the base pvs and fat
pvs. This currently makes for about 64kB savings for marcher.bsp, and
about 233MB savings for ad_tears.bsp (or about 50% (470.7MB->237.1MB)).
I expect using utf-8 encoding for the run lengths to make for even
bigger savings (the second output fat-pvs leaf of marcher.bsp is all 0s,
or 6 bytes in the file, which would reduce to 3 bytes using utf-8).
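A minimal sketch of the difference encoding (illustrative names only); since the fat pvs is a superset of the base pvs, xor is equivalent to fat & ~base, so rows that add little to the base become mostly zero bytes, which the run-length encoding then collapses.

    #include <stddef.h>
    #include <stdint.h>

    static void
    fatpvs_delta (const uint8_t *base, const uint8_t *fat, uint8_t *out, size_t bytes)
    {
        for (size_t i = 0; i < bytes; i++)
            out[i] = base[i] ^ fat[i];  /* 1 bits mark leafs added by the fat pvs */
    }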
Mod_DecompressVis_set (via Mod_LeafPVS_set) can be used to recycle pvs
sets, but the set may have been set to everything at some stage, which
is implemented by inverting the set (making the set infinite) and having
1-bits remove elements from the set. This is most definitely not wanted
for pvs :)
Currently undecided what to do about Mod_DecompressVis_mix, thus the
fixme.
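To illustrate why that matters (illustrative code, not the actual set_t): with an inverted set, membership is complemented, so the bits coming out of the decompressor would remove elements instead of adding them.

    #include <stdint.h>

    typedef struct {
        int      inverted;     /* set represents the complement of its bits */
        uint32_t bits[4];      /* illustrative fixed size */
    } tiny_set_t;

    static int
    tiny_is_member (const tiny_set_t *set, unsigned x)
    {
        int bit = (set->bits[x >> 5] >> (x & 31)) & 1;
        return set->inverted ? !bit : bit;
    }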
Fixes the flickering lights in any map where the camera is out of the
map for a single frame (eg, start.bsp, The Catacombs (hipnotic, hip2m3)).
I knew counting bits individually was slow, but it never really mattered
until now. However, I didn't expect such a dramatic boost just from
switching to a table mapping bytes to bit counts. 16-bit words would be faster still, but
the 64kB lookup table would probably start hurting cache performance,
and 32-bit words (4GB table) definitely would ruin the cache. The
universe isn't big enough for 64-bits :)
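A sketch of the byte-wise counting (illustrative, not the actual table setup): one 256-entry table maps each byte value to its bit count.

    #include <stddef.h>
    #include <stdint.h>

    static uint8_t bit_counts[256];

    static void
    init_bit_counts (void)
    {
        for (int i = 0; i < 256; i++)
            bit_counts[i] = (i & 1) + bit_counts[i >> 1];
    }

    static size_t
    count_bits (const uint8_t *bytes, size_t len)
    {
        size_t  count = 0;
        while (len-- > 0)
            count += bit_counts[*bytes++];
        return count;
    }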
The fact that numleafs did not include leaf 0 caused confusion in many
places due to never being sure whether to add 1. Hopefully this fixes
some of the confusion. (and that comment in sv_init didn't last long :P)
After seeing set_size and thinking it redundant (thought it returned the
capacity of the set until I checked), I realized set_count would be a
much better name (set_count (node->successors) in qfcc does make much
more sense).
This fixes some off-by-one bugs inherited from the original
SV_CalcPHS: space was being allocated for numleafs leafs, but
numleafs does not count the world-surrounding solid leaf 0, while
SV_CalcPHS generates pvs and phs data (just the "everything" set) for
leaf 0, which is later used for broadcasts and the like.
Extremely large maps take a very long time to process their PVS sets for
PHS or shadows, so having an off-line compiler seems like a good idea.
The data isn't written out yet, and the fat pvs code may not be optimal
for cache access, but it gets through ad_tears in about 500s (12
threads, compared to 2100s single-threaded in the qw server).
Hunk_Alloc has a rather large per-block overhead. Also, small pvs sets
(64 leafs) will fit into just the basic set structure, so there's no
need to allocate data for the maps.
Modern maps can have many more leafs (eg, ad_tears has 98983 leafs).
Using set_t makes dynamic leaf counts easy to support and the code much
easier to read (though set_is_member and the iterators are a little
slower). The main thing to watch out for is that the novis set and the
set returned by Mod_LeafPVS never shrink, and may have excess elements
(ie, indicate that nonexistent leafs are visible).
Having set_expand exposed is useful for loading data into a set.
However, it turns out there was a bug in its size calculation: when
the requested set size was a multiple of SET_BITS (and greater than
the current set size), the new set size would be SET_BITS larger than
requested. There are now some tests for this :)
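The bug, sketched with an illustrative SET_BITS of 64 (not the actual set_expand code): the size should round up to a multiple of SET_BITS, but the calculation added a full extra word when the request was already a multiple.

    #define SET_BITS 64

    /* buggy: a request of 128 bits becomes 192 */
    static unsigned
    round_size_buggy (unsigned size)
    {
        return (size + SET_BITS) & ~(SET_BITS - 1);
    }

    /* fixed: a request of 128 bits stays 128 */
    static unsigned
    round_size_fixed (unsigned size)
    {
        return (size + SET_BITS - 1) & ~(SET_BITS - 1);
    }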