
Alt build #92

Merged
jermp merged 33 commits into master from alt-build
May 6, 2026

Conversation

@jermp
Owner

@jermp jermp commented May 6, 2026

No description provided.

jermp and others added 30 commits April 10, 2026 19:04
The skew-index build used to do a random read into the strings
bitvector for every super-kmer in every heavy-load bucket
(buckets are visited in size-sorted order, so the resulting
positions are essentially random across all of strings). That
forces strings to be RAM-resident throughout step 7.2.

This commit restructures step 7.2 into three sub-steps:

  (A) walk the heavy-load buckets and emit one
      kmer_extraction_request per super-kmer, externally
      sort+flushed by starting_pos (parallel_sort within a
      bounded RAM buffer, ~1/4 of --ram-limit).

  (B) merge the sorted runs and walk strings in a single
      forward pass; for each request extract the requested
      kmers and append (kmer.bits, pos_in_bucket) to a
      per-partition tmp file. Only kmer.bits is serialized to
      avoid persisting the vptr that uint_kmer_t carries via
      its virtual destructor.

  (C) for each partition, read its tmp file, build the MPHF
      and the positions compact vector. The skew index is
      assembled partition by partition.
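The sort-then-merge pattern behind phases (A) and (B) can be sketched as follows. This is a minimal illustration, not the sshash code: the record type (`extraction_request`), the helper names, and the in-memory collection of the merged output are all ours; the real build streams the merged records forward instead of collecting them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

// Hypothetical record mirroring a kmer_extraction_request:
// only starting_pos matters for the sort order here.
struct extraction_request {
    uint64_t starting_pos;
    uint32_t bucket_id;
};

// Sort a bounded in-RAM buffer by starting_pos and flush it as one run.
static void flush_run(std::vector<extraction_request>& buf, const std::string& path) {
    std::sort(buf.begin(), buf.end(),
              [](const extraction_request& a, const extraction_request& b) {
                  return a.starting_pos < b.starting_pos;
              });
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(buf.data()),
              static_cast<std::streamsize>(buf.size() * sizeof(extraction_request)));
    buf.clear();
}

// Merge the sorted runs into one position-sorted sequence (collected
// into a vector for clarity; a real build would consume it streamingly).
static std::vector<extraction_request> merge_runs(const std::vector<std::string>& runs) {
    struct head { extraction_request rec; size_t run; };
    auto cmp = [](const head& a, const head& b) {
        return a.rec.starting_pos > b.rec.starting_pos;  // min-heap on position
    };
    std::priority_queue<head, std::vector<head>, decltype(cmp)> heap(cmp);
    std::vector<std::ifstream> in;
    for (const auto& r : runs) in.emplace_back(r, std::ios::binary);
    for (size_t i = 0; i != in.size(); ++i) {
        extraction_request rec;
        if (in[i].read(reinterpret_cast<char*>(&rec), sizeof(rec))) heap.push({rec, i});
    }
    std::vector<extraction_request> out;
    while (!heap.empty()) {
        head h = heap.top();
        heap.pop();
        out.push_back(h.rec);
        extraction_request rec;
        if (in[h.run].read(reinterpret_cast<char*>(&rec), sizeof(rec))) heap.push({rec, h.run});
    }
    return out;
}
```

The point of the pattern is that RAM holds only one sort buffer at a time, while the merged stream comes out globally sorted by `starting_pos`.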

The access pattern over strings is now monotonically
non-decreasing, which is the precondition for moving strings
itself out of RAM in a follow-up change. Correctness verified
via `sshash build --check` on salmonella_enterica (m=7),
canonical mode, and salmonella_100 with 4 threads.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The strings bit-vector is the largest in-RAM structure during
construction. This commit moves it onto disk for the build
phase, replacing bits::bit_vector::builder with a new
disk_backed_strings storage:

  - append_bits() during step 1 (encode_strings) writes
    completed words to a tmp file, keeping only a small write
    window in RAM (~512 KiB by default).

  - After freeze(), make_reader() returns a forward-monotonic
    reader over the file with a small read window (~512 KiB).
    Each reader owns its own ifstream so multiple threads can
    read concurrently without contention on the writer.

  - The reader exposes get_word64(pos) const, matching the
    interface that kmer_iterator<Kmer, BitVector> expects.

Wiring:
  - compute_minimizer_tuples (step 2): each thread instantiates
    its own reader; per-thread RSS contribution is bounded by
    its window size, not by the strings size.
  - build_sparse_and_skew_index (step 7.1): the in-RAM
    d.m_spss.strings is no longer populated here; step 7.2
    phase (B) reads via a reader instead.
  - A new step 8 materializes d.m_spss.strings from the on-disk
    file, immediately before the standard essentials::save path.
    This brings strings briefly back into RAM at the very end of
    the build; eliminating that final peak requires a streaming
    save path, which is a separate change.

Cleanup: the dictionary_builder destructor removes the strings
tmp file if it still exists (covers exception paths). The hash
in step 5's RAM_taken_in_bytes calculation no longer counts
strings (its in-RAM footprint is now just the writer window).
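The core of the `append_bits()` mechanism described above can be sketched as a word-spilling bit writer. This is a simplified illustration under our own naming; the real disk_backed_strings additionally provides the windowed reader side, and the window size and file layout here are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch: pack appended bits into 64-bit words; completed words go into a
// bounded in-RAM window that is spilled to a tmp file when full, so RSS is
// bounded by the window, not by the total number of bits written.
struct disk_bit_writer {
    std::ofstream out;
    std::vector<uint64_t> window;  // bounded write window
    uint64_t cur = 0;              // word currently being filled
    uint64_t bits_in_cur = 0;      // valid low bits in `cur` (< 64)
    uint64_t num_bits = 0;
    size_t window_capacity;

    disk_bit_writer(const std::string& path, size_t window_words)
        : out(path, std::ios::binary), window_capacity(window_words) {}

    // Append the low `len` bits of `bits` (caller zeroes bits above len).
    void append_bits(uint64_t bits, uint64_t len) {
        assert(len <= 64);
        cur |= (len ? bits << bits_in_cur : 0);
        if (bits_in_cur + len >= 64) {
            push_word(cur);
            uint64_t used = 64 - bits_in_cur;       // bits consumed into cur
            cur = (used < len) ? bits >> used : 0;  // leftover high bits
            bits_in_cur = bits_in_cur + len - 64;
        } else {
            bits_in_cur += len;
        }
        num_bits += len;
    }

    void freeze() {  // flush the partial word and the window
        if (bits_in_cur) { push_word(cur); cur = 0; bits_in_cur = 0; }
        flush_window();
        out.flush();
    }

private:
    void push_word(uint64_t w) {
        window.push_back(w);
        if (window.size() == window_capacity) flush_window();
    }
    void flush_window() {
        out.write(reinterpret_cast<const char*>(window.data()),
                  static_cast<std::streamsize>(window.size() * sizeof(uint64_t)));
        window.clear();
    }
};
```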

Verified via `sshash build --check` on:
  - salmonella_enterica m=7 (heavy skew index, regular and
    canonical),
  - salmonella_100 m=11 -t 4 (multi-thread, 13M kmers).
Save -> load -> query roundtrip on the same dataset matches
100% on the canonical query set.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Introduces a streaming-save build path that writes the entire
dictionary to disk without ever materializing the strings
bit-vector in RAM. Combined with the disk-backed strings
storage from the previous commit, this means the strings live
exclusively on disk for the lifetime of the build, eliminating
the final RAM peak that step 8 previously reintroduced just
before essentials::save.

Mechanism:
  - disk_backed_strings::save_to(ostream&) emits the same byte
    layout as bits::bit_vector's serialization (uint64_t
    m_num_bits, size_t n, then n*8 bytes), reading the words
    from the tmp file in 64 KiB chunks.

  - streaming_strings_saver wraps essentials::generic_saver and
    overrides visit() for bits::bit_vector: when the visited
    instance matches a known address (d.m_spss.strings), the
    streaming serializer is invoked; everything else goes
    through the normal essentials path. Address matching avoids
    introducing a marker type into bits::bit_vector.
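The header-then-chunked-payload idea behind `save_to` can be sketched as below. The exact field layout of `bits::bit_vector`'s serialization is assumed from the description above (num_bits, word count, then raw words); the function name and chunk handling are ours.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: emit a bit-vector-style header followed by the raw words,
// copied from a tmp file in fixed-size chunks so the payload is never
// fully resident in RAM.
static void stream_save_bits(const std::string& tmp_words_file, uint64_t num_bits,
                             std::ostream& out, size_t chunk_bytes = 64 * 1024) {
    uint64_t n = (num_bits + 63) / 64;  // number of 64-bit words
    out.write(reinterpret_cast<const char*>(&num_bits), sizeof(num_bits));
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    std::ifstream in(tmp_words_file, std::ios::binary);
    std::vector<char> chunk(chunk_bytes);
    uint64_t remaining = n * 8;
    while (remaining != 0) {
        size_t take = remaining < chunk.size() ? static_cast<size_t>(remaining) : chunk.size();
        in.read(chunk.data(), static_cast<std::streamsize>(take));
        out.write(chunk.data(), static_cast<std::streamsize>(take));
        remaining -= take;
    }
}
```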

API:
  - dictionary_builder::build() (existing) materializes strings
    in RAM at the end so the dictionary is query-ready
    (--check, etc.).
  - dictionary_builder::build_streaming_save() runs steps 1-7
    and stream-saves directly to the output file, leaving
    d.m_spss.strings empty. The dictionary is *not*
    query-ready after this; reload from disk to query.
  - dictionary::build_streaming_save() exposes the new flow.

tools/build.cpp uses the streaming-save path automatically when
-o is given without --check; otherwise it falls back to the
materializing path so --check can run queries against d.

Verified by building the same input via both paths and
diff'ing the output file: byte-identical for regular,
--canonical, --weighted, and 4-thread builds. Loading the
streaming-saved file via essentials::load and running the
standalone `sshash check` and `sshash query` tools
returns "EVERYTHING OK!" and 100% positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previously step 7.1 copied every minimizer tuple of every
non-singleton bucket into an in-RAM vector `tuples`, just to
give bucket_type a contiguous backing store. The mmap'd `input`
already provides exactly that, so the copy was pure overhead
(~18 B per super-kmer in non-singleton buckets, scaling with
the input).

Now bucket_type stores raw minimizer_tuple pointers into
input.data() directly, and `input` is kept open through step
7.2 phase (A) (which is the last consumer). After phase (A)
both `buckets` and `input` are released.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
…ry MPHF

Phase C used to materialize the full per-skew-partition working
set in RAM:
  - kmers vector (~16 B/kmer, default uint64_t-backed kmer)
  - positions_in_bucket vector (~4 B/kmer)
  - cvb_positions builder (~num_bits_per_pos / 8 B/kmer, the
    actual stored output)

For a partition with N kmers this peaked at ~21 N bytes (e.g.
20 GB for a 1 B-kmer partition). The kmers and
positions_in_bucket vectors were redundant in-RAM copies of
data already on disk in the per-partition tmp file written by
phase (B).

This commit replaces them with two streaming passes over the
tmp file:

  (1) MPHF build via pthash's `build_in_external_memory` driven
      by a small forward iterator (`skew_partition_kmer_iterator`)
      that reads `(kmer.bits, pos_in_bucket)` records via a
      shared_ptr<ifstream>. pthash spills hashes to
      `tmp_dirname` under a `--ram-limit / 2` RAM budget rather
      than holding all keys + hashes simultaneously.

  (2) A second sequential pass over the same tmp file fills
      cvb_positions: for each `(kmer, pib)` record it sets
      cvb_positions[F(kmer)] = pib.

Only cvb_positions itself stays resident through both passes,
and it's the actual stored output (not a transient).

The iterator must be copyable because pthash's
`build_in_external_memory` takes the iterator by value; the
shared_ptr<ifstream> means copies share the underlying stream
state. After the build call returns the original at the call
site is unused, so the shared advancement is harmless.
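The copyable-by-value contract can be illustrated with a stripped-down iterator; the record type and names here are ours, but the shared_ptr mechanism is the one described above: copies are cheap and all advance the same underlying stream.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <memory>
#include <string>

// Sketch: an input iterator that is copyable because the stream lives
// behind a shared_ptr. Copies share stream position, so advancing any
// copy consumes records for all of them.
struct shared_stream_iterator {
    std::shared_ptr<std::ifstream> in;
    uint64_t value = 0;

    explicit shared_stream_iterator(const std::string& path)
        : in(std::make_shared<std::ifstream>(path, std::ios::binary)) {
        ++(*this);  // load the first record
    }
    uint64_t operator*() const { return value; }
    shared_stream_iterator& operator++() {
        in->read(reinterpret_cast<char*>(&value), sizeof(value));
        return *this;
    }
};
```

Note the asymmetry this creates: each copy caches its own `value`, but the file cursor is shared, which is exactly why the shared advancement is harmless only when the original iterator is not reused after the build call.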

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Removes the last mmap from the build hot path. The merged
minimizers file used to be opened via mm::file_source and
walked twice — once for stats and once to populate a
size-sorted in-RAM `buckets` vector — and then walked again
in step 7.2 phase (A) via pointers into the mmap. The recent
"point bucket_type into the mmap" change pushed even more
work onto the page cache, which violates the user's hard RAM
budget (mmap pages can be resident up to file size when the
machine has enough memory).

This commit replaces all of that with two sequential
std::ifstream passes (no mmap) over the merged minimizers
file:

  Pass 1 (stats): unchanged in spirit. Uses a new
    `streaming_minimizer_bucket_reader` that buffers one
    bucket at a time (peak ~ max_bucket_size * 18 B). Feeds
    `buckets_statistics` exactly as before.

  Pass 2 (combined sparse + heavy + emit kmer requests):
    folds the former step 7.1 main pass and step 7.2 phase
    (A) into a single bucket-by-bucket loop. The size-sorted
    iteration over `buckets` is gone; instead:

      - `begin_buckets_of_size[s]` is precomputed from the
        bucket-size histogram (new accessor
        `buckets_statistics::num_buckets_of_size`),
      - mid_load positions are written via per-size cursors
        using `compact_vector::builder::set` instead of
        `push_back`,
      - heavy_load positions are appended in file order via a
        single monotone cursor,
      - heavy buckets emit kmer-extraction requests in-line.

The `buckets` vector and the entire `tuples` array are gone.
No memory mapping anywhere in step 7. RAM footprint of step
7.1 is now bounded by:
  - max_bucket_size * 18 B (one bucket at a time),
  - the sparse-index builders being assembled (proportional
    to non-singleton positions, bits-packed),
  - the kmer-extraction request buffer (~ ram_limit / 4).

Output bytes differ from the previous commit because the
ordering of positions inside mid_load_buckets and
heavy_load_buckets is now file order instead of size-sorted
order; the codewords are updated to match. The index is
self-consistent: full --check passes on regular, --canonical,
multi-thread (-t 4), and --weighted builds, and a streaming-
save round-trip via `sshash check` and `sshash query`
returns "EVERYTHING OK!" and 100% positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
This commit eliminates the last three mm::file_source usages
in the build, so the only mmap left in step 7 is inside
pthash's external-memory PHF builder (which manages its own
RAM via config.ram).

(1) file_merging_iterator: each input run is now read with a
    bounded buffered std::ifstream (default 4096 records per
    stream) instead of mm::file_source. The winner-tree merge
    logic is unchanged; comparisons just use the in-RAM
    buffer's current value rather than a pointer into mmap'd
    memory. RSS for the merge is now bounded by
    `num_runs * buffer_records * sizeof(T)` regardless of run
    sizes.

(2) minimizers_tuples::merge: the post-rename single-file
    count path used to mmap the merged file via
    mm::file_source and walk it with minimizers_tuples_iterator.
    It now uses streaming_minimizer_bucket_reader (hoisted
    from build_sparse_and_skew_index.cpp into util.hpp) for a
    pure ifstream pass.

(3) dictionary_builder::build_mphf: replaces mm::file_source
    + minimizers_tuples_iterator with a new
    streaming_minimizers_iterator that yields each distinct
    minimizer once via std::ifstream. The iterator is
    copyable (shared_ptr<ifstream>), as required by pthash's
    by-value `build_in_external_memory` signature.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds, plus
a streaming-save round-trip (sshash check + sshash query)
returning "EVERYTHING OK!" on all five suites and 100%
positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The strings_offsets_builder used to be an in-RAM
std::vector<uint64_t> of size num_strings + 1, finalized into
d.m_spss.strings_offsets at step 7.1. For inputs with many
strings this is non-trivial RSS (8 B per string), held
through steps 1-7.

This commit replaces it with a disk-backed
disk_backed_offsets_builder<Offsets>:

  - push_back() during step 1 spills to a tmp file under a
    small in-RAM write buffer (~32 KiB),
  - compute_minimizer_tuples (step 2) opens one
    `reader` per thread, positioned at the thread's
    index_begin, and walks forward sequentially via a
    bounded read buffer (~32 KiB / thread). The per-thread
    `[i]`, `[i+1]` access pattern is replaced with a single
    rolling `prev_offset = next()` cursor.
  - encode(offset, begin, string_id) is now a pure const
    method on the disk-backed builder (depends only on
    m_nb and m_size), preserving multi-threaded safety.
  - build(target) at step 7.1 streams the file's contents
    via a copyable forward iterator into target.m_seq's
    encode/build, so neither side materializes the offsets
    in RAM. The on-disk file is removed after build.

To keep `target.m_seq` accessible from the external builder
without exposing it broadly, `offsets<Seq>` befriends
`disk_backed_offsets_builder<...>` via a templated friend
declaration; the concrete decoded_offsets / encoded_offsets
types inherit that friendship.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, and a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The four largest in-RAM "stored output" structures during the
build are the sparse-index compact_vectors (control_codewords,
mid_load_buckets, heavy_load_buckets) and the per-skew-partition
cvb_positions. For huge inputs these can each be many GB. This
commit makes the build never materialize them in RAM:

  - new streaming_compact_vector_writer: writes the same byte
    layout as bits::compact_vector::visit_impl (size, width,
    mask, owning_span<u64>) directly to a file via a 2-word
    rolling window, accepting set(index, value) in monotonic
    index order. Matches the +1 padding word that the in-RAM
    builder allocates so on-disk bytes are identical.

  - control_codewords: indices are mphf(minimizer), and the
    merged minimizers file is sorted by mphf hash, so writes
    during the combined pass are strictly monotonic. Streamed
    directly via the writer; no external sort.

  - heavy_load_buckets: single monotone cursor, also streamed
    directly via the writer.

  - mid_load_buckets: per-size cursors interleave across the
    pass, so each size class's positions are written to its
    own raw-uint64 tmp file (monotonic within size). After the
    combined pass, the per-size files are streamed in size
    order through a streaming_compact_vector_writer to assemble
    the final mid_load_buckets file.

  - cvb_positions per skew partition: writes are random by
    F(kmer). After the partition's MPHF is built, a second
    pass over the partition's kmer file emits
    (F(kmer), pos_in_bucket) tuples through the existing
    parallel_sort + flush + file_merging_iterator external-sort
    machinery; the sorted stream feeds a per-partition
    streaming_compact_vector_writer.
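The rolling-window packing at the heart of the streaming writer can be sketched as follows. This is a simplified model under our own names: values of fixed bit width, accepted in monotonically increasing index order, are packed into 64-bit words, and a word is emitted as soon as no future value can touch it. We emit into a vector here for testability; the real writer streams to a file and matches the exact compact_vector byte layout, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: two-word rolling window for monotonic set(index, value).
// Values are assumed pre-masked to `width` bits.
struct streaming_packer {
    uint64_t width;
    uint64_t w0 = 0, w1 = 0;    // rolling window
    uint64_t base_word = 0;     // word index of w0
    std::vector<uint64_t> out;  // stand-in for the output file

    explicit streaming_packer(uint64_t width_bits) : width(width_bits) {}

    void set(uint64_t index, uint64_t value) {  // indices must be monotonic
        uint64_t bit = index * width;
        uint64_t word = bit / 64, off = bit % 64;
        while (word > base_word) {  // w0 can no longer change: emit it
            out.push_back(w0);
            w0 = w1;
            w1 = 0;
            ++base_word;
        }
        w0 |= value << off;
        if (off + width > 64) w1 |= value >> (64 - off);  // straddles into next word
    }
    void flush() {  // emit the window tail
        out.push_back(w0);
        out.push_back(w1);
    }
};
```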

The streaming saver is extended to take a substitution map
keyed by `bits::compact_vector const*`. dictionary_builder
populates the dictionary's compact_vector slots with empty
placeholders, takes their addresses, and registers each
spilled tmp file. The save pass copies bytes from each tmp
file at the matching visit slot.

For the materializing `build()` flow (used by --check), a new
`materialize_spilled_into(d)` step re-loads each spilled tmp
file back into the dictionary's in-RAM compact_vectors via
essentials::loader, so queries work afterward. This brings
the RAM peak back briefly at the very end (acceptable since
--check inherently needs the full index in RAM).

`build_streaming_save()` never materializes; the spilled tmp
files are concatenated into the output by the saver and then
removed.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, plus a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
No tmp file leaks.

Remaining proportional-to-input items in RAM during build:
the codewords MPHF (step 4) and the per-skew-partition MPHFs
(step 7.2 phase C). pthash returns these as in-memory
structs; spilling them would require pthash changes or an
intermediate save/load step. Everything else is now bounded
by the explicit --ram-limit.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The accumulating in-RAM MPHFs were the dominant resident
memory at the end of phase C. For the HPRC k=63 m=31 canonical
benchmark (5.9 B kmers, ~5.4 B in skew at ~3 bits/key) they
sum to ~2 GB held simultaneously through save.

This commit spills them to disk and concatenates at save:

  - Step 5 (hash_minimizers): the codewords MPHF is no longer
    needed after the minimizer values are remapped. Save it to
    a tmp file via essentials::save and default-assign the
    in-RAM struct to free it.

  - Step 7.2 phase C: after each partition's cvb_positions has
    been written to disk, save that partition's MPHF to a tmp
    file and default-assign it. Subsequent partitions never
    coexist with prior partitions' MPHFs.

  - The streaming saver gains an address+type-keyed
    substitution map (typed_address_sub). Type discrimination
    is necessary because in C++ a struct's address coincides
    with the address of its first member when the struct has
    standard layout, so address alone is ambiguous (visiting
    sparse_and_skew_index would otherwise match a substitution
    registered for its first member's first member, the
    codewords MPHF).

  - dictionary_builder uses a `register_sub<T>` helper that
    captures the type via typeid for each registration. The
    saver only fires the substitution when both address and
    std::type_index(typeid(T)) match.

  - Materializing build() flow loads each spilled MPHF back
    via essentials::loader so queries work afterward.
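The address-ambiguity problem and its type-discriminated fix can be demonstrated concretely. The names `typed_address_sub` and `register_sub` follow the commit; the body below is our own minimal sketch, mapping each (address, type) key to a tmp-file name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Sketch: substitutions keyed by (pointer, std::type_index). Address
// alone is ambiguous because a standard-layout struct shares its address
// with its first member; pairing with the type disambiguates.
struct typed_address_sub {
    std::map<std::pair<const void*, std::type_index>, std::string> subs;

    template <typename T>
    void register_sub(const T* addr, std::string tmp_file) {
        subs[{addr, std::type_index(typeid(T))}] = std::move(tmp_file);
    }
    template <typename T>
    const std::string* find(const T* addr) const {
        auto it = subs.find({addr, std::type_index(typeid(T))});
        return it == subs.end() ? nullptr : &it->second;
    }
};
```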

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, plus a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
No tmp file leaks.

After this commit, the proportional-to-input items in RAM
during build are bounded by --ram-limit:
  - pthash external-memory builder (capped at ram_limit/2),
  - one current MPHF being built (~bits/key * partition_size),
  - per-step external-sort buffers (capped),
  - the stored compact_vectors, MPHFs, etc. all spill.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
pthash's `partitioned_phf::build` builds the partitioned MPHF's
sub-partitions in parallel, with `mphf_build_config.num_threads`
sub-partitions running simultaneously. Each per-partition build
allocates a `pairs_t` of ~`avg_partition_size * 16 B` (~48 MB
with the default avg_partition_size = 3M); with -t 64 that
balloons to ~3 GB of pthash internal memory, dominating the
build's RSS regardless of how aggressively sshash spills its
own structures.

Add `util::cap_mphf_num_threads(requested, ram_limit_in_GiB)`:
budget ~ram_limit/4 GiB for pthash's per-partition build memory
and conservatively assume 64 MiB per parallel sub-partition
(48 MB pairs_t + sort temporary + slack). Apply at the two
pthash build sites (step 4 codewords MPHF, step 7.2 phase C
skew partition MPHFs).

For -g 2 -t 64 this caps pthash to 8 threads (1 GiB pthash
budget over 64 MiB per thread = 16, further reduced by the /4
budget fraction to 8). Build time may increase modestly; peak RSS
should drop well below the previous ~3.3 GB toward the 2 GiB
target.

Verified byte-identical output on salmonella_enterica m=7,
plus full --check on regular, --canonical, multi-thread (-t 4),
and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The previous cap formula (ram_limit/4) was too aggressive — it
fired even at default -g 8 -t 64 where capping isn't needed.
That silently changed pthash's parallelism on configurations
the user didn't intend to be tight, breaking expectations of
"-t means t threads".

Loosen the budget to ram_limit/2: leave roughly half of
--ram-limit available to pthash's parallel sub-partition build
memory (the other half covers the streaming buffers, external
sort buffers, etc.). Cap pthash threads only when the user's
-t would push pthash past that half-budget. Common cases now
pass through unchanged:

  -g 8  -t  64    no cap (4 GiB / 64 MiB = 64 >= 64)
  -g 4  -t  32    no cap
  -g 2  -t  16    no cap
  -g 16 -t 128    no cap
  -g 2  -t  64    cap to 16 (pathological: tight budget vs many threads)
  -g 4  -t  64    cap to 32

When the cap fires, log a clear warning naming the MPHF, the
requested thread count, the cap, and the budget. Verbose mode
only.

Verified byte-identical output and full --check matrix still
pass.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previous approach quietly capped pthash's num_threads under
tight --ram-limit, which broke the user's expectation that "-t
N means N threads". The actual lever pthash exposes for
per-thread memory is `avg_partition_size`: per-partition build
memory is roughly `avg_partition_size * sizeof(pair)` bytes.

Replace the unconditional thread cap with a configuration that
divides the pthash RAM budget (half of --ram-limit) evenly
across the user's requested threads, then derives an
`avg_partition_size` for that per-thread budget:

  - per_thread_budget = (ram_limit / 2) / num_threads
  - avg = per_thread_budget / per_key_estimate (32 B)
  - if avg >= default (3M):       use default; -t honored
  - elif avg >= floor (100K):     use avg; -t honored
  - else:                         cap -t so floor fits, warn

The warning only fires in the pathological case (so many
threads at so little RAM that even pthash's quality floor
can't fit). Common configurations pass through unchanged:

  -t 1   -g 8     no change (default partition 3M)
  -t 64  -g 8     -t honored, partition 2M
  -t 64  -g 4     -t honored, partition 1M
  -t 64  -g 2     -t honored, partition 524K
  -t 256 -g 1     warn + cap to 167 (-t couldn't be honored)
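The derivation above can be written out as straight-line arithmetic. The constants (3M default, 100K floor, 32 B per key, half of --ram-limit as the pthash budget) are taken from the commit text; the function name and return shape are our own sketch, not the sshash API.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the per-thread partition-size derivation described above.
struct mphf_tuning {
    uint64_t num_threads;
    uint64_t avg_partition_size;
    bool capped;
};

static mphf_tuning tune_mphf(uint64_t requested_threads, uint64_t ram_limit_bytes) {
    const uint64_t kDefault = 3'000'000;  // pthash default avg_partition_size
    const uint64_t kFloor = 100'000;      // quality floor
    const uint64_t kBytesPerKey = 32;     // per-key build-memory estimate
    uint64_t per_thread_budget = (ram_limit_bytes / 2) / requested_threads;
    uint64_t avg = per_thread_budget / kBytesPerKey;
    if (avg >= kDefault) return {requested_threads, kDefault, false};  // -t honored
    if (avg >= kFloor) return {requested_threads, avg, false};         // -t honored
    // pathological: cap -t so the quality floor still fits, and warn
    uint64_t capped = (ram_limit_bytes / 2) / (kFloor * kBytesPerKey);
    if (capped == 0) capped = 1;
    return {capped, kFloor, true};
}
```

With these constants the sketch reproduces the table above, including the -t 256 -g 1 cap to 167.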

Verified byte-identical output on -t 1 builds (the default
single-thread case) and full --check matrix on regular,
--canonical, multi-thread, and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Below ~4 GiB the streaming buffers + pthash's internal
working memory (which we don't fully control) can't
practically be made to fit; squeezing further has diminishing
returns. Rather than degrade the build into ever-tinier
buffers and ever-smaller pthash partition sizes, treat 4 GiB
as the effective floor — a modest requirement on today's
desktops.

Add `constants::min_ram_limit_in_GiB = 4` and apply it in the
single validation/normalization step at the build entrypoint
(both `dictionary::build` and `dictionary::build_streaming_save`).
A NOTE is printed in verbose mode whenever the user's `-g` is
raised. Configurations at or above the floor are unaffected.

Verified byte-identical output on the default build, full
--check on regular, --canonical, multi-thread, and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Picks up two commits on the pthash branch:

  90ace87  Account for input hash vector in construction memory estimate
  9600827  Drop the redundant in-search hashes term and the comment

The fix corrects pthash's `estimate_num_bytes_for_construction`,
which underestimated per-partition residency by `num_keys *
sizeof(hash_type)` (= 16 B/key for hash128) and by an extra
`num_keys * 8 B` double-counted "in-search hashes" term. With
the corrected estimate, pthash's `bytes < config.ram` flush
gate in the parallel partitioned-PHF build path actually
matches residency, so `mphf_build_config.ram = ram_limit/2`
will now bind pthash's parallel build batch to that budget on
inputs like HPRC k=63 m=31 canonical (where the previous
underestimate had pthash's batch peak around 1.5x the
configured budget).

Verified byte-identical SSHash output before/after the bump
on salmonella_enterica m=7 (the change is purely a
memory-accounting fix). Full --check matrix passes.

When pthash master receives this fix we'll bump again to
that tip; for now we point at the branch tip directly so the
HPRC benchmark can validate the new RSS bound.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
After the pthash memory-estimate fix, pthash's parallel-build
batch correctly binds at `ram_limit/2`. The remaining excess
on the HPRC k=63 m=31 -g 4 -t 64 benchmark was sshash's own
buffers (each currently sized at ram_limit/4) piled on top:

  - step 5 minimizer-tuples buffer: up to RAM_available/3
    (~1.1 GB at -g 4) — and freed but typically retained by
    glibc, so it lingers in process RSS through subsequent
    steps.
  - step 7.1 kmer-extraction request buffer: ram_limit/4
    (1 GiB at -g 4).
  - step 7.2 phase C position-tuple buffer: ram_limit/4
    (1 GiB at -g 4) — alive concurrently with pthash's
    parallel-build memory and the partition's MPHF.

Halve them all (cap at ram_limit/8 = 512 MiB at -g 4). The
external sorts get more flush rounds and slightly more disk
I/O, but peak RSS during the heaviest step (7.2 phase C)
drops by roughly:

  pthash 2 GiB  +  pos buffer 0.5 GiB  +  partition MPHF 0.3 GiB
  +  step-5 lingering 0.5 GiB  ≈  3.3 GiB

vs the prior ~4.8 GiB observed peak. Should fit comfortably
under -g 4.

Verified byte-identical SSHash output on salmonella_enterica
m=7 (more flush rounds = more intermediate run files but the
sort+merge is order-stable for the bytes we care about).

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
PR #97 ("fix pthash memory estimate") landed in pthash master.
This bump moves us from the (now-merged) branch tip 9600827
to the master squash-merge commit a95e814. The substantive
content of `internal_memory_builder_single_phf.hpp` is
byte-identical between the two, so SSHash's behavior is
unchanged from f68fa77.

This unpins us from the development branch URL and lets us
track pthash master going forward.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The build's various ifstream-backed forward readers all
followed the same pattern (open file, fixed-capacity record
buffer, refill on exhaust, has_next/current/advance interface).
Five separate copies of that loop existed:

  - file_merging_iterator::buffered_stream      (per merge run)
  - disk_backed_offsets_builder::reader         (per-thread offsets)
  - disk_backed_offsets_builder::full_iterator  (encode/build)
  - streaming_minimizers_iterator               (codewords MPHF input)
  - streaming_minimizer_bucket_reader           (step 7.1 buckets)
  - skew_partition_kmer_iterator                (skew MPHF input)

Hoist the common primitive into a new
`include/builder/buffered_record_stream.hpp`:

    template <typename T>
    struct buffered_record_stream {
        void open(filename, buffer_records, start_byte = 0);
        void close();
        bool empty() const;
        T const& current() const;
        void advance();
    };

and reuse it from each of the five readers. The pthash-iterator
ones (full_iterator, streaming_minimizers_iterator,
skew_partition_kmer_iterator) wrap a shared_ptr to keep the
copyable-by-value contract pthash expects.
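A minimal implementation matching the interface sketched above could look like this. It follows the shape the commit describes (fixed-capacity record buffer, refill on exhaust), but it is our own sketch, not the sshash source.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch: buffered forward reader over a file of fixed-size records.
// RSS is bounded by buffer_records * sizeof(T) regardless of file size.
template <typename T>
struct buffered_record_stream {
    void open(const std::string& filename, size_t buffer_records, uint64_t start_byte = 0) {
        m_in.open(filename, std::ios::binary);
        m_in.seekg(static_cast<std::streamoff>(start_byte));
        m_buf.resize(buffer_records);
        refill();
    }
    void close() { m_in.close(); }
    bool empty() const { return m_pos == m_end; }
    T const& current() const { return m_buf[m_pos]; }
    void advance() {
        if (++m_pos == m_end) refill();  // refill on exhaust
    }

private:
    void refill() {
        m_in.read(reinterpret_cast<char*>(m_buf.data()),
                  static_cast<std::streamsize>(m_buf.size() * sizeof(T)));
        m_end = static_cast<size_t>(m_in.gcount()) / sizeof(T);
        m_pos = 0;  // m_end == 0 at EOF, so empty() becomes true
    }
    std::ifstream m_in;
    std::vector<T> m_buf;
    size_t m_pos = 0, m_end = 0;
};
```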

While here, define a packed `skew_kmer_record_t<Kmer>` for the
phase B/C tmp file format and use it both at the writer and the
buffered reader, so the on-disk record layout is in one place.

Verified byte-identical SSHash output on salmonella_enterica
m=7 and full --check on regular, --canonical, multi-thread
(-t 4), and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Currently the build CLI picks between two paths:

  -o without --check         -> dictionary::build_streaming_save
                                (spilled components are stitched
                                into the output via the streaming
                                saver; `dict` is not query-ready
                                afterward)
  -o with --check (or no -o) -> dictionary::build
                                (spilled components are
                                materialized back into `dict`,
                                then optionally essentials::save)

For users with plenty of RAM who don't want the streaming-save
tmp-file concatenation (and don't need --check), expose the
in-RAM save path explicitly via --no-streaming-save. When set,
the build does build() + essentials::save: peak RSS at save time
briefly equals the in-RAM index size, but the save is a single
pass over `dict` rather than a stitched concatenation. Useful
when the user already pays the memory cost (e.g., to query the
dict immediately afterward in another tool, or just prefers the
simpler save path).

Both flows produce byte-identical output files; the flag only
affects the save path.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previously the streaming-save flow skipped print_space_breakdown
(d.m_spss.strings is an empty placeholder there, so calling the
breakdown would just print zeros), which meant the user got no
size summary at the end; master, by contrast, always prints a
"total ... bits/kmer" line.

Stat the saved file's size when `d` isn't materialized and
print the total directly. The in-RAM flow keeps the existing
per-component breakdown unchanged. Cost is one fstat at the
very end of the build.

Also fixes index_size_in_bytes in the JSON-line build_stats
output for the streaming-save flow (it used to report just the
in-RAM-resident bytes ≈ a few hundred MB instead of the actual
on-disk index size).

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
All build-step durations in the JSON-line build_stats output were raw
microseconds, which was hard to read. Convert them to "X.XXX [sec]" via
a small `musec_as_seconds_str` helper. Steps 7.1 and 7.2 are added
directly inside build_sparse_and_skew_index.cpp, so apply the same
helper there too.

Also switch finalize_stats to std::filesystem::file_size for the saved
index size, instead of an fstream + tellg.
These two structs in include/builder/util.hpp were the in-RAM iterator
types that walked an mmap'd minimizer-tuples buffer. The build pipeline
now reads minimizers via streaming_minimizers_iterator and
streaming_minimizer_bucket_reader (both built on buffered_record_stream),
and nothing else referenced bucket_type or minimizers_tuples_iterator.

Also drop the now-stale comment references to minimizers_tuples_iterator
in the surviving streaming iterators' docstrings.
The in-RAM build path now duplicated work: it materialized all spilled
components back into RAM after step 7 just so --check could query the
in-memory dict. Drop it. Make streaming-save the only build path:

* dictionary::build now takes output_filename; the streaming variant is
  gone and there is no longer a query-ready in-memory build.
* dictionary_builder::build_streaming_save -> build, and the
  materialize_spilled_into / materialize_compact_vector_from_file
  helpers are removed (~70 LOC).
* finalize_stats no longer branches on whether `d` is populated.
* tools/build.cpp always streams to a file. If the user passed -o, that
  path is used; otherwise a tmp file under tmp_dirname is written and
  removed on exit. --check loads the saved file via open_dictionary
  with mmap=true and runs the existing correctness checks against that.
Walks through the eight build steps, what each one produces, and where
each intermediate lives between steps. Documents the two mechanisms that
together cap peak RSS at --ram-limit:

* every input-size-scaling intermediate is spilled to disk,
* every working buffer is sized as a fixed fraction of ram_limit, with
  the fractions tabulated.
The doc referred to --ram-limit and --tmp-dirname; the actual short
flags exposed by sshash build are -g (RAM limit in GiB) and -d (tmp
dir). The pthash "ram" reference was a programmatic config field, not
a CLI flag — clarified.
@jermp
Owner Author

jermp commented May 6, 2026

@copilot resolve the merge conflicts in this pull request

…p version to 5.1.1

Co-authored-by: jermp <10065674+jermp@users.noreply.github.com>
Contributor

Copilot AI commented May 6, 2026

@copilot resolve the merge conflicts in this pull request

Resolved in commit 0002789. The only conflict was in include/builder/util.hpp — the alt-build branch had replaced the old pointer-based minimizers_bucket_reader with the new streaming approach, while master had fixed a bug in the old pointer-based next_begin(). I kept the streaming version from alt-build. The other files (README.md, include/constants.hpp) auto-merged cleanly.

Copilot finished work on behalf of jermp May 6, 2026 17:44
@jermp jermp merged commit 47b71ea into master May 6, 2026
16 checks passed
@jermp jermp deleted the alt-build branch May 6, 2026 17:52