Conversation
Cherry-picked from 1f0cdd1 on master. https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The skew-index build used to do a random read into the strings
bitvector for every super-kmer in every heavy-load bucket
(buckets are visited in size-sorted order, so the resulting
positions are essentially random across all of strings). That
forces strings to be RAM-resident throughout step 7.2.
This commit restructures step 7.2 into three sub-steps:
(A) walk the heavy-load buckets and emit one
kmer_extraction_request per super-kmer, externally
sorted by starting_pos (parallel_sort within a
bounded RAM buffer, ~1/4 of --ram-limit, flushed to
sorted runs when full; see the sketch after this list).
(B) merge the sorted runs and walk strings in a single
forward pass; for each request extract the requested
kmers and append (kmer.bits, pos_in_bucket) to a
per-partition tmp file. Only kmer.bits is serialized to
avoid persisting the vptr that uint_kmer_t carries via
its virtual destructor.
(C) for each partition, read its tmp file, build the MPHF
and the positions compact vector. The skew index is
assembled partition by partition.
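A minimal sketch of what the phase (A) record and its sorted-run
flush might look like; all field names, widths, and the flush
helper are illustrative, not the actual sshash types:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical request record emitted in phase (A): one per super-kmer
    // of a heavy-load bucket. Sorting by starting_pos turns the later reads
    // of `strings` into a single forward pass.
    struct kmer_extraction_request {
        uint64_t starting_pos;   // bit position of the super-kmer in strings
        uint32_t num_kmers;      // how many kmers to extract from it
        uint32_t partition_id;   // skew-index partition the kmers belong to
        uint64_t pos_in_bucket;  // position value to store alongside each kmer
        bool operator<(kmer_extraction_request const& other) const {
            return starting_pos < other.starting_pos;
        }
    };

    // Phase (A) fills a bounded buffer and flushes a sorted run whenever it
    // would exceed its RAM budget (~1/4 of --ram-limit in the commit above).
    void flush_sorted_run(std::vector<kmer_extraction_request>& buffer /*, run file */) {
        std::sort(buffer.begin(), buffer.end());
        // ... write buffer to a run file, then buffer.clear();
    }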
The access pattern over strings is now monotonically
non-decreasing, which is the precondition for moving strings
itself out of RAM in a follow-up change. Correctness verified
via `sshash build --check` on salmonella_enterica (m=7),
canonical mode, and salmonella_100 with 4 threads.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The strings bit-vector is the largest in-RAM structure during
construction. This commit moves it onto disk for the build
phase, replacing bits::bit_vector::builder with a new
disk_backed_strings storage:
- append_bits() during step 1 (encode_strings) writes
completed words to a tmp file, keeping only a small write
window in RAM (~512 KiB by default).
- After freeze(), make_reader() returns a forward-monotonic
reader over the file with a small read window (~512 KiB).
Each reader owns its own ifstream so multiple threads can
read concurrently without contention on the writer.
- The reader exposes get_word64(pos) const, matching the
interface that kmer_iterator<Kmer, BitVector> expects.
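A rough sketch of the storage surface described above; only the
append_bits / freeze / make_reader / get_word64 shape comes from
this commit message, everything else (names, window policy) is an
assumption:

    #include <cstdint>

    struct disk_backed_strings {
        // step 1: buffer bits in a ~512 KiB write window, spilling completed
        // 64-bit words to the tmp file as the window fills.
        void append_bits(uint64_t bits, uint64_t len);

        // flush the tail and close the writer; the file is now complete.
        void freeze();

        // a small forward reader (~512 KiB window); each reader owns its own
        // ifstream, so threads do not contend on the writer or on each other.
        struct reader {
            // returns the 64 bits starting at bit position pos; per reader,
            // positions must be monotonically non-decreasing.
            uint64_t get_word64(uint64_t pos) const;
        };
        reader make_reader() const;
    };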
Wiring:
- compute_minimizer_tuples (step 2): each thread instantiates
its own reader; per-thread RSS contribution is bounded by
its window size, not by the strings size.
- build_sparse_and_skew_index (step 7.1): the in-RAM
d.m_spss.strings is no longer populated here; step 7.2
phase (B) reads via a reader instead.
- A new step 8 materializes d.m_spss.strings from the on-disk
file, immediately before the standard essentials::save path.
This brings strings briefly back into RAM at the very end of
the build; eliminating that final peak requires a streaming
save path, which is a separate change.
Cleanup: the dictionary_builder destructor removes the strings
tmp file if it still exists (covers exception paths). The
RAM_taken_in_bytes calculation in step 5 (hash_minimizers) no
longer counts strings (its in-RAM footprint is now just the
writer window).
Verified via `sshash build --check` on:
- salmonella_enterica m=7 (heavy skew index, regular and
canonical),
- salmonella_100 m=11 -t 4 (multi-thread, 13M kmers).
Save -> load -> query roundtrip on the same dataset matches
100% on the canonical query set.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Introduces a streaming-save build path that writes the entire
dictionary to disk without ever materializing the strings
bit-vector in RAM. Combined with the disk-backed strings
storage from the previous commit, this means the strings live
exclusively on disk for the lifetime of the build, eliminating
the final RAM peak that step 8 previously reintroduced just
before essentials::save.
Mechanism:
- disk_backed_strings::save_to(ostream&) emits the same byte
layout as bits::bit_vector's serialization (uint64_t
m_num_bits, size_t n, then n*8 bytes), reading the words
from the tmp file in 64 KiB chunks.
- streaming_strings_saver wraps essentials::generic_saver and
overrides visit() for bits::bit_vector: when the visited
instance matches a known address (d.m_spss.strings), the
streaming serializer is invoked; everything else goes
through the normal essentials path. Address matching avoids
introducing a marker type into bits::bit_vector.
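A hedged sketch of what such a streaming serializer could look
like, assuming the tmp file holds the raw 64-bit words and that n
is simply the word count; any extra padding word that
bits::bit_vector may write is not shown:

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Emit the same byte layout as bits::bit_vector's save:
    // uint64_t num_bits, size_t n, then n*8 bytes of words.
    void save_to(std::ostream& os, std::string const& tmp_filename, uint64_t num_bits) {
        uint64_t n = (num_bits + 63) / 64;  // number of 64-bit words
        os.write(reinterpret_cast<char const*>(&num_bits), sizeof(num_bits));
        size_t n_as_size = n;
        os.write(reinterpret_cast<char const*>(&n_as_size), sizeof(n_as_size));

        std::ifstream in(tmp_filename, std::ios::binary);
        std::vector<char> chunk(64 * 1024);  // 64 KiB chunks
        uint64_t remaining = n * sizeof(uint64_t);
        while (remaining > 0) {
            uint64_t take = std::min<uint64_t>(remaining, chunk.size());
            in.read(chunk.data(), take);
            os.write(chunk.data(), take);
            remaining -= take;
        }
    }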
API:
- dictionary_builder::build() (existing) materializes strings
in RAM at the end so the dictionary is query-ready
(--check, etc.).
- dictionary_builder::build_streaming_save() runs steps 1-7
and stream-saves directly to the output file, leaving
d.m_spss.strings empty. The dictionary is *not*
query-ready after this; reload from disk to query.
- dictionary::build_streaming_save() exposes the new flow.
tools/build.cpp uses the streaming-save path automatically when
-o is given without --check; otherwise it falls back to the
materializing path so --check can run queries against d.
Verified by building the same input via both paths and
diff'ing the output file: byte-identical for regular,
--canonical, --weighted, and 4-thread builds. Loading the
streaming-saved file via essentials::load and running the
standalone `sshash check` and `sshash query` tools
returns "EVERYTHING OK!" and 100% positive matches.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previously step 7.1 copied every minimizer tuple of every non-singleton bucket into an in-RAM vector `tuples`, just to give bucket_type a contiguous backing store. The mmap'd `input` already provides exactly that, so the copy was pure overhead (~18 B per super-kmer in non-singleton buckets, scaling with the input). Now bucket_type stores raw minimizer_tuple pointers into input.data() directly, and `input` is kept open through step 7.2 phase (A) (which is the last consumer). After phase (A) both `buckets` and `input` are released. Verified byte-identical output vs the previous commit on salmonella_enterica m=7, plus full --check on regular, --canonical, multi-thread (-t 4), and --weighted builds. https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
…ry MPHF
Phase C used to materialize the full per-skew-partition working
set in RAM:
- kmers vector (~16 B/kmer, default uint64_t-backed kmer)
- positions_in_bucket vector (~4 B/kmer)
- cvb_positions builder (~num_bits_per_pos / 8 B/kmer, the
actual stored output)
For a partition with N kmers this peaked at ~21 N bytes (e.g.
20 GB for a 1 B-kmer partition). The kmers and
positions_in_bucket vectors were redundant in-RAM copies of
data already on disk in the per-partition tmp file written by
phase (B).
This commit replaces them with two streaming passes over the
tmp file:
(1) MPHF build via pthash's `build_in_external_memory` driven
by a small forward iterator (`skew_partition_kmer_iterator`)
that reads `(kmer.bits, pos_in_bucket)` records via a
shared_ptr<ifstream>. pthash spills hashes to
`tmp_dirname` under a `--ram-limit / 2` RAM budget rather
than holding all keys + hashes simultaneously.
(2) A second sequential pass over the same tmp file fills
cvb_positions: for each `(kmer, pib)` record it sets
cvb_positions[F(kmer)] = pib.
Only cvb_positions itself stays resident through both passes,
and it's the actual stored output (not a transient).
The iterator must be copyable because pthash's
`build_in_external_memory` takes the iterator by value; the
shared_ptr<ifstream> means copies share the underlying stream
state. After the build call returns, the original at the call
site is unused, so the shared advancement is harmless.
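A minimal sketch of a copyable, file-backed forward iterator in
this spirit; the record layout, member names, and operator set are
illustrative rather than the real skew_partition_kmer_iterator:

    #include <cstdint>
    #include <fstream>
    #include <memory>
    #include <string>

    // Copies share the ifstream via shared_ptr, satisfying pthash's
    // by-value iterator requirement; the caller advances exactly
    // num_keys times, so EOF handling is elided.
    template <typename KmerBits>
    struct file_kmer_iterator {
        struct record {
            KmerBits bits;
            uint32_t pos_in_bucket;
        };

        explicit file_kmer_iterator(std::string const& filename)
            : m_in(std::make_shared<std::ifstream>(filename, std::ios::binary)) {
            advance();
        }

        KmerBits operator*() const { return m_current.bits; }

        file_kmer_iterator& operator++() {
            advance();
            return *this;
        }

    private:
        void advance() {
            m_in->read(reinterpret_cast<char*>(&m_current), sizeof(record));
        }

        std::shared_ptr<std::ifstream> m_in;  // shared: copies advance together
        record m_current{};
    };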
Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Removes the last mmap from the build hot path. The merged
minimizers file used to be opened via mm::file_source and
walked twice — once for stats and once to populate a
size-sorted in-RAM `buckets` vector — and then walked again
in step 7.2 phase (A) via pointers into the mmap. The recent
"point bucket_type into the mmap" change pushed even more
work onto the page cache, which violates the user's hard RAM
budget (mmap pages can be resident up to file size when the
machine has enough memory).
This commit replaces all of that with two sequential
std::ifstream passes (no mmap) over the merged minimizers
file:
Pass 1 (stats): unchanged in spirit. Uses a new
`streaming_minimizer_bucket_reader` that buffers one
bucket at a time (peak ~ max_bucket_size * 18 B). Feeds
`buckets_statistics` exactly as before.
Pass 2 (combined sparse + heavy + emit kmer requests):
folds the former step 7.1 main pass and step 7.2 phase
(A) into a single bucket-by-bucket loop. The size-sorted
iteration over `buckets` is gone; instead:
- `begin_buckets_of_size[s]` is precomputed from the
bucket-size histogram (new accessor
`buckets_statistics::num_buckets_of_size`),
- mid_load positions are written via per-size cursors
using `compact_vector::builder::set` instead of
`push_back`,
- heavy_load positions are appended in file order via a
single monotone cursor,
- heavy buckets emit kmer-extraction requests in-line.
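A sketch of how the per-size cursor origins could be precomputed
from the histogram, assuming each non-singleton, non-heavy bucket
of size s contributes s position entries grouped by size class;
the exact layout in sshash may differ:

    #include <cstdint>
    #include <vector>

    std::vector<uint64_t> compute_begin_buckets_of_size(
        std::vector<uint64_t> const& num_buckets_of_size,  // histogram: index = size
        uint64_t heavy_threshold) {
        std::vector<uint64_t> begin(num_buckets_of_size.size(), 0);
        uint64_t offset = 0;
        for (uint64_t s = 2; s < num_buckets_of_size.size() && s < heavy_threshold; ++s) {
            begin[s] = offset;
            offset += num_buckets_of_size[s] * s;  // s positions per bucket of size s
        }
        return begin;  // begin[s] is then advanced as a per-size cursor during the pass
    }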
The `buckets` vector and the entire `tuples` array are gone.
No memory mapping anywhere in step 7. RAM footprint of step
7.1 is now bounded by:
- max_bucket_size * 18 B (one bucket at a time),
- the sparse-index builders being assembled (proportional
to non-singleton positions, bits-packed),
- the kmer-extraction request buffer (~ ram_limit / 4).
Output bytes differ from the previous commit because the
ordering of positions inside mid_load_buckets and
heavy_load_buckets is now file order instead of size-sorted
order; the codewords are updated to match. The index is
self-consistent: full --check passes on regular, --canonical,
multi-thread (-t 4), and --weighted builds, and a streaming-
save round-trip via `sshash check` and `sshash query`
returns "EVERYTHING OK!" and 100% positive matches.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
This commit eliminates the last three mm::file_source usages
in the build, so the only mmap left in step 7 is inside
pthash's external-memory PHF builder (which manages its own
RAM via config.ram).
(1) file_merging_iterator: each input run is now read with a
bounded buffered std::ifstream (default 4096 records per
stream) instead of mm::file_source. The winner-tree merge
logic is unchanged; comparisons just use the in-RAM
buffer's current value rather than a pointer into mmap'd
memory. RSS for the merge is now bounded by
`num_runs * buffer_records * sizeof(T)` regardless of run
sizes.
(2) minimizers_tuples::merge: the post-rename single-file
count path used to mmap the merged file via
mm::file_source and walk it with minimizers_tuples_iterator.
It now uses streaming_minimizer_bucket_reader (hoisted
from build_sparse_and_skew_index.cpp into util.hpp) for a
pure ifstream pass.
(3) dictionary_builder::build_mphf: replaces mm::file_source
+ minimizers_tuples_iterator with a new
streaming_minimizers_iterator that yields each distinct
minimizer once via std::ifstream. The iterator is
copyable (shared_ptr<ifstream>), as required by pthash's
by-value `build_in_external_memory` signature.
Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds, plus
a streaming-save round-trip (sshash check + sshash query)
returning "EVERYTHING OK!" on all five suites and 100%
positive matches.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The strings_offsets_builder used to be an in-RAM
std::vector<uint64_t> of size num_strings + 1, finalized into
d.m_spss.strings_offsets at step 7.1. For inputs with many
strings this is non-trivial RSS (8 B per string), held
through steps 1-7.
This commit replaces it with a disk-backed
disk_backed_offsets_builder<Offsets>:
- push_back() during step 1 spills to a tmp file under a
small in-RAM write buffer (~32 KiB),
- compute_minimizer_tuples (step 2) opens one
`reader` per thread, positioned at the thread's
index_begin, and walks forward sequentially via a
bounded read buffer (~32 KiB / thread). The per-thread
`[i]`, `[i+1]` access pattern is replaced with a single
rolling `prev_offset = next()` cursor.
- encode(offset, begin, string_id) is now a pure const
method on the disk-backed builder (depends only on
m_nb and m_size), preserving multi-threaded safety.
- build(target) at step 7.1 streams the file's contents
via a copyable forward iterator into target.m_seq's
encode/build, so neither side materializes the offsets
in RAM. The on-disk file is removed after build.
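A sketch of the per-thread reader and the rolling-cursor usage,
with the buffer size and member names assumed rather than taken
from the code:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    struct offsets_reader {
        offsets_reader(std::string const& filename, uint64_t index_begin)
            : m_in(filename, std::ios::binary) {
            m_in.seekg(index_begin * sizeof(uint64_t));  // thread's starting offset
        }

        uint64_t next() {
            if (m_pos == m_buffer.size()) refill();
            return m_buffer[m_pos++];
        }

    private:
        void refill() {
            m_buffer.resize(32 * 1024 / sizeof(uint64_t));  // ~32 KiB window
            m_in.read(reinterpret_cast<char*>(m_buffer.data()),
                      m_buffer.size() * sizeof(uint64_t));
            m_buffer.resize(m_in.gcount() / sizeof(uint64_t));  // short read near EOF
            m_pos = 0;
        }

        std::ifstream m_in;
        std::vector<uint64_t> m_buffer;
        size_t m_pos = 0;
    };

    // Usage: instead of offsets[i] and offsets[i+1], keep a rolling cursor:
    //   uint64_t prev_offset = reader.next();
    //   while (/* more strings */) {
    //       uint64_t next_offset = reader.next();
    //       /* ... use [prev_offset, next_offset) ... */
    //       prev_offset = next_offset;
    //   }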
To keep `target.m_seq` accessible from the external builder
without exposing it broadly, `offsets<Seq>` befriends
`disk_backed_offsets_builder<...>` via a templated friend
declaration; the concrete decoded_offsets / encoded_offsets
types inherit that friendship.
Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, and a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The four largest in-RAM "stored output" structures during the
build are the sparse-index compact_vectors (control_codewords,
mid_load_buckets, heavy_load_buckets) and the per-skew-partition
cvb_positions. For huge inputs these can each be many GB. This
commit makes the build never materialize them in RAM:
- new streaming_compact_vector_writer: writes the same byte
layout as bits::compact_vector::visit_impl (size, width,
mask, owning_span<u64>) directly to a file via a 2-word
rolling window, accepting set(index, value) in monotonic
index order. Matches the +1 padding word that the in-RAM
builder allocates so on-disk bytes are identical.
- control_codewords: indices are mphf(minimizer), and the
merged minimizers file is sorted by mphf hash, so writes
during the combined pass are strictly monotonic. Streamed
directly via the writer; no external sort.
- heavy_load_buckets: single monotone cursor, also streamed
directly via the writer.
- mid_load_buckets: per-size cursors interleave across the
pass, so each size class's positions are written to its
own raw-uint64 tmp file (monotonic within size). After the
combined pass, the per-size files are streamed in size
order through a streaming_compact_vector_writer to assemble
the final mid_load_buckets file.
- cvb_positions per skew partition: writes are random by
F(kmer). After the partition's MPHF is built, a second
pass over the partition's kmer file emits
(F(kmer), pos_in_bucket) tuples through the existing
parallel_sort + flush + file_merging_iterator external-sort
machinery; the sorted stream feeds a per-partition
streaming_compact_vector_writer.
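A sketch of the rolling two-word bit packer at the heart of such a
writer, assuming values arrive at consecutive indices and are
already masked to `width` bits; header emission, the mask field,
the final tail flush, and the +1 padding word are all elided:

    #include <cstdint>
    #include <fstream>

    struct streaming_packer {
        streaming_packer(std::ofstream& out, uint64_t width)
            : m_out(out), m_width(width) {}

        void append(uint64_t value) {
            m_window[m_cur_word] |= value << m_bit_in_word;
            if (m_bit_in_word + m_width > 64) {
                // value straddles into the other word of the 2-word window
                m_window[m_cur_word ^ 1] |= value >> (64 - m_bit_in_word);
            }
            m_bit_in_word += m_width;
            if (m_bit_in_word >= 64) {
                // low word is complete: flush it and rotate the window
                m_out.write(reinterpret_cast<char const*>(&m_window[m_cur_word]), 8);
                m_window[m_cur_word] = 0;
                m_cur_word ^= 1;
                m_bit_in_word -= 64;
            }
        }
        // a flush_tail() writing the partially filled word (plus padding)
        // is needed at the end; elided here.

        std::ofstream& m_out;
        uint64_t m_width;
        uint64_t m_window[2] = {0, 0};
        uint64_t m_bit_in_word = 0;  // write position within the current low word
        uint64_t m_cur_word = 0;
    };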
The streaming saver is extended to take a substitution map
keyed by `bits::compact_vector const*`. dictionary_builder
populates the dictionary's compact_vector slots with empty
placeholders, takes their addresses, and registers each
spilled tmp file. The save pass copies bytes from each tmp
file at the matching visit slot.
For the materializing `build()` flow (used by --check), a new
`materialize_spilled_into(d)` step re-loads each spilled tmp
file back into the dictionary's in-RAM compact_vectors via
essentials::loader, so queries work afterward. This brings
the RAM peak back briefly at the very end (acceptable since
--check inherently needs the full index in RAM).
`build_streaming_save()` never materializes; the spilled tmp
files are concatenated into the output by the saver and then
removed.
Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, plus a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
No tmp file leaks.
Remaining proportional-to-input items in RAM during build:
the codewords MPHF (step 4) and the per-skew-partition MPHFs
(step 7.2 phase C). pthash returns these as in-memory
structs; spilling them would require pthash changes or an
intermediate save/load step. Everything else is now bounded
by the explicit --ram-limit.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The accumulating in-RAM MPHFs were the dominant resident
memory at the end of phase C. For the HPRC k=63 m=31 canonical
benchmark (5.9 B kmers, ~5.4 B in skew at ~3 bits/key) they
sum to ~2 GB held simultaneously through save.
This commit spills them to disk and concatenates at save:
- Step 5 (hash_minimizers): the codewords MPHF is no longer
needed after the minimizer values are remapped. Save it to
a tmp file via essentials::save and default-assign the
in-RAM struct to free it.
- Step 7.2 phase C: after each partition's cvb_positions has
been written to disk, save that partition's MPHF to a tmp
file and default-assign it. Subsequent partitions never
coexist with prior partitions' MPHFs.
- The streaming saver gains an address+type-keyed
substitution map (typed_address_sub). Type discrimination
is necessary because in C++ a struct's address coincides
with the address of its first member when the struct has
standard layout, so address alone is ambiguous (visiting
sparse_and_skew_index would otherwise match a substitution
registered for its first member's first member, the
codewords MPHF).
- dictionary_builder uses a `register_sub<T>` helper that
captures the type via typeid for each registration. The
saver only fires the substitution when both address and
std::type_index(typeid(T)) match.
- Materializing build() flow loads each spilled MPHF back
via essentials::loader so queries work afterward.
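A sketch of an address+type-keyed substitution map along these
lines; the class and method names are illustrative, only the
address + std::type_index matching rule comes from the commit:

    #include <functional>
    #include <string>
    #include <typeindex>
    #include <typeinfo>
    #include <unordered_map>
    #include <utility>

    struct typed_address_sub {
        template <typename T>
        void register_sub(T const* address, std::string spilled_filename) {
            m_map[key{address, std::type_index(typeid(T))}] = std::move(spilled_filename);
        }

        // fires only when both the visited address and its static type match
        template <typename T>
        std::string const* find(T const& visited) const {
            auto it = m_map.find(key{&visited, std::type_index(typeid(T))});
            return it == m_map.end() ? nullptr : &it->second;
        }

    private:
        struct key {
            void const* address;
            std::type_index type;
            bool operator==(key const& o) const {
                return address == o.address && type == o.type;
            }
        };
        struct key_hash {
            size_t operator()(key const& k) const {
                return std::hash<void const*>{}(k.address) ^ (k.type.hash_code() << 1);
            }
        };
        std::unordered_map<key, std::string, key_hash> m_map;
    };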
Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, plus a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
No tmp file leaks.
After this commit, the proportional-to-input items in RAM
during build are bounded by --ram-limit:
- pthash external-memory builder (capped at ram_limit/2),
- one current MPHF being built (~bits/key * partition_size),
- per-step external-sort buffers (capped),
- the stored compact_vectors, MPHFs, etc. all spill.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
pthash's `partitioned_phf::build` builds the partitioned MPHF's
sub-partitions in parallel, with `mphf_build_config.num_threads`
sub-partitions running simultaneously. Each per-partition build
allocates a `pairs_t` of ~`avg_partition_size * 16 B` (~48 MB with
the default avg_partition_size = 3M); with -t 64 that balloons to
~3 GB of pthash internal memory, dominating the build's RSS
regardless of how aggressively sshash spills its own structures.
Add `util::cap_mphf_num_threads(requested, ram_limit_in_GiB)`:
budget ~ram_limit/4 GiB for pthash's per-partition build memory
and conservatively assume 64 MiB per parallel sub-partition
(48 MB pairs_t + sort temporary + slack). Apply at the two pthash
build sites (step 4 codewords MPHF, step 7.2 phase C skew
partition MPHFs). For -g 2 -t 64 this caps pthash to 8 threads
(ram_limit/4 = 512 MiB budget; 512 MiB / 64 MiB per
sub-partition = 8).
Build time may increase modestly; peak RSS should drop well below
the previous ~3.3 GB toward the 2 GiB target.
Verified byte-identical output on salmonella_enterica m=7, plus
full --check on regular, --canonical, multi-thread (-t 4), and
--weighted.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The previous cap formula (ram_limit/4) was too aggressive — it
fired even at default -g 8 -t 64 where capping isn't needed. That
silently changed pthash's parallelism on configurations the user
didn't intend to be tight, breaking expectations of "-t means t
threads".
Loosen the budget to ram_limit/2: leave roughly half of
--ram-limit available to pthash's parallel sub-partition build
memory (the other half covers the streaming buffers, external
sort buffers, etc.). Cap pthash threads only when the user's -t
would push pthash past that half-budget.
Common cases now pass through unchanged:
  -g 8  -t 64   no cap (4 GiB / 64 MiB = 64 >= 64)
  -g 4  -t 32   no cap
  -g 2  -t 16   no cap
  -g 16 -t 128  no cap
  -g 2  -t 64   cap to 16 (pathological: tight budget vs many threads)
  -g 4  -t 64   cap to 32
When the cap fires, log a clear warning naming the MPHF, the
requested thread count, the cap, and the budget. Verbose mode
only.
Verified byte-identical output and full --check matrix still
pass.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previous approach quietly capped pthash's num_threads under tight
--ram-limit, which broke the user's expectation that "-t N means
N threads". The actual lever pthash exposes for per-thread memory
is `avg_partition_size`: per-partition build memory is roughly
`avg_partition_size * sizeof(pair)` bytes.
Replace the unconditional thread cap with a configuration that
divides the pthash RAM budget (half of --ram-limit) evenly across
the user's requested threads, then derives an
`avg_partition_size` for that per-thread budget:
  - per_thread_budget = (ram_limit / 2) / num_threads
  - avg = per_thread_budget / per_key_estimate (32 B)
  - if avg >= default (3M): use default; -t honored
  - elif avg >= floor (100K): use avg; -t honored
  - else: cap -t so floor fits, warn
The warning only fires in the pathological case (so many threads
at so little RAM that even pthash's quality floor can't fit).
Common configurations pass through unchanged:
  -t 1   -g 8   no change (default partition 3M)
  -t 64  -g 8   -t honored, partition 2M
  -t 64  -g 4   -t honored, partition 1M
  -t 64  -g 2   -t honored, partition 524K
  -t 256 -g 1   warn + cap to 167 (-t couldn't be honored)
Verified byte-identical output on -t 1 builds (the default
single-thread case) and full --check matrix on regular,
--canonical, multi-thread, and --weighted.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
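A sketch of the derivation above, with the constants taken from
the commit message (32 B per key, 3M default partition, 100K
floor); the helper's name and signature are hypothetical:

    #include <algorithm>
    #include <cstdint>
    #include <utility>

    std::pair<uint64_t /*threads*/, uint64_t /*avg_partition_size*/>
    derive_mphf_config(uint64_t num_threads, uint64_t ram_limit_bytes) {
        constexpr uint64_t per_key_bytes = 32;
        constexpr uint64_t default_partition = 3'000'000;
        constexpr uint64_t floor_partition = 100'000;

        uint64_t per_thread_budget = (ram_limit_bytes / 2) / num_threads;
        uint64_t avg = per_thread_budget / per_key_bytes;

        if (avg >= default_partition) return {num_threads, default_partition};  // -t honored
        if (avg >= floor_partition) return {num_threads, avg};                  // -t honored
        // pathological: cap threads so the quality floor fits, and warn
        uint64_t capped = std::max<uint64_t>(
            1, (ram_limit_bytes / 2) / (floor_partition * per_key_bytes));
        return {capped, floor_partition};
    }

For example, -t 256 -g 1 gives (0.5 GiB / 256) / 32 B = 64K < 100K,
so threads are capped to 0.5 GiB / (100K * 32 B) = 167, matching
the table above.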
Below ~4 GiB the streaming buffers + pthash's internal working memory (which we don't fully control) can't practically be made to fit; squeezing further has diminishing returns. Rather than degrade the build into ever-tinier buffers and ever-smaller pthash partition sizes, treat 4 GiB as the effective floor — a modest requirement on today's desktops. Add `constants::min_ram_limit_in_GiB = 4` and apply it in the single validation/normalization step at the build entrypoint (both `dictionary::build` and `dictionary::build_streaming_save`). A NOTE is printed in verbose mode whenever the user's `-g` is raised. Configurations at or above the floor are unaffected. Verified byte-identical output on the default build, full --check on regular, --canonical, multi-thread, and --weighted. https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Picks up two commits on the pthash branch:
  90ace87 Account for input hash vector in construction memory estimate
  9600827 Drop the redundant in-search hashes term and the comment
The fix corrects pthash's `estimate_num_bytes_for_construction`,
which underestimated per-partition residency by
`num_keys * sizeof(hash_type)` (= 16 B/key for hash128) and by an
extra `num_keys * 8 B` double-counted "in-search hashes" term.
With the corrected estimate, pthash's `bytes < config.ram` flush
gate in the parallel partitioned-PHF build path actually matches
residency, so `mphf_build_config.ram = ram_limit/2` will now bind
pthash's parallel build batch to that budget on inputs like HPRC
k=63 m=31 canonical (where the previous underestimate had
pthash's batch peak around 1.5x the configured budget).
Verified byte-identical SSHash output before/after the bump on
salmonella_enterica m=7 (the change is purely a memory-accounting
fix). Full --check matrix passes.
When pthash master receives this fix we'll bump again to that
tip; for now we point at the branch tip directly so the HPRC
benchmark can validate the new RSS bound.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
After the pthash memory-estimate fix, pthash's parallel-build
batch correctly binds at `ram_limit/2`. The remaining excess
on the HPRC k=63 m=31 -g 4 -t 64 benchmark was sshash's own
buffers (each currently sized at ram_limit/4) piled on top:
- step 5 minimizer-tuples buffer: up to RAM_available/3
(~1.1 GB at -g 4) — and freed but typically retained by
glibc, so it lingers in process RSS through subsequent
steps.
- step 7.1 kmer-extraction request buffer: ram_limit/4
(1 GiB at -g 4).
- step 7.2 phase C position-tuple buffer: ram_limit/4
(1 GiB at -g 4) — alive concurrently with pthash's
parallel-build memory and the partition's MPHF.
Halve them all (cap at ram_limit/8 = 512 MiB at -g 4). The
external sorts get more flush rounds and slightly more disk
I/O, but peak RSS during the heaviest step (7.2 phase C)
drops to roughly:
pthash 2 GiB + pos buffer 0.5 GiB + partition MPHF 0.3 GB
+ step-5 lingering 0.5 GiB ≈ 3.3 GiB
down from the prior ~4.8 GiB observed peak. This should fit
comfortably under -g 4.
Verified byte-identical SSHash output on salmonella_enterica
m=7 (more flush rounds = more intermediate run files but the
sort+merge is order-stable for the bytes we care about).
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
PR #97 ("fix pthash memory estimate") landed in pthash master.
This bump moves us from the (now-merged) branch tip 9600827
to the master squash-merge commit a95e814. The substantive
content of `internal_memory_builder_single_phf.hpp` is
byte-identical between the two, so SSHash's behavior is
unchanged from f68fa77.
This unpins us from the development branch URL and lets us
track pthash master going forward.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The build's various ifstream-backed forward readers all
followed the same pattern (open file, fixed-capacity record
buffer, refill on exhaust, has_next/current/advance interface).
Six separate copies of that loop existed:
- file_merging_iterator::buffered_stream (per merge run)
- disk_backed_offsets_builder::reader (per-thread offsets)
- disk_backed_offsets_builder::full_iterator (encode/build)
- streaming_minimizers_iterator (codewords MPHF input)
- streaming_minimizer_bucket_reader (step 7.1 buckets)
- skew_partition_kmer_iterator (skew MPHF input)
Hoist the common primitive into a new
`include/builder/buffered_record_stream.hpp`:
template <typename T>
struct buffered_record_stream {
void open(filename, buffer_records, start_byte = 0);
void close();
bool empty() const;
T const& current() const;
void advance();
};
and reuse it from each of the six readers. The pthash-iterator
ones (full_iterator, streaming_minimizers_iterator,
skew_partition_kmer_iterator) wrap a shared_ptr to keep the
copyable-by-value contract pthash expects.
While here, define a packed `skew_kmer_record_t<Kmer>` for the
phase B/C tmp file format and use it both at the writer and the
buffered reader, so the on-disk record layout is in one place.
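A sketch of the refill loop such a buffered_record_stream<T>
might use; the interface matches the listing above, while the
buffering details are assumptions (T is assumed trivially
copyable):

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    template <typename T>
    struct buffered_record_stream {
        void open(std::string const& filename, uint64_t buffer_records,
                  uint64_t start_byte = 0) {
            m_in.open(filename, std::ios::binary);
            m_in.seekg(start_byte);
            m_capacity = buffer_records;
            refill();
        }
        void close() { m_in.close(); }
        bool empty() const { return m_pos == m_buffer.size(); }
        T const& current() const { return m_buffer[m_pos]; }
        void advance() {
            if (++m_pos == m_buffer.size()) refill();
        }

    private:
        void refill() {
            m_buffer.resize(m_capacity);
            m_in.read(reinterpret_cast<char*>(m_buffer.data()), m_capacity * sizeof(T));
            m_buffer.resize(m_in.gcount() / sizeof(T));  // short read near EOF
            m_pos = 0;
        }

        std::ifstream m_in;
        std::vector<T> m_buffer;
        uint64_t m_capacity = 0;
        uint64_t m_pos = 0;
    };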
Verified byte-identical SSHash output on salmonella_enterica
m=7 and full --check on regular, --canonical, multi-thread
(-t 4), and --weighted.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Currently the build CLI picks between two paths:
-o without --check -> dictionary::build_streaming_save
(spilled components are stitched
into the output via the streaming
saver; `dict` is not query-ready
afterward)
-o with --check (or no -o) -> dictionary::build
(spilled components are
materialized back into `dict`,
then optionally essentials::save)
For users with plenty of RAM who don't want the streaming-save
tmp-file concatenation (and don't need --check), expose the
in-RAM save path explicitly via --no-streaming-save. When set,
the build does build() + essentials::save: peak RSS at save time
briefly equals the in-RAM index size, but the save is a single
pass over `dict` rather than a stitched concatenation. Useful
when the user already pays the memory cost (e.g., to query the
dict immediately afterward in another tool, or just prefers the
simpler save path).
Both flows produce byte-identical output files; the flag only
affects the save path.
https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
This reverts commit a35c364.
Previously the streaming-save flow skipped print_space_breakdown (d.m_spss.strings is an empty placeholder there, so calling breakdown would just print zeros), which meant the user got no size summary at the end. Master always prints "total ... bits/kmer" though. Stat the saved file's size when `d` isn't materialized and print the total directly. The in-RAM flow keeps the existing per-component breakdown unchanged. Cost is one fstat at the very end of the build. Also fixes index_size_in_bytes in the JSON-line build_stats output for the streaming-save flow (it used to report just the in-RAM-resident bytes ≈ a few hundred MB instead of the actual on-disk index size). https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
All build-step durations in the JSON-line build_stats output were raw microseconds, which was hard to read. Convert them to "X.XXX [sec]" via a small `musec_as_seconds_str` helper. Steps 7.1 and 7.2 are added directly inside build_sparse_and_skew_index.cpp, so apply the same helper there too. Also switch finalize_stats to std::filesystem::file_size for the saved index size, instead of an fstream + tellg.
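A minimal sketch of such a helper, assuming plain snprintf
formatting (the exact rounding and naming in sshash may differ):

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Format a microsecond count as "X.XXX [sec]".
    std::string musec_as_seconds_str(uint64_t musec) {
        char buf[32];
        std::snprintf(buf, sizeof(buf), "%.3f [sec]", musec / 1'000'000.0);
        return std::string(buf);
    }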
These two structs in include/builder/util.hpp were the in-RAM iterator types that walked an mmap'd minimizer-tuples buffer. The build pipeline now reads minimizers via streaming_minimizers_iterator and streaming_minimizer_bucket_reader (both built on buffered_record_stream), and nothing else referenced bucket_type or minimizers_tuples_iterator. Also drop the now-stale comment references to minimizers_tuples_iterator in the surviving streaming iterators' docstrings.
The in-RAM build path now duplicated work: it materialized all
spilled components back into RAM after step 7 just so --check
could query the in-memory dict. Drop it. Make streaming-save the
only build path:
* dictionary::build now takes output_filename; the streaming
  variant is gone and there is no longer a query-ready in-memory
  build.
* dictionary_builder::build_streaming_save -> build, and the
  materialize_spilled_into / materialize_compact_vector_from_file
  helpers are removed (~70 LOC).
* finalize_stats no longer branches on whether `d` is populated.
* tools/build.cpp always streams to a file. If the user passed
  -o, that path is used; otherwise a tmp file under tmp_dirname
  is written and removed on exit. --check loads the saved file
  via open_dictionary with mmap=true and runs the existing
  correctness checks against that.
Walks through the eight build steps, what each one produces, and
where each intermediate lives between steps. Documents the two
mechanisms that together cap peak RSS at --ram-limit:
* every input-size-scaling intermediate is spilled to disk,
* every working buffer is sized as a fixed fraction of ram_limit,
  with the fractions tabulated.
The doc referred to --ram-limit and --tmp-dirname; the actual short flags exposed by sshash build are -g (RAM limit in GiB) and -d (tmp dir). The pthash "ram" reference was a programmatic config field, not a CLI flag — clarified.
@copilot resolve the merge conflicts in this pull request
…p version to 5.1.1 Co-authored-by: jermp <10065674+jermp@users.noreply.github.com>
Resolved in commit