
Alt build #92

Merged
jermp merged 33 commits into master from alt-build
May 6, 2026

Conversation

@jermp
Owner

@jermp jermp commented May 6, 2026

No description provided.

jermp and others added 30 commits April 10, 2026 19:04
The skew-index build used to do a random read into the strings
bitvector for every super-kmer in every heavy-load bucket
(buckets are visited in size-sorted order, so the resulting
positions are essentially random across all of strings). That
forces strings to be RAM-resident throughout step 7.2.

This commit restructures step 7.2 into three sub-steps:

  (A) walk the heavy-load buckets and emit one
      kmer_extraction_request per super-kmer, externally
      sort+flushed by starting_pos (parallel_sort within a
      bounded RAM buffer, ~1/4 of --ram-limit).

  (B) merge the sorted runs and walk strings in a single
      forward pass; for each request extract the requested
      kmers and append (kmer.bits, pos_in_bucket) to a
      per-partition tmp file. Only kmer.bits is serialized to
      avoid persisting the vptr that uint_kmer_t carries via
      its virtual destructor.

  (C) for each partition, read its tmp file, build the MPHF
      and the positions compact vector. The skew index is
      assembled partition by partition.
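The sort-then-merge pattern behind phases (A) and (B) can be sketched as follows. This is a minimal illustration, not the sshash code: the record type (`extraction_request`), the helper names, and the in-memory collection of the merged output are all ours; the real build streams the merged records forward instead of collecting them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

// Hypothetical record mirroring a kmer_extraction_request:
// only starting_pos matters for the sort order here.
struct extraction_request {
    uint64_t starting_pos;
    uint32_t bucket_id;
};

// Sort a bounded in-RAM buffer by starting_pos and flush it as one run.
static void flush_run(std::vector<extraction_request>& buf, const std::string& path) {
    std::sort(buf.begin(), buf.end(),
              [](const extraction_request& a, const extraction_request& b) {
                  return a.starting_pos < b.starting_pos;
              });
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(buf.data()),
              static_cast<std::streamsize>(buf.size() * sizeof(extraction_request)));
    buf.clear();
}

// Merge the sorted runs into one position-sorted sequence (collected
// into a vector for clarity; a real build would consume it streamingly).
static std::vector<extraction_request> merge_runs(const std::vector<std::string>& runs) {
    struct head { extraction_request rec; size_t run; };
    auto cmp = [](const head& a, const head& b) {
        return a.rec.starting_pos > b.rec.starting_pos;  // min-heap on position
    };
    std::priority_queue<head, std::vector<head>, decltype(cmp)> heap(cmp);
    std::vector<std::ifstream> in;
    for (const auto& r : runs) in.emplace_back(r, std::ios::binary);
    for (size_t i = 0; i != in.size(); ++i) {
        extraction_request rec;
        if (in[i].read(reinterpret_cast<char*>(&rec), sizeof(rec))) heap.push({rec, i});
    }
    std::vector<extraction_request> out;
    while (!heap.empty()) {
        head h = heap.top();
        heap.pop();
        out.push_back(h.rec);
        extraction_request rec;
        if (in[h.run].read(reinterpret_cast<char*>(&rec), sizeof(rec))) heap.push({rec, h.run});
    }
    return out;
}
```

The point of the pattern is that RAM holds only one sort buffer at a time, while the merged stream comes out globally sorted by `starting_pos`.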

The access pattern over strings is now monotonically
non-decreasing, which is the precondition for moving strings
itself out of RAM in a follow-up change. Correctness verified
via `sshash build --check` on salmonella_enterica (m=7),
canonical mode, and salmonella_100 with 4 threads.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The strings bit-vector is the largest in-RAM structure during
construction. This commit moves it onto disk for the build
phase, replacing bits::bit_vector::builder with a new
disk_backed_strings storage:

  - append_bits() during step 1 (encode_strings) writes
    completed words to a tmp file, keeping only a small write
    window in RAM (~512 KiB by default).

  - After freeze(), make_reader() returns a forward-monotonic
    reader over the file with a small read window (~512 KiB).
    Each reader owns its own ifstream so multiple threads can
    read concurrently without contention on the writer.

  - The reader exposes get_word64(pos) const, matching the
    interface that kmer_iterator<Kmer, BitVector> expects.

Wiring:
  - compute_minimizer_tuples (step 2): each thread instantiates
    its own reader; per-thread RSS contribution is bounded by
    its window size, not by the strings size.
  - build_sparse_and_skew_index (step 7.1): the in-RAM
    d.m_spss.strings is no longer populated here; step 7.2
    phase (B) reads via a reader instead.
  - A new step 8 materializes d.m_spss.strings from the on-disk
    file, immediately before the standard essentials::save path.
    This brings strings briefly back into RAM at the very end of
    the build; eliminating that final peak requires a streaming
    save path, which is a separate change.

Cleanup: the dictionary_builder destructor removes the strings
tmp file if it still exists (covers exception paths). The hash
in step 5's RAM_taken_in_bytes calculation no longer counts
strings (its in-RAM footprint is now just the writer window).
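The core of the `append_bits()` mechanism described above can be sketched as a word-spilling bit writer. This is a simplified illustration under our own naming; the real disk_backed_strings additionally provides the windowed reader side, and the window size and file layout here are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch: pack appended bits into 64-bit words; completed words go into a
// bounded in-RAM window that is spilled to a tmp file when full, so RSS is
// bounded by the window, not by the total number of bits written.
struct disk_bit_writer {
    std::ofstream out;
    std::vector<uint64_t> window;  // bounded write window
    uint64_t cur = 0;              // word currently being filled
    uint64_t bits_in_cur = 0;      // valid low bits in `cur` (< 64)
    uint64_t num_bits = 0;
    size_t window_capacity;

    disk_bit_writer(const std::string& path, size_t window_words)
        : out(path, std::ios::binary), window_capacity(window_words) {}

    // Append the low `len` bits of `bits` (caller zeroes bits above len).
    void append_bits(uint64_t bits, uint64_t len) {
        assert(len <= 64);
        cur |= (len ? bits << bits_in_cur : 0);
        if (bits_in_cur + len >= 64) {
            push_word(cur);
            uint64_t used = 64 - bits_in_cur;       // bits consumed into cur
            cur = (used < len) ? bits >> used : 0;  // leftover high bits
            bits_in_cur = bits_in_cur + len - 64;
        } else {
            bits_in_cur += len;
        }
        num_bits += len;
    }

    void freeze() {  // flush the partial word and the window
        if (bits_in_cur) { push_word(cur); cur = 0; bits_in_cur = 0; }
        flush_window();
        out.flush();
    }

private:
    void push_word(uint64_t w) {
        window.push_back(w);
        if (window.size() == window_capacity) flush_window();
    }
    void flush_window() {
        out.write(reinterpret_cast<const char*>(window.data()),
                  static_cast<std::streamsize>(window.size() * sizeof(uint64_t)));
        window.clear();
    }
};
```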

Verified via `sshash build --check` on:
  - salmonella_enterica m=7 (heavy skew index, regular and
    canonical),
  - salmonella_100 m=11 -t 4 (multi-thread, 13M kmers).
Save -> load -> query roundtrip on the same dataset matches
100% on the canonical query set.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Introduces a streaming-save build path that writes the entire
dictionary to disk without ever materializing the strings
bit-vector in RAM. Combined with the disk-backed strings
storage from the previous commit, this means the strings live
exclusively on disk for the lifetime of the build, eliminating
the final RAM peak that step 8 previously reintroduced just
before essentials::save.

Mechanism:
  - disk_backed_strings::save_to(ostream&) emits the same byte
    layout as bits::bit_vector's serialization (uint64_t
    m_num_bits, size_t n, then n*8 bytes), reading the words
    from the tmp file in 64 KiB chunks.

  - streaming_strings_saver wraps essentials::generic_saver and
    overrides visit() for bits::bit_vector: when the visited
    instance matches a known address (d.m_spss.strings), the
    streaming serializer is invoked; everything else goes
    through the normal essentials path. Address matching avoids
    introducing a marker type into bits::bit_vector.
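The header-then-chunked-payload idea behind `save_to` can be sketched as below. The exact field layout of `bits::bit_vector`'s serialization is assumed from the description above (num_bits, word count, then raw words); the function name and chunk handling are ours.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: emit a bit-vector-style header followed by the raw words,
// copied from a tmp file in fixed-size chunks so the payload is never
// fully resident in RAM.
static void stream_save_bits(const std::string& tmp_words_file, uint64_t num_bits,
                             std::ostream& out, size_t chunk_bytes = 64 * 1024) {
    uint64_t n = (num_bits + 63) / 64;  // number of 64-bit words
    out.write(reinterpret_cast<const char*>(&num_bits), sizeof(num_bits));
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    std::ifstream in(tmp_words_file, std::ios::binary);
    std::vector<char> chunk(chunk_bytes);
    uint64_t remaining = n * 8;
    while (remaining != 0) {
        size_t take = remaining < chunk.size() ? static_cast<size_t>(remaining) : chunk.size();
        in.read(chunk.data(), static_cast<std::streamsize>(take));
        out.write(chunk.data(), static_cast<std::streamsize>(take));
        remaining -= take;
    }
}
```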

API:
  - dictionary_builder::build() (existing) materializes strings
    in RAM at the end so the dictionary is query-ready
    (--check, etc.).
  - dictionary_builder::build_streaming_save() runs steps 1-7
    and stream-saves directly to the output file, leaving
    d.m_spss.strings empty. The dictionary is *not*
    query-ready after this; reload from disk to query.
  - dictionary::build_streaming_save() exposes the new flow.

tools/build.cpp uses the streaming-save path automatically when
-o is given without --check; otherwise it falls back to the
materializing path so --check can run queries against d.

Verified by building the same input via both paths and
diff'ing the output file: byte-identical for regular,
--canonical, --weighted, and 4-thread builds. Loading the
streaming-saved file via essentials::load and running the
standalone `sshash check` and `sshash query` tools
returns "EVERYTHING OK!" and 100% positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previously step 7.1 copied every minimizer tuple of every
non-singleton bucket into an in-RAM vector `tuples`, just to
give bucket_type a contiguous backing store. The mmap'd `input`
already provides exactly that, so the copy was pure overhead
(~18 B per super-kmer in non-singleton buckets, scaling with
the input).

Now bucket_type stores raw minimizer_tuple pointers into
input.data() directly, and `input` is kept open through step
7.2 phase (A) (which is the last consumer). After phase (A)
both `buckets` and `input` are released.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
…ry MPHF

Phase C used to materialize the full per-skew-partition working
set in RAM:
  - kmers vector (~16 B/kmer, default uint64_t-backed kmer)
  - positions_in_bucket vector (~4 B/kmer)
  - cvb_positions builder (~num_bits_per_pos / 8 B/kmer, the
    actual stored output)

For a partition with N kmers this peaked at ~21 N bytes (e.g.
20 GB for a 1 B-kmer partition). The kmers and
positions_in_bucket vectors were redundant in-RAM copies of
data already on disk in the per-partition tmp file written by
phase (B).

This commit replaces them with two streaming passes over the
tmp file:

  (1) MPHF build via pthash's `build_in_external_memory` driven
      by a small forward iterator (`skew_partition_kmer_iterator`)
      that reads `(kmer.bits, pos_in_bucket)` records via a
      shared_ptr<ifstream>. pthash spills hashes to
      `tmp_dirname` under a `--ram-limit / 2` RAM budget rather
      than holding all keys + hashes simultaneously.

  (2) A second sequential pass over the same tmp file fills
      cvb_positions: for each `(kmer, pib)` record it sets
      cvb_positions[F(kmer)] = pib.

Only cvb_positions itself stays resident through both passes,
and it's the actual stored output (not a transient).

The iterator must be copyable because pthash's
`build_in_external_memory` takes the iterator by value; the
shared_ptr<ifstream> means copies share the underlying stream
state. After the build call returns the original at the call
site is unused, so the shared advancement is harmless.
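The copyable-by-value contract can be illustrated with a stripped-down iterator; the record type and names here are ours, but the shared_ptr mechanism is the one described above: copies are cheap and all advance the same underlying stream.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <memory>
#include <string>

// Sketch: an input iterator that is copyable because the stream lives
// behind a shared_ptr. Copies share stream position, so advancing any
// copy consumes records for all of them.
struct shared_stream_iterator {
    std::shared_ptr<std::ifstream> in;
    uint64_t value = 0;

    explicit shared_stream_iterator(const std::string& path)
        : in(std::make_shared<std::ifstream>(path, std::ios::binary)) {
        ++(*this);  // load the first record
    }
    uint64_t operator*() const { return value; }
    shared_stream_iterator& operator++() {
        in->read(reinterpret_cast<char*>(&value), sizeof(value));
        return *this;
    }
};
```

Note the asymmetry this creates: each copy caches its own `value`, but the file cursor is shared, which is exactly why the shared advancement is harmless only when the original iterator is not reused after the build call.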

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Removes the last mmap from the build hot path. The merged
minimizers file used to be opened via mm::file_source and
walked twice — once for stats and once to populate a
size-sorted in-RAM `buckets` vector — and then walked again
in step 7.2 phase (A) via pointers into the mmap. The recent
"point bucket_type into the mmap" change pushed even more
work onto the page cache, which violates the user's hard RAM
budget (mmap pages can be resident up to file size when the
machine has enough memory).

This commit replaces all of that with two sequential
std::ifstream passes (no mmap) over the merged minimizers
file:

  Pass 1 (stats): unchanged in spirit. Uses a new
    `streaming_minimizer_bucket_reader` that buffers one
    bucket at a time (peak ~ max_bucket_size * 18 B). Feeds
    `buckets_statistics` exactly as before.

  Pass 2 (combined sparse + heavy + emit kmer requests):
    folds the former step 7.1 main pass and step 7.2 phase
    (A) into a single bucket-by-bucket loop. The size-sorted
    iteration over `buckets` is gone; instead:

      - `begin_buckets_of_size[s]` is precomputed from the
        bucket-size histogram (new accessor
        `buckets_statistics::num_buckets_of_size`),
      - mid_load positions are written via per-size cursors
        using `compact_vector::builder::set` instead of
        `push_back`,
      - heavy_load positions are appended in file order via a
        single monotone cursor,
      - heavy buckets emit kmer-extraction requests in-line.

The `buckets` vector and the entire `tuples` array are gone.
No memory mapping anywhere in step 7. RAM footprint of step
7.1 is now bounded by:
  - max_bucket_size * 18 B (one bucket at a time),
  - the sparse-index builders being assembled (proportional
    to non-singleton positions, bits-packed),
  - the kmer-extraction request buffer (~ ram_limit / 4).

Output bytes differ from the previous commit because the
ordering of positions inside mid_load_buckets and
heavy_load_buckets is now file order instead of size-sorted
order; the codewords are updated to match. The index is
self-consistent: full --check passes on regular, --canonical,
multi-thread (-t 4), and --weighted builds, and a streaming-
save round-trip via `sshash check` and `sshash query`
returns "EVERYTHING OK!" and 100% positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
This commit eliminates the last three mm::file_source usages
in the build, so the only mmap left in step 7 is inside
pthash's external-memory PHF builder (which manages its own
RAM via config.ram).

(1) file_merging_iterator: each input run is now read with a
    bounded buffered std::ifstream (default 4096 records per
    stream) instead of mm::file_source. The winner-tree merge
    logic is unchanged; comparisons just use the in-RAM
    buffer's current value rather than a pointer into mmap'd
    memory. RSS for the merge is now bounded by
    `num_runs * buffer_records * sizeof(T)` regardless of run
    sizes.

(2) minimizers_tuples::merge: the post-rename single-file
    count path used to mmap the merged file via
    mm::file_source and walk it with minimizers_tuples_iterator.
    It now uses streaming_minimizer_bucket_reader (hoisted
    from build_sparse_and_skew_index.cpp into util.hpp) for a
    pure ifstream pass.

(3) dictionary_builder::build_mphf: replaces mm::file_source
    + minimizers_tuples_iterator with a new
    streaming_minimizers_iterator that yields each distinct
    minimizer once via std::ifstream. The iterator is
    copyable (shared_ptr<ifstream>), as required by pthash's
    by-value `build_in_external_memory` signature.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted builds, plus
a streaming-save round-trip (sshash check + sshash query)
returning "EVERYTHING OK!" on all five suites and 100%
positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The strings_offsets_builder used to be an in-RAM
std::vector<uint64_t> of size num_strings + 1, finalized into
d.m_spss.strings_offsets at step 7.1. For inputs with many
strings this is non-trivial RSS (8 B per string), held
through steps 1-7.

This commit replaces it with a disk-backed
disk_backed_offsets_builder<Offsets>:

  - push_back() during step 1 spills to a tmp file under a
    small in-RAM write buffer (~32 KiB),
  - compute_minimizer_tuples (step 2) opens one
    `reader` per thread, positioned at the thread's
    index_begin, and walks forward sequentially via a
    bounded read buffer (~32 KiB / thread). The per-thread
    `[i]`, `[i+1]` access pattern is replaced with a single
    rolling `prev_offset = next()` cursor.
  - encode(offset, begin, string_id) is now a pure const
    method on the disk-backed builder (depends only on
    m_nb and m_size), preserving multi-threaded safety.
  - build(target) at step 7.1 streams the file's contents
    via a copyable forward iterator into target.m_seq's
    encode/build, so neither side materializes the offsets
    in RAM. The on-disk file is removed after build.

To keep `target.m_seq` accessible from the external builder
without exposing it broadly, `offsets<Seq>` befriends
`disk_backed_offsets_builder<...>` via a templated friend
declaration; the concrete decoded_offsets / encoded_offsets
types inherit that friendship.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, and a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The four largest in-RAM "stored output" structures during the
build are the sparse-index compact_vectors (control_codewords,
mid_load_buckets, heavy_load_buckets) and the per-skew-partition
cvb_positions. For huge inputs these can each be many GB. This
commit makes the build never materialize them in RAM:

  - new streaming_compact_vector_writer: writes the same byte
    layout as bits::compact_vector::visit_impl (size, width,
    mask, owning_span<u64>) directly to a file via a 2-word
    rolling window, accepting set(index, value) in monotonic
    index order. Matches the +1 padding word that the in-RAM
    builder allocates so on-disk bytes are identical.

  - control_codewords: indices are mphf(minimizer), and the
    merged minimizers file is sorted by mphf hash, so writes
    during the combined pass are strictly monotonic. Streamed
    directly via the writer; no external sort.

  - heavy_load_buckets: single monotone cursor, also streamed
    directly via the writer.

  - mid_load_buckets: per-size cursors interleave across the
    pass, so each size class's positions are written to its
    own raw-uint64 tmp file (monotonic within size). After the
    combined pass, the per-size files are streamed in size
    order through a streaming_compact_vector_writer to assemble
    the final mid_load_buckets file.

  - cvb_positions per skew partition: writes are random by
    F(kmer). After the partition's MPHF is built, a second
    pass over the partition's kmer file emits
    (F(kmer), pos_in_bucket) tuples through the existing
    parallel_sort + flush + file_merging_iterator external-sort
    machinery; the sorted stream feeds a per-partition
    streaming_compact_vector_writer.
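The rolling-window packing at the heart of the streaming writer can be sketched as follows. This is a simplified model under our own names: values of fixed bit width, accepted in monotonically increasing index order, are packed into 64-bit words, and a word is emitted as soon as no future value can touch it. We emit into a vector here for testability; the real writer streams to a file and matches the exact compact_vector byte layout, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: two-word rolling window for monotonic set(index, value).
// Values are assumed pre-masked to `width` bits.
struct streaming_packer {
    uint64_t width;
    uint64_t w0 = 0, w1 = 0;    // rolling window
    uint64_t base_word = 0;     // word index of w0
    std::vector<uint64_t> out;  // stand-in for the output file

    explicit streaming_packer(uint64_t width_bits) : width(width_bits) {}

    void set(uint64_t index, uint64_t value) {  // indices must be monotonic
        uint64_t bit = index * width;
        uint64_t word = bit / 64, off = bit % 64;
        while (word > base_word) {  // w0 can no longer change: emit it
            out.push_back(w0);
            w0 = w1;
            w1 = 0;
            ++base_word;
        }
        w0 |= value << off;
        if (off + width > 64) w1 |= value >> (64 - off);  // straddles into next word
    }
    void flush() {  // emit the window tail
        out.push_back(w0);
        out.push_back(w1);
    }
};
```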

The streaming saver is extended to take a substitution map
keyed by `bits::compact_vector const*`. dictionary_builder
populates the dictionary's compact_vector slots with empty
placeholders, takes their addresses, and registers each
spilled tmp file. The save pass copies bytes from each tmp
file at the matching visit slot.

For the materializing `build()` flow (used by --check), a new
`materialize_spilled_into(d)` step re-loads each spilled tmp
file back into the dictionary's in-RAM compact_vectors via
essentials::loader, so queries work afterward. This brings
the RAM peak back briefly at the very end (acceptable since
--check inherently needs the full index in RAM).

`build_streaming_save()` never materializes; the spilled tmp
files are concatenated into the output by the saver and then
removed.

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, plus a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
No tmp file leaks.

Remaining proportional-to-input items in RAM during build:
the codewords MPHF (step 4) and the per-skew-partition MPHFs
(step 7.2 phase C). pthash returns these as in-memory
structs; spilling them would require pthash changes or an
intermediate save/load step. Everything else is now bounded
by the explicit --ram-limit.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The accumulating in-RAM MPHFs were the dominant resident
memory at the end of phase C. For the HPRC k=63 m=31 canonical
benchmark (5.9 B kmers, ~5.4 B in skew at ~3 bits/key) they
sum to ~2 GB held simultaneously through save.

This commit spills them to disk and concatenates at save:

  - Step 5 (hash_minimizers): the codewords MPHF is no longer
    needed after the minimizer values are remapped. Save it to
    a tmp file via essentials::save and default-assign the
    in-RAM struct to free it.

  - Step 7.2 phase C: after each partition's cvb_positions has
    been written to disk, save that partition's MPHF to a tmp
    file and default-assign it. Subsequent partitions never
    coexist with prior partitions' MPHFs.

  - The streaming saver gains an address+type-keyed
    substitution map (typed_address_sub). Type discrimination
    is necessary because in C++ a struct's address coincides
    with the address of its first member when the struct has
    standard layout, so address alone is ambiguous (visiting
    sparse_and_skew_index would otherwise match a substitution
    registered for its first member's first member, the
    codewords MPHF).

  - dictionary_builder uses a `register_sub<T>` helper that
    captures the type via typeid for each registration. The
    saver only fires the substitution when both address and
    std::type_index(typeid(T)) match.

  - Materializing build() flow loads each spilled MPHF back
    via essentials::loader so queries work afterward.
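The address-ambiguity problem and its type-discriminated fix can be demonstrated concretely. The names `typed_address_sub` and `register_sub` follow the commit; the body below is our own minimal sketch, mapping each (address, type) key to a tmp-file name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Sketch: substitutions keyed by (pointer, std::type_index). Address
// alone is ambiguous because a standard-layout struct shares its address
// with its first member; pairing with the type disambiguates.
struct typed_address_sub {
    std::map<std::pair<const void*, std::type_index>, std::string> subs;

    template <typename T>
    void register_sub(const T* addr, std::string tmp_file) {
        subs[{addr, std::type_index(typeid(T))}] = std::move(tmp_file);
    }
    template <typename T>
    const std::string* find(const T* addr) const {
        auto it = subs.find({addr, std::type_index(typeid(T))});
        return it == subs.end() ? nullptr : &it->second;
    }
};
```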

Verified byte-identical output vs the previous commit on
salmonella_enterica m=7, plus full --check on regular,
--canonical, multi-thread (-t 4), and --weighted, plus a
streaming-save round-trip (sshash check + sshash query) with
all five "EVERYTHING OK!" suites and 100% positive matches.
No tmp file leaks.

After this commit, the proportional-to-input items in RAM
during build are bounded by --ram-limit:
  - pthash external-memory builder (capped at ram_limit/2),
  - one current MPHF being built (~bits/key * partition_size),
  - per-step external-sort buffers (capped),
  - the stored compact_vectors, MPHFs, etc. all spill.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
pthash's `partitioned_phf::build` builds the partitioned MPHF's
sub-partitions in parallel, with `mphf_build_config.num_threads`
sub-partitions running simultaneously. Each per-partition build
allocates a `pairs_t` of ~`avg_partition_size * 16 B` (~48 MB
with the default avg_partition_size = 3M); with -t 64 that
balloons to ~3 GB of pthash internal memory, dominating the
build's RSS regardless of how aggressively sshash spills its
own structures.

Add `util::cap_mphf_num_threads(requested, ram_limit_in_GiB)`:
budget ~ram_limit/4 GiB for pthash's per-partition build memory
and conservatively assume 64 MiB per parallel sub-partition
(48 MB pairs_t + sort temporary + slack). Apply at the two
pthash build sites (step 4 codewords MPHF, step 7.2 phase C
skew partition MPHFs).

For -g 2 -t 64 this caps pthash to 8 threads (1 GiB pthash
budget over 64 MiB per thread = 16, further reduced by the /4
budget fraction to 8). Build time may increase modestly; peak RSS
should drop well below the previous ~3.3 GB toward the 2 GiB
target.

Verified byte-identical output on salmonella_enterica m=7,
plus full --check on regular, --canonical, multi-thread (-t 4),
and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The previous cap formula (ram_limit/4) was too aggressive — it
fired even at default -g 8 -t 64 where capping isn't needed.
That silently changed pthash's parallelism on configurations
the user didn't intend to be tight, breaking expectations of
"-t means t threads".

Loosen the budget to ram_limit/2: leave roughly half of
--ram-limit available to pthash's parallel sub-partition build
memory (the other half covers the streaming buffers, external
sort buffers, etc.). Cap pthash threads only when the user's
-t would push pthash past that half-budget. Common cases now
pass through unchanged:

  -g 8  -t  64    no cap (4 GiB / 64 MiB = 64 >= 64)
  -g 4  -t  32    no cap
  -g 2  -t  16    no cap
  -g 16 -t 128    no cap
  -g 2  -t  64    cap to 16 (pathological: tight budget vs many threads)
  -g 4  -t  64    cap to 32

When the cap fires, log a clear warning naming the MPHF, the
requested thread count, the cap, and the budget. Verbose mode
only.

Verified byte-identical output and full --check matrix still
pass.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previous approach quietly capped pthash's num_threads under
tight --ram-limit, which broke the user's expectation that "-t
N means N threads". The actual lever pthash exposes for
per-thread memory is `avg_partition_size`: per-partition build
memory is roughly `avg_partition_size * sizeof(pair)` bytes.

Replace the unconditional thread cap with a configuration that
divides the pthash RAM budget (half of --ram-limit) evenly
across the user's requested threads, then derives an
`avg_partition_size` for that per-thread budget:

  - per_thread_budget = (ram_limit / 2) / num_threads
  - avg = per_thread_budget / per_key_estimate (32 B)
  - if avg >= default (3M):       use default; -t honored
  - elif avg >= floor (100K):     use avg; -t honored
  - else:                         cap -t so floor fits, warn

The warning only fires in the pathological case (so many
threads at so little RAM that even pthash's quality floor
can't fit). Common configurations pass through unchanged:

  -t 1   -g 8     no change (default partition 3M)
  -t 64  -g 8     -t honored, partition 2M
  -t 64  -g 4     -t honored, partition 1M
  -t 64  -g 2     -t honored, partition 524K
  -t 256 -g 1     warn + cap to 167 (-t couldn't be honored)
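The derivation above can be written out as straight-line arithmetic. The constants (3M default, 100K floor, 32 B per key, half of --ram-limit as the pthash budget) are taken from the commit text; the function name and return shape are our own sketch, not the sshash API.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the per-thread partition-size derivation described above.
struct mphf_tuning {
    uint64_t num_threads;
    uint64_t avg_partition_size;
    bool capped;
};

static mphf_tuning tune_mphf(uint64_t requested_threads, uint64_t ram_limit_bytes) {
    const uint64_t kDefault = 3'000'000;  // pthash default avg_partition_size
    const uint64_t kFloor = 100'000;      // quality floor
    const uint64_t kBytesPerKey = 32;     // per-key build-memory estimate
    uint64_t per_thread_budget = (ram_limit_bytes / 2) / requested_threads;
    uint64_t avg = per_thread_budget / kBytesPerKey;
    if (avg >= kDefault) return {requested_threads, kDefault, false};  // -t honored
    if (avg >= kFloor) return {requested_threads, avg, false};         // -t honored
    // pathological: cap -t so the quality floor still fits, and warn
    uint64_t capped = (ram_limit_bytes / 2) / (kFloor * kBytesPerKey);
    if (capped == 0) capped = 1;
    return {capped, kFloor, true};
}
```

With these constants the sketch reproduces the table above, including the -t 256 -g 1 cap to 167.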

Verified byte-identical output on -t 1 builds (the default
single-thread case) and full --check matrix on regular,
--canonical, multi-thread, and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Below ~4 GiB the streaming buffers + pthash's internal
working memory (which we don't fully control) can't
practically be made to fit; squeezing further has diminishing
returns. Rather than degrade the build into ever-tinier
buffers and ever-smaller pthash partition sizes, treat 4 GiB
as the effective floor — a modest requirement on today's
desktops.

Add `constants::min_ram_limit_in_GiB = 4` and apply it in the
single validation/normalization step at the build entrypoint
(both `dictionary::build` and `dictionary::build_streaming_save`).
A NOTE is printed in verbose mode whenever the user's `-g` is
raised. Configurations at or above the floor are unaffected.

Verified byte-identical output on the default build, full
--check on regular, --canonical, multi-thread, and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Picks up two commits on the pthash branch:

  90ace87  Account for input hash vector in construction memory estimate
  9600827  Drop the redundant in-search hashes term and the comment

The fix corrects pthash's `estimate_num_bytes_for_construction`,
which underestimated per-partition residency by `num_keys *
sizeof(hash_type)` (= 16 B/key for hash128) and by an extra
`num_keys * 8 B` double-counted "in-search hashes" term. With
the corrected estimate, pthash's `bytes < config.ram` flush
gate in the parallel partitioned-PHF build path actually
matches residency, so `mphf_build_config.ram = ram_limit/2`
will now bind pthash's parallel build batch to that budget on
inputs like HPRC k=63 m=31 canonical (where the previous
underestimate had pthash's batch peak around 1.5x the
configured budget).

Verified byte-identical SSHash output before/after the bump
on salmonella_enterica m=7 (the change is purely a
memory-accounting fix). Full --check matrix passes.

When pthash master receives this fix we'll bump again to
that tip; for now we point at the branch tip directly so the
HPRC benchmark can validate the new RSS bound.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
After the pthash memory-estimate fix, pthash's parallel-build
batch correctly binds at `ram_limit/2`. The remaining excess
on the HPRC k=63 m=31 -g 4 -t 64 benchmark was sshash's own
buffers (each currently sized at ram_limit/4) piled on top:

  - step 5 minimizer-tuples buffer: up to RAM_available/3
    (~1.1 GB at -g 4) — and freed but typically retained by
    glibc, so it lingers in process RSS through subsequent
    steps.
  - step 7.1 kmer-extraction request buffer: ram_limit/4
    (1 GiB at -g 4).
  - step 7.2 phase C position-tuple buffer: ram_limit/4
    (1 GiB at -g 4) — alive concurrently with pthash's
    parallel-build memory and the partition's MPHF.

Halve them all (cap at ram_limit/8 = 512 MiB at -g 4). The
external sorts get more flush rounds and slightly more disk
I/O, but peak RSS during the heaviest step (7.2 phase C)
drops by roughly:

  pthash 2 GiB  +  pos buffer 0.5 GiB  +  partition MPHF 0.3 GiB
  +  step-5 lingering 0.5 GiB  ≈  3.3 GiB

vs the prior ~4.8 GiB observed peak. Should fit comfortably
under -g 4.

Verified byte-identical SSHash output on salmonella_enterica
m=7 (more flush rounds = more intermediate run files but the
sort+merge is order-stable for the bytes we care about).

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
PR #97 ("fix pthash memory estimate") landed in pthash master.
This bump moves us from the (now-merged) branch tip 9600827
to the master squash-merge commit a95e814. The substantive
content of `internal_memory_builder_single_phf.hpp` is
byte-identical between the two, so SSHash's behavior is
unchanged from f68fa77.

This unpins us from the development branch URL and lets us
track pthash master going forward.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
The build's various ifstream-backed forward readers all
followed the same pattern (open file, fixed-capacity record
buffer, refill on exhaust, has_next/current/advance interface).
Five separate copies of that loop existed:

  - file_merging_iterator::buffered_stream      (per merge run)
  - disk_backed_offsets_builder::reader         (per-thread offsets)
  - disk_backed_offsets_builder::full_iterator  (encode/build)
  - streaming_minimizers_iterator               (codewords MPHF input)
  - streaming_minimizer_bucket_reader           (step 7.1 buckets)
  - skew_partition_kmer_iterator                (skew MPHF input)

Hoist the common primitive into a new
`include/builder/buffered_record_stream.hpp`:

    template <typename T>
    struct buffered_record_stream {
        void open(filename, buffer_records, start_byte = 0);
        void close();
        bool empty() const;
        T const& current() const;
        void advance();
    };

and reuse it from each of the five readers. The pthash-iterator
ones (full_iterator, streaming_minimizers_iterator,
skew_partition_kmer_iterator) wrap a shared_ptr to keep the
copyable-by-value contract pthash expects.
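A minimal implementation matching the interface sketched above could look like this. It follows the shape the commit describes (fixed-capacity record buffer, refill on exhaust), but it is our own sketch, not the sshash source.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch: buffered forward reader over a file of fixed-size records.
// RSS is bounded by buffer_records * sizeof(T) regardless of file size.
template <typename T>
struct buffered_record_stream {
    void open(const std::string& filename, size_t buffer_records, uint64_t start_byte = 0) {
        m_in.open(filename, std::ios::binary);
        m_in.seekg(static_cast<std::streamoff>(start_byte));
        m_buf.resize(buffer_records);
        refill();
    }
    void close() { m_in.close(); }
    bool empty() const { return m_pos == m_end; }
    T const& current() const { return m_buf[m_pos]; }
    void advance() {
        if (++m_pos == m_end) refill();  // refill on exhaust
    }

private:
    void refill() {
        m_in.read(reinterpret_cast<char*>(m_buf.data()),
                  static_cast<std::streamsize>(m_buf.size() * sizeof(T)));
        m_end = static_cast<size_t>(m_in.gcount()) / sizeof(T);
        m_pos = 0;  // m_end == 0 at EOF, so empty() becomes true
    }
    std::ifstream m_in;
    std::vector<T> m_buf;
    size_t m_pos = 0, m_end = 0;
};
```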

While here, define a packed `skew_kmer_record_t<Kmer>` for the
phase B/C tmp file format and use it both at the writer and the
buffered reader, so the on-disk record layout is in one place.

Verified byte-identical SSHash output on salmonella_enterica
m=7 and full --check on regular, --canonical, multi-thread
(-t 4), and --weighted.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Currently the build CLI picks between two paths:

  -o without --check         -> dictionary::build_streaming_save
                                (spilled components are stitched
                                into the output via the streaming
                                saver; `dict` is not query-ready
                                afterward)
  -o with --check (or no -o) -> dictionary::build
                                (spilled components are
                                materialized back into `dict`,
                                then optionally essentials::save)

For users with plenty of RAM who don't want the streaming-save
tmp-file concatenation (and don't need --check), expose the
in-RAM save path explicitly via --no-streaming-save. When set,
the build does build() + essentials::save: peak RSS at save time
briefly equals the in-RAM index size, but the save is a single
pass over `dict` rather than a stitched concatenation. Useful
when the user already pays the memory cost (e.g., to query the
dict immediately afterward in another tool, or just prefers the
simpler save path).

Both flows produce byte-identical output files; the flag only
affects the save path.

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
Previously the streaming-save flow skipped print_space_breakdown
(d.m_spss.strings is an empty placeholder there, so calling the
breakdown would just print zeros), which meant the user got no
size summary at the end; master, by contrast, always prints a
"total ... bits/kmer" line.

Stat the saved file's size when `d` isn't materialized and
print the total directly. The in-RAM flow keeps the existing
per-component breakdown unchanged. Cost is one fstat at the
very end of the build.

Also fixes index_size_in_bytes in the JSON-line build_stats
output for the streaming-save flow (it used to report just the
in-RAM-resident bytes ≈ a few hundred MB instead of the actual
on-disk index size).

https://claude.ai/code/session_01BShS2GDASvEsCAbgJyQVBK
All build-step durations in the JSON-line build_stats output were raw
microseconds, which was hard to read. Convert them to "X.XXX [sec]" via
a small `musec_as_seconds_str` helper. Steps 7.1 and 7.2 are added
directly inside build_sparse_and_skew_index.cpp, so apply the same
helper there too.

Also switch finalize_stats to std::filesystem::file_size for the saved
index size, instead of an fstream + tellg.
These two structs in include/builder/util.hpp were the in-RAM iterator
types that walked an mmap'd minimizer-tuples buffer. The build pipeline
now reads minimizers via streaming_minimizers_iterator and
streaming_minimizer_bucket_reader (both built on buffered_record_stream),
and nothing else referenced bucket_type or minimizers_tuples_iterator.

Also drop the now-stale comment references to minimizers_tuples_iterator
in the surviving streaming iterators' docstrings.
The in-RAM build path now duplicated work: it materialized all spilled
components back into RAM after step 7 just so --check could query the
in-memory dict. Drop it. Make streaming-save the only build path:

* dictionary::build now takes output_filename; the streaming variant is
  gone and there is no longer a query-ready in-memory build.
* dictionary_builder::build_streaming_save -> build, and the
  materialize_spilled_into / materialize_compact_vector_from_file
  helpers are removed (~70 LOC).
* finalize_stats no longer branches on whether `d` is populated.
* tools/build.cpp always streams to a file. If the user passed -o, that
  path is used; otherwise a tmp file under tmp_dirname is written and
  removed on exit. --check loads the saved file via open_dictionary
  with mmap=true and runs the existing correctness checks against that.
Walks through the eight build steps, what each one produces, and where
each intermediate lives between steps. Documents the two mechanisms that
together cap peak RSS at --ram-limit:

* every input-size-scaling intermediate is spilled to disk,
* every working buffer is sized as a fixed fraction of ram_limit, with
  the fractions tabulated.
The doc referred to --ram-limit and --tmp-dirname; the actual short
flags exposed by sshash build are -g (RAM limit in GiB) and -d (tmp
dir). The pthash "ram" reference was a programmatic config field, not
a CLI flag — clarified.
@jermp
Owner Author

jermp commented May 6, 2026

@copilot resolve the merge conflicts in this pull request

…p version to 5.1.1

Co-authored-by: jermp <10065674+jermp@users.noreply.github.com>
Contributor

Copilot AI commented May 6, 2026

@copilot resolve the merge conflicts in this pull request

Resolved in commit 0002789. The only conflict was in include/builder/util.hpp — the alt-build branch had replaced the old pointer-based minimizers_bucket_reader with the new streaming approach, while master had fixed a bug in the old pointer-based next_begin(). I kept the streaming version from alt-build. The other files (README.md, include/constants.hpp) auto-merged cleanly.

Copilot finished work on behalf of jermp May 6, 2026 17:44
@jermp jermp merged commit 47b71ea into master May 6, 2026
16 checks passed
@jermp jermp deleted the alt-build branch May 6, 2026 17:52