Merged
33 commits
9007363
pushed old changes made when travelling in Nepal: thinner bucket_type
jermp Apr 10, 2026
54aa00b
fixed issue with minimizers_tuples_iterator
claude May 4, 2026
259b7fc
step 7.2: stream over strings instead of random access
claude May 4, 2026
63ea2bd
strings: stream to disk during build, read via small windows
claude May 4, 2026
54e98eb
streaming dictionary save (no full strings in RAM at any point)
claude May 4, 2026
e5d2612
step 7.1: drop redundant tuples copy; point bucket_type into mmap
claude May 4, 2026
0301201
step 7.2 phase C: stream per-partition kmers from disk; external-memo…
claude May 4, 2026
5f9ec80
step 7.1 + 7.2 phase A: drop mmap; single ifstream pass
claude May 5, 2026
a6ac614
remove all remaining mmap from the SSHash build path
claude May 5, 2026
fe44326
strings_offsets: stream to disk during build
claude May 5, 2026
27c71e8
spill the step-7 compact_vectors to disk; concatenate at save
claude May 5, 2026
2c73e09
spill the codewords + per-skew-partition MPHFs to disk
claude May 5, 2026
dddee47
cap pthash mphf num_threads by --ram-limit
claude May 5, 2026
e18b9b4
mphf thread cap: only kick in when budget would actually be exceeded
claude May 5, 2026
869f901
mphf: scale avg_partition_size to honor -t; cap only when pathological
claude May 5, 2026
8e4b0d8
clamp --ram-limit to a 4 GiB floor
claude May 5, 2026
f68fa77
bump pthash to claude/fix-pthash-memory-estimate-NhgTI
claude May 6, 2026
c550d53
tighten ram-proportional buffer caps from ram/4 to ram/8
claude May 6, 2026
365758b
bump pthash submodule to master tip a95e814
claude May 6, 2026
526b64b
clang format
jermp May 6, 2026
b3e49c9
factor out buffered_record_stream<T>; remove duplicated read loops
claude May 6, 2026
a35c364
build: add --no-streaming-save flag for in-RAM save path
claude May 6, 2026
d7dc21d
Revert "build: add --no-streaming-save flag for in-RAM save path"
claude May 6, 2026
d832851
finalize_stats: print total bits/kmer also in streaming-save flow
claude May 6, 2026
030f1d0
build_stats: format step timings as seconds with [sec] unit
claude May 6, 2026
659b956
remove dead bucket_type / minimizers_tuples_iterator
claude May 6, 2026
1d76541
minor
jermp May 6, 2026
033d59f
always build via streaming-save; mmap the saved file for --check
claude May 6, 2026
8abeb70
minor
jermp May 6, 2026
5aff9f7
docs: add build-algorithm.md describing the streaming build pipeline
claude May 6, 2026
730f0ad
docs(build-algorithm): use real CLI flag names (-g, -d)
claude May 6, 2026
0002789
Merge origin/master into alt-build: resolve conflict in util.hpp, bum…
Copilot May 6, 2026
f19ffcc
docs(build-algorithm): rephrase 'O(buffer)' as 'proportional to the b…
claude May 6, 2026
227 changes: 227 additions & 0 deletions build-algorithm.md
# SSHash build algorithm

This note describes how `sshash build` constructs a dictionary while keeping
peak resident memory bounded by the user-supplied `-g` (in GiB).

The design has two ideas, applied uniformly:

1. **Spill, don't accumulate.** Every intermediate that grows with the input
size is written to a tmp file under `-d` (tmp dir) rather than held in a
`std::vector` / bit-vector in RAM. Producers append through a small write
buffer; consumers re-read through a small read buffer
(`buffered_record_stream<T>`).
2. **Cap working buffers at a fraction of `-g`.** Buffers that live
   only inside one step are sized as `ram_limit_in_GiB · GiB / N` (with `N`
   typically 2 or 8). The constants are picked so that even when several
   buffers are alive at the same time across overlapping steps, their sum
   stays under the user budget, with enough slack left to absorb heap
   fragmentation across step transitions.

The build never materializes the final index in RAM. Instead, step 8
streams it directly to the user-supplied output file (or a tmp file, deleted
on exit, when the user did not pass `-o`). To run `--check`, `tools/build.cpp`
mmaps the saved file and runs the correctness queries against the mmap'd
dictionary.

---

## Pipeline overview

The orchestration is in `include/builder/dictionary_builder.hpp`
(`run_steps_1_through_7` + `build`). Per-step details are in
`src/builder/{encode_strings,compute_minimizer_tuples,build_sparse_and_skew_index}.cpp`.

| Step | What it produces | Where it lives between steps |
|------|------------------|------------------------------|
| 1 | Encoded `strings` bit-vector + `strings_offsets` | tmp files (`disk_backed_strings`, `disk_backed_offsets_builder`) |
| 1.1 | Compressed weights (only if `--weighted`) | `weights::builder` (in-RAM, bounded by run-length structure) |
| 2 | Per-thread sorted runs of `minimizer_tuple` | tmp files, one per flushed buffer |
| 3 | Single sorted run of all `minimizer_tuple`s | tmp file (k-way external merge) |
| 4 | Minimizers MPHF F | tmp file (spilled at end of step 5) |
| 5 | Minimizer values replaced by F(minimizer); buffers re-flushed in F-order | new sorted runs, tmp files |
| 6 | Single sorted run keyed by F(minimizer) | tmp file |
| 7.1 | Sparse-index components (`control_codewords`, `mid_load_buckets`) | tmp files |
| 7.2 | Skew-index components (`heavy_load_buckets`, per-partition MPHFs and `positions`) | tmp files |
| 8 | Final on-disk index file | streamed to output, tmp files removed |

After step 8 the dictionary object `d` is **not** query-ready: the spilled
components were copied into the output file but never read back into `d`.
`finalize_stats` reports `index_size_in_bytes` via `std::filesystem::file_size`
on the saved path.

---

## Step 1 — encode strings (`encode_strings.cpp`)

Iterates the input FASTA, producing the 2-bit-packed `strings` bit-vector
and the `strings_offsets` array (one offset per sequence + a sentinel).
Both go through disk-backed builders:

- **`disk_backed_strings`**: appends 2-bit characters into a small in-RAM
word buffer; flushes the buffer to a tmp file when full.
- **`disk_backed_offsets_builder<Offsets>`**: appends one `uint64_t` offset
per sequence into a small write buffer; flushes to a tmp file.

In-RAM footprint of step 1 is proportional to the buffer size, regardless of
input size.
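The flush-when-full pattern shared by both builders can be sketched as
follows. This is a minimal illustration, not SSHash's actual classes; the
name `buffered_writer` is invented here:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Minimal sketch of a disk-backed builder's write side: append records into
// a fixed-capacity in-RAM buffer and flush it to the tmp file when full, so
// resident memory stays proportional to `capacity` regardless of how much
// is appended.
template <typename T>
struct buffered_writer {
    buffered_writer(std::string const& path, size_t capacity)
        : m_out(path, std::ofstream::binary), m_capacity(capacity) {
        m_buf.reserve(capacity);
    }
    void push_back(T const& x) {
        m_buf.push_back(x);
        if (m_buf.size() == m_capacity) flush();
    }
    void flush() {  // also called once at the end of the build step
        m_out.write(reinterpret_cast<char const*>(m_buf.data()),
                    static_cast<std::streamsize>(m_buf.size() * sizeof(T)));
        m_out.flush();
        m_buf.clear();
    }

private:
    std::ofstream m_out;
    std::vector<T> m_buf;
    size_t m_capacity;
};
```

The read side of the same pattern is `buffered_record_stream<T>`, shown in
full in `include/builder/buffered_record_stream.hpp` below.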

## Step 1.1 — weights (optional)

Only runs with `--weighted`. The weights builder uses run-length encoding:
its in-RAM size is proportional to the number of distinct weights, not to
the number of k-mers.

## Step 2 — compute minimizer tuples (`compute_minimizer_tuples.cpp`)

Each thread streams its assigned slice of the input via the disk-backed
strings/offsets readers and emits `minimizer_tuple` records into a private
in-RAM buffer:

```cpp
buffer_size = (ram_limit · GiB) / (2 · sizeof(minimizer_tuple) · num_threads)
```

When the buffer fills, the thread sorts it in parallel and flushes a sorted
run to a tmp file (`minimizers_tuples::sort_and_flush`). The factor of 2
in the denominator leaves headroom for `std::sort`'s allocations and
inter-thread contention; the per-thread split makes the total in-RAM tuple
buffer ≈ `ram_limit / 2`.
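Plugging concrete numbers into the formula above (assuming, for
illustration only, `sizeof(minimizer_tuple) == 16`; the real size may
differ) shows how the per-thread split keeps the total at half the budget:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t GiB = uint64_t(1) << 30;

// Step-2 sizing formula from the text; all inputs are illustrative.
uint64_t step2_buffer_records(uint64_t ram_limit_GiB, uint64_t tuple_bytes,
                              uint64_t num_threads) {
    return ram_limit_GiB * GiB / (2 * tuple_bytes * num_threads);
}
```

With `-g 8`, 16-byte tuples, and 8 threads, each thread gets 33,554,432
records (512 MiB), and the tuple buffers across all threads sum to exactly
`ram_limit / 2` = 4 GiB.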

## Step 3 — k-way external merge (`minimizers_tuples::merge`)

The N tmp files from step 2 are merged into a single sorted run via a
**winner-tree-based external-merge iterator** (`file_merging_iterator<T>`).
Each input file is read through a `buffered_record_stream<minimizer_tuple>`
with `default_buffer_records = 4096` records, so the total in-RAM merge
state is `N · 4096 · sizeof(minimizer_tuple)` bytes, i.e. tens of MB even
when there are many runs. The output is written through a small
`std::ofstream` buffer.

When N == 1 the merge degenerates to a rename + a single streaming scan to
collect bucket statistics; same RAM bound.
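The merge itself can be illustrated with a min-ordered queue over the
current head of each run. SSHash uses a winner tree inside
`file_merging_iterator`; the sketch below substitutes a
`std::priority_queue` and in-memory runs (standing in for the buffered
file streams) to show the same idea:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted runs. On disk, each run would be read through a
// buffered_record_stream and a winner tree would replace the heap.
std::vector<uint64_t> kway_merge(std::vector<std::vector<uint64_t>> const& runs) {
    using head = std::pair<uint64_t, size_t>;  // (current value, run index)
    std::priority_queue<head, std::vector<head>, std::greater<head>> q;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t i = 0; i != runs.size(); ++i) {
        if (!runs[i].empty()) q.emplace(runs[i][0], i);
    }
    std::vector<uint64_t> out;
    while (!q.empty()) {
        auto [v, i] = q.top();  // smallest head across all runs
        q.pop();
        out.push_back(v);
        if (++pos[i] < runs[i].size()) q.emplace(runs[i][pos[i]], i);
    }
    return out;
}
```

Only one record per run is "live" at a time, which is why the per-run RAM
cost reduces to the read-buffer size.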

## Step 4 — build minimizers MPHF

Builds an external-memory partitioned PHF over distinct minimizers, using
pthash's `build_in_external_memory`. The minimizers are streamed from the
sorted run via `streaming_minimizers_iterator` (one buffered ifstream),
and pthash spills its own working hashes under `tmp_dirname` capped by
`mphf_build_config.ram = ram_limit / 2`.

## Step 5 — replace minimizer values with F(minimizer)

The merged file is re-read in fixed-size blocks; each block is hashed in
parallel and re-flushed as a new sorted run. Two RAM caps are combined:

```cpp
RAM_available = ram_limit · GiB − sizeof(F) − offsets_builder.num_bytes()
buffer_unbounded = RAM_available / (3 · sizeof(minimizer_tuple)) // 3× = read+sort scratch+write
buffer_cap = (ram_limit · GiB / 8) / sizeof(minimizer_tuple)
buffer_size = min(buffer_unbounded, buffer_cap)
```

The `/ 8` cap exists because heap pages dirtied by step 5 linger into
later steps' allocations; capping the buffer at one-eighth of the budget
keeps the cumulative RSS under `ram_limit` when steps 6/7 start allocating.

After step 5, the minimizers MPHF F is **spilled to disk** and the in-RAM
copy is freed: subsequent steps only ever use F(minimizer) values, not F
itself.
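A worked example of the two caps (all sizes illustrative: `-g 8`, a
512 MiB MPHF, 256 MiB of offsets, 16-byte tuples) shows the `/ 8` cap
dominating:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint64_t GiB = uint64_t(1) << 30;

// Step-5 sizing from the text: min(budget-derived size, hard ram/8 cap).
uint64_t step5_buffer_records(uint64_t ram_limit_GiB, uint64_t mphf_bytes,
                              uint64_t offsets_bytes, uint64_t tuple_bytes) {
    const uint64_t ram_available = ram_limit_GiB * GiB - mphf_bytes - offsets_bytes;
    const uint64_t unbounded = ram_available / (3 * tuple_bytes);  // read+sort+write
    const uint64_t cap = (ram_limit_GiB * GiB / 8) / tuple_bytes;
    return std::min(unbounded, cap);
}
```

With these inputs the unbounded size would be ~162 M records, but the cap
limits the buffer to 67,108,864 records, i.e. exactly `ram_limit / 8`
bytes of tuples.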

## Step 6 — re-merge in F-order

Same machinery as step 3, applied to the new sorted runs from step 5.

## Step 7.1 — sparse index (`build_sparse_and_skew_index.cpp`)

Constructs `control_codewords` and `mid_load_buckets`. Both are produced as
on-disk `bits::compact_vector` files via `streaming_compact_vector_writer`,
so neither is ever materialized in RAM.

## Step 7.2 — skew index

The most RAM-sensitive step; it has three internal phases, all
disk-backed:

- **Phase B (k-mer extraction requests).** Heavy-bucket entries become
`kmer_extraction_request` records. They are external-sorted by
`starting_pos` so that k-mer extraction reduces to a single forward
scan over `strings`. The request buffer is capped at
`ram_limit / 8 / sizeof(kmer_extraction_request)`; flushed runs are
merged with `file_merging_iterator`.
- **Per-partition kmer files.** While walking `strings` in request-sorted
order, each extracted k-mer is written to its partition's tmp file via
a buffered writer; this file is the input to the partition's MPHF.
- **Phase C (per-partition MPHF + `positions`).** For each skew partition:
1. Build the partition MPHF with pthash external-memory (`ram = ram_limit / 2`,
iterator: `skew_partition_kmer_iterator` over the partition's tmp file).
2. Stream-read the partition file again, emit `(F(kmer), pos_in_bucket)`
tuples; external-sort them in `ram_limit / 8`-sized buffers and merge.
3. Pack the sorted tuples into the partition's `positions`
compact_vector via `streaming_compact_vector_writer`.

Only the freshly-built MPHF for the *current* partition lives in RAM
during phase C; once spilled (`essentials::save`), it is freed before the
next partition starts. `positions` is fully on-disk.
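The capped-buffer external sort used in phases B and C (fill a bounded
buffer, sort it, flush a run, merge the runs) can be sketched in-memory;
in the real build each run is a tmp file and the final merge goes through
`file_merging_iterator`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Sort `in` using runs of at most `cap` records, then merge the runs.
// RAM use of the sort phase is proportional to `cap`, not to in.size().
std::vector<uint64_t> external_sort(std::vector<uint64_t> const& in, size_t cap) {
    std::vector<std::vector<uint64_t>> runs;
    for (size_t i = 0; i < in.size(); i += cap) {
        runs.emplace_back(in.begin() + i, in.begin() + std::min(in.size(), i + cap));
        std::sort(runs.back().begin(), runs.back().end());
    }
    std::vector<uint64_t> out;  // pairwise merge for brevity; the real merge is k-way
    for (auto const& run : runs) {
        std::vector<uint64_t> merged;
        std::merge(out.begin(), out.end(), run.begin(), run.end(),
                   std::back_inserter(merged));
        out.swap(merged);
    }
    return out;
}
```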

## Step 8 — stream-save (`include/builder/streaming_save.hpp`)

The dictionary `d` is walked by `essentials::saver`, but every spilled
component is intercepted via an **address+type-keyed substitution map**
(`typed_address_sub`): when the saver visits a registered (address, type)
pair, it appends the bytes of the corresponding tmp file straight into the
output stream instead of reading from `d`. The strings bit-vector goes
through the same mechanism via `disk_backed_strings`.

Concretely, the registered substitutions are:

| Component | Source tmp file |
|-----------------------------------|----------------------------------------|
| `m_ssi.codewords.control_codewords` | step 7.1 |
| `m_ssi.mid_load_buckets` | step 7.1 |
| `m_ssi.ski.heavy_load_buckets` | step 7.2 phase B |
| `m_ssi.codewords.mphf` | step 5 spill |
| `m_ssi.ski.positions[i]` | step 7.2 phase C, per partition |
| `m_ssi.ski.mphfs[i]` | step 7.2 phase C, per partition |
| `m_spss.strings` | step 1 (`disk_backed_strings`) |

Because the substitutions are by `(address, type)` pair, a struct's address
coinciding with its first member's address does not cause confusion.
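The keying idea can be sketched like this (illustrative only;
`substitution_map` is an invented name, not the actual `typed_address_sub`
implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Substitutions keyed by (address, type): a struct and its first member
// share an address but have different types, so they get distinct entries.
struct substitution_map {
    void register_file(void const* addr, std::type_info const& t, std::string file) {
        m_subs[{addr, std::type_index(t)}] = std::move(file);
    }
    // Tmp file to splice into the output, or nullptr if the component is
    // not substituted (the saver then serializes it from RAM as usual).
    std::string const* lookup(void const* addr, std::type_info const& t) const {
        auto it = m_subs.find({addr, std::type_index(t)});
        return it == m_subs.end() ? nullptr : &it->second;
    }

private:
    std::map<std::pair<void const*, std::type_index>, std::string> m_subs;
};
```

Keying on the address alone would be ambiguous precisely in the
struct-vs-first-member case; the type component disambiguates it.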

After step 8 returns, the tmp files are removed and `finalize_stats` reads
the saved file's size with `std::filesystem::file_size`.

---

## How the RAM cap is enforced — summary

The on-disk index size grows with `num_kmers`. The build's **resident**
memory does not, because every component that scales with input size is
either:

- **always on disk** (`strings`, `strings_offsets`, all sorted minimizer
runs, the merged minimizers file, the sparse-index compact_vectors, the
skew-index per-partition kmer/positions files, the codewords MPHF and
per-partition MPHFs), or
- **bounded by a working buffer** sized as a fraction of `ram_limit`:

| Buffer | Cap |
|----------------------------------------|--------------------------|
| Step 2 per-thread minimizer buffer | `ram_limit / 2 / num_threads` |
| Step 5 hashing buffer | `min(ram/8, RAM_available/3)` |
| Step 7.2 kmer-request external sort | `ram_limit / 8` |
| Step 7.2 phase-C `position_tuple` sort | `ram_limit / 8` |
| pthash external-memory builds | `ram_limit / 2` (its own `ram` field) |
| Every disk-backed reader/writer | `default_buffer_records ≈ 32 KiB` |
| Every external merge front (per run) | `4096 · sizeof(T)` |

The fractions (`/2` for the dominant per-step buffer, `/8` for buffers
that span step boundaries) are chosen so that, in practice, the sum of
overlapping allocations plus heap-fragmentation overhead between steps
stays under `ram_limit`. There
is a hard floor of `min_ram_limit_in_GiB` (enforced in
`validate_and_normalize_build_config`) below which step 4's MPHF builder
no longer has enough room to make progress.

The result: peak RSS during the build is governed by `-g`, not by
the input size or by the on-disk index size, and the saved index is
identical (byte-for-byte) to one written by an in-RAM builder followed by
`essentials::save`.
2 changes: 1 addition & 1 deletion external/pthash
6 changes: 6 additions & 0 deletions include/buckets_statistics.hpp
@@ -59,6 +59,12 @@ struct buckets_statistics {
uint64_t max_bucket_size() const { return m_max_bucket_size; }
uint64_t max_sparse_buckets_per_size() const { return m_max_sparse_buckets_per_size; }

/* Histogram bin: number of buckets whose size equals `s`. Bins beyond
MAX_BUCKET_SIZE are not tracked individually and return 0. */
uint64_t num_buckets_of_size(uint64_t s) const {
return s < m_bucket_sizes.size() ? m_bucket_sizes[s] : uint64_t(0);
}

void print_full() const {
std::cout << "=== bucket statistics (full) === \n";
for (uint64_t bucket_size = 1, prev_bucket_size = 0, prev_kmers_in_buckets = 0,
103 changes: 103 additions & 0 deletions include/builder/buffered_record_stream.hpp
#pragma once

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

namespace sshash {

/*
A small buffered, forward-only reader of fixed-size records over a
binary file. Records are read in fixed-capacity chunks (~`buffer_records
* sizeof(T)` bytes of RAM) so the per-instance footprint is bounded
independently of the file size.

Used as the underlying primitive by all of SSHash's builder readers
over on-disk record files (minimizer tuples, kmer requests, kmer
records, offset values, sorted-run records, etc.). The class is
move-only; for callers that need a copyable forward iterator (e.g.
pthash's `build_in_external_memory`, which takes an iterator by
value), wrap an instance in a `std::shared_ptr`.

Usage:
buffered_record_stream<T> s;
s.open(filename);
while (!s.empty()) {
consume(s.current());
s.advance();
}
s.close();
*/
template <typename T>
struct buffered_record_stream {
static constexpr uint64_t default_buffer_records = 4096;

buffered_record_stream() = default;
buffered_record_stream(buffered_record_stream const&) = delete;
buffered_record_stream& operator=(buffered_record_stream const&) = delete;
buffered_record_stream(buffered_record_stream&&) = default;
buffered_record_stream& operator=(buffered_record_stream&&) = default;

/* Open `filename` for forward reading; optionally seek to byte
`start_byte` before priming the read window. */
void open(std::string const& filename, uint64_t buffer_records = default_buffer_records,
std::streamoff start_byte = 0) {
m_buf.resize(std::max<uint64_t>(1, buffer_records));
m_in.open(filename, std::ifstream::binary);
if (!m_in.is_open()) { throw std::runtime_error("cannot open file '" + filename + "'"); }
if (start_byte != 0) m_in.seekg(start_byte, std::ios::beg);
m_pos = 0;
m_size = 0;
m_eof = false;
refill();
}

void close() {
if (m_in.is_open()) m_in.close();
m_buf.clear();
m_buf.shrink_to_fit();
m_pos = 0;
m_size = 0;
m_eof = true;
}

bool is_open() const { return m_in.is_open(); }

/* True iff there are no more records in the stream. */
bool empty() const { return m_pos >= m_size; }

/* Reference to the current record. Valid until the next `advance()`. */
T const& current() const {
assert(!empty());
return m_buf[m_pos];
}

/* Move to the next record; refills the buffer from disk on demand. */
void advance() {
assert(!empty());
++m_pos;
if (m_pos >= m_size && !m_eof) refill();
}

private:
std::ifstream m_in;
std::vector<T> m_buf;
uint64_t m_pos = 0;
uint64_t m_size = 0;
bool m_eof = true;

void refill() {
m_pos = 0;
m_in.read(reinterpret_cast<char*>(m_buf.data()),
static_cast<std::streamsize>(m_buf.size() * sizeof(T)));
const std::streamsize got = m_in.gcount();
m_size = static_cast<uint64_t>(got) / sizeof(T);
if (m_size == 0) m_eof = true;
}
};

} // namespace sshash