Fix sparse index bug and improve library integration #87
This commit fixes a bug and adds improvements for better
integration as a library dependency:
1. Fix sparse index bit budget calculation:
- The sparse index encoder uses list_id in control codewords for
buckets of size 2-64, but was allocating bits based only on
num_bits_per_offset + 1 (offset size).
- list_id is a counter within each bucket-size category that, in the
worst case, can reach the total number of minimizers (when all
minimizers fall into buckets of the same size).
- This mismatch causes assertion failures on repetitive sequences.
- Fixed by tracking max_buckets_per_size and ensuring sufficient bits:
max(num_bits_per_offset + 1, 2 + 6 + ceil(log2(max_buckets_per_size + 1)))
- No behavior change on typical datasets: when the previous estimate
was already sufficient, the new term is smaller and the max() simply
returns the old value. For datasets where the assert used to fire,
this fix enables support.
- Implementation uses incremental tracking in buckets_statistics
via std::max (no extra pass, negligible time overhead); see the
first sketch after this list.
2. Make verbose output respect build_config flag:
- build_stats.print() was called unconditionally in dictionary_builder.hpp.
- Now respects the build_config.verbose flag for better control.
3. Namespace cityhash to avoid typedef conflicts:
- cityhash defines a uint64 typedef that can conflict with other
libraries (we encountered conflicts with rollinghashcpp's uint64).
- Wrapped cityhash in a 'cityhash' namespace to prevent global-namespace pollution (second sketch below).
- Updated hash_util.hpp to use cityhash::CityHash128WithSeed instead
of the static CityMurmur function, which is mathematically equivalent
for the hash sizes used in sshash.
- Updated tools/query.cpp to use standard uint64_t type.
4. Add dictionary accessor methods:
- Added strings() and strings_offsets() accessors to enable custom
operations on dictionary internals without exposing all members (third sketch below).
5. Add SSHASH_BUILD_EXECUTABLES build option:
- New CMake option (default ON) allows building sshash as a library
without executables, improving integration as a dependency.
- Added CITYHASH_SOURCES and properly linked cityhash into sshash_static.
- Wrapped all executable builds in a conditional for flexibility.
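To make item 1 concrete, here is a minimal, self-contained sketch of the fixed bit-budget computation. The helper names, the toy list_id values, and `num_bits_per_offset = 12` are illustrative assumptions, not the actual sshash identifiers or data:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Bits needed to represent any value in [0, x], i.e. ceil(log2(x + 1)).
static uint64_t ceil_log2_plus1(uint64_t x) {
    uint64_t bits = 0;
    while (x > 0) { ++bits; x >>= 1; }
    return bits;
}

// Control codeword width: wide enough for both the offset encoding
// (num_bits_per_offset + 1 bits) and the list_id encoding
// (2 tag bits + 6 bucket-size bits + bits for the largest list_id).
static uint64_t control_bits(uint64_t num_bits_per_offset, uint64_t max_buckets_per_size) {
    return std::max(num_bits_per_offset + 1,
                    2 + 6 + ceil_log2_plus1(max_buckets_per_size));
}

int main() {
    // Toy statistics pass: track the largest list_id with one std::max each.
    uint64_t max_buckets_per_size = 0;
    for (uint64_t list_id : {0ULL, 3ULL, 512ULL, 70000ULL}) {
        max_buckets_per_size = std::max(max_buckets_per_size, list_id);
    }
    // With num_bits_per_offset = 12, the old budget (13 bits) is too small:
    // the list_id term needs 2 + 6 + 17 = 25 bits, and max() picks it.
    std::printf("control bits = %llu\n",
                (unsigned long long)control_bits(12, max_buckets_per_size));
}
```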
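And a sketch of the namespacing from item 3. The wrapping shown here is illustrative (the real change edits the vendored city.h / city.cc sources, and the typedefs are assumed to match cityhash's usual ones), but the effect is the same: cityhash's `uint64` no longer leaks into the global namespace:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

namespace cityhash {
typedef std::uint64_t uint64;  // stays inside the namespace now
typedef std::pair<uint64, uint64> uint128;
uint128 CityHash128WithSeed(const char* s, std::size_t len, uint128 seed);
}  // namespace cityhash

// Another vendored library is now free to define its own global uint64
// (as rollinghashcpp does) without colliding with cityhash's typedef:
typedef std::uint64_t uint64;

// Call sites qualify explicitly, e.g.:
// cityhash::uint128 h = cityhash::CityHash128WithSeed(ptr, num_bytes, seed);
```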
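Finally, the accessor pattern from item 4, as a hypothetical sketch: all member names and types below are invented placeholders (the real sshash members differ); the point is read-only access without exposing the members themselves:

```cpp
#include <cstdint>
#include <vector>

struct dictionary {
    // Const references let callers run custom operations over the
    // internals without being able to mutate them.
    std::vector<uint8_t> const& strings() const { return m_strings; }
    std::vector<uint64_t> const& strings_offsets() const { return m_strings_offsets; }

private:
    std::vector<uint8_t> m_strings;           // concatenated packed strings
    std::vector<uint64_t> m_strings_offsets;  // where each string begins
};
```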
Hi Oleksandr, and thanks for this. About the "bug" that you reported: I've actually commented on that scenario in Sect. 6 (footnote 4) of the preprint https://jermp.github.io/assets/pdf/papers/2026.01.21.700884v1.full.pdf, where I proposed to adapt … Anyway, will merge it soon.
The particular test that failed is constructed as follows:

```cpp
const std::vector<std::string> sequences{
    std::string(100, 'A') + std::string(k, 'C'),
    std::string(100, 'T') + std::string(2, 'G') + std::string(100, 'N'),
    std::string(100, 'G') + std::string(2, 'N') + std::string(100, 'A')
};
```

I think, just a single sequence like …
Sure, you can obviously build a nasty artificial example that violates the statistical properties of minimizers. |
I understand. Sorry if the term "bug" seems too detached from how unlikely this is to happen on practical datasets 🥲 We have a lot of those "worse and weirder than anything in practice" sequences in our unit / integration tests, both because we want to avoid unexpected failures even on "degenerate" inputs whenever possible, and because it is often easier to manually verify what the "correct" output is. So whenever I need to update a submodule, it typically starts with reconciling the update against any changes in our test results 😅 I hope the change is useful to you in any case: if for whatever reason a "bad" input ever pops up in practice, it will be processed smoothly rather than trigger an assert and require increasing …
Yes, it is absolutely useful :) I wonder, however, if two passes rather than one can have some impact on large collections. |
It shouldn't be a problem, in the sense that you already had two passes; I just moved the part that depends on knowing the control codeword's max width into the second pass, so that the actual needed width can be computed between them.
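For concreteness, a toy sketch of that pass ordering, with all names hypothetical: the statistics pass gains one std::max per bucket, the exact width is fixed between the passes, and the encoding pass then uses it; no extra pass over the data is introduced.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main() {
    std::vector<uint64_t> bucket_sizes = {2, 2, 3, 1, 2, 3};  // toy input
    std::map<uint64_t, uint64_t> num_buckets_of_size;
    uint64_t max_buckets_per_size = 0;

    // Pass 1: the existing statistics pass, extended with one std::max per
    // bucket; size-1 buckets are skipped since they use direct offsets.
    for (uint64_t size : bucket_sizes) {
        if (size > 1) {
            uint64_t list_id = num_buckets_of_size[size]++;
            max_buckets_per_size = std::max(max_buckets_per_size, list_id);
        }
    }

    // Between the passes the exact codeword width is now known (see the
    // control_bits sketch above); pass 2 encodes with that width.
    std::printf("max list_id = %llu\n", (unsigned long long)max_buckets_per_size);
}
```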
Ohh, I see; you're right. Perfect, then.
- Renamed m_max_buckets_per_size to m_max_sparse_buckets_per_size for clarity
- Only update counter for bucket_size > 1 (buckets that use list_id encoding)
- Size-1 buckets use direct offset encoding, not the sparse index
Thanks again Oleksandr. Just merged now that I'm back from a conference.