Fix sparse index bug and improve library integration#87

Merged
jermp merged 2 commits into jermp:master from ratschlab:upstream-bug-fixes on Feb 23, 2026

Conversation

@adamant-pwn
Contributor

This commit fixes a bug and adds improvements for better integration as a library dependency:

1. Fix sparse index bit budget calculation:
   - The sparse index encoder uses list_id in control codewords for buckets of size 2-64, but was allocating bits based only on num_bits_per_offset + 1 (the offset size).
   - list_id is a counter within each bucket size category that can reach the total number of minimizers in the worst case (when all minimizers fall into buckets of the same size).
   - This mismatch causes assertion failures on repetitive sequences.
   - Fixed by tracking max_buckets_per_size and ensuring sufficient bits: max(num_bits_per_offset + 1, 2 + 6 + ceil(log2(max_buckets_per_size + 1))).
   - No behavior change on typical datasets: if the previous estimate was sufficient, the new one is smaller and the old value is used. For datasets where the assert failed, this fix enables support.
   - The implementation uses incremental tracking in buckets_statistics via std::max (zero time overhead, no additional pass).

2. Make verbose output respect the build_config flag:
   - build_stats.print() was called unconditionally in dictionary_builder.hpp.
   - It now respects the build_config.verbose flag for better control.

3. Namespace cityhash to avoid typedef conflicts:
   - cityhash defines a uint64 typedef that can conflict with other libraries (we encountered conflicts with rollinghashcpp's uint64).
   - Wrapped cityhash in a 'cityhash' namespace to prevent pollution.
   - Updated hash_util.hpp to use cityhash::CityHash128WithSeed instead of the static CityMurmur function, which is mathematically equivalent for the hash sizes used in sshash.
   - Updated tools/query.cpp to use the standard uint64_t type.

4. Add dictionary accessor methods:
   - Added strings() and strings_offsets() accessors to enable custom operations on dictionary internals without exposing all members.

5. Add SSHASH_BUILD_EXECUTABLES build option:
   - A new CMake option (default ON) allows building sshash as a library without executables, improving integration as a dependency.
   - Added CITYHASH_SOURCES and properly linked cityhash into sshash_static.
   - Wrapped all executable builds in a conditional for flexibility.
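The bit-budget rule from item 1 can be sketched in isolation. The snippet below is a hypothetical illustration (control_codeword_bits and ceil_log2 are illustrative names, not the actual sshash identifiers), assuming the 2 + 6 + ceil(log2(max_buckets_per_size + 1)) layout stated above:

```cpp
#include <algorithm>
#include <cstdint>

// Smallest b such that 2^b >= x (hypothetical helper, not sshash code).
static uint64_t ceil_log2(uint64_t x) {
    uint64_t bits = 0;
    while ((uint64_t(1) << bits) < x) ++bits;
    return bits;
}

// A control codeword must be wide enough for both encodings: the offset
// encoding (num_bits_per_offset + 1 bits) and the 2 + 6 + list_id
// encoding used for buckets of size 2-64.
uint64_t control_codeword_bits(uint64_t num_bits_per_offset,
                               uint64_t max_buckets_per_size) {
    uint64_t list_id_bits = 2 + 6 + ceil_log2(max_buckets_per_size + 1);
    return std::max(num_bits_per_offset + 1, list_id_bits);
}
```

On typical inputs max_buckets_per_size stays small, so the old num_bits_per_offset + 1 bound dominates and the allocated width is unchanged; on adversarial inputs the second term grows and avoids the assertion failure.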

@jermp
Owner

jermp commented Feb 16, 2026

Hi Oleksandr, and thanks for this. About the "bug" that you reported: I've actually commented on that scenario in Sect. 6 (footnote 4) of the preprint https://jermp.github.io/assets/pdf/papers/2026.01.21.700884v1.full.pdf, where I proposed to adapt min_l to allocate more bits for list_id. That would, however, require changing a constant and recompiling the library. I'm curious about the pathological cases that you tested to make that fail :)

Anyway, will merge it soon.

@adamant-pwn
Contributor Author

The particular test that failed is constructed as follows:

const std::vector<std::string> sequences {
    std::string(100, 'A') + std::string(k, 'C'),
    std::string(100, 'T') + std::string(2, 'G') + std::string(100, 'N'),
    std::string(100, 'G') + std::string(2, 'N') + std::string(100, 'A')
};

I think just a single sequence like std::string(100, 'A') could fail it as well.

@jermp
Owner

jermp commented Feb 16, 2026

Sure, you can obviously build a nasty artificial example that violates the statistical properties of minimizers.
However, I'm not sure the test is representative of SSHash's use case, especially because the string set in the example will contain duplicate kmers for k < 100...

@adamant-pwn
Contributor Author

adamant-pwn commented Feb 16, 2026

I understand. Sorry if the term "bug" seems too detached from how unlikely this is to happen on practical datasets 🥲

We have a lot of those "worse and weirder than it happens in practice" sequences in our unit and integration tests, both because we want to avoid unexpected failures even on "degenerate" inputs whenever possible, and because it's often easier to manually verify what the "correct" output is. So whenever I need to update a submodule, it typically starts with reconciling the update with any changes in our test results 😅

In any case, I hope the change will be useful to you: if for whatever reason a "bad" input ever pops up in practice, it will be processed smoothly rather than trigger an assert and require increasing min_l and recompiling.

@jermp
Owner

jermp commented Feb 16, 2026

Yes, it is absolutely useful :) I wonder, however, if two passes rather than one can have some impact on large collections.

@adamant-pwn
Contributor Author

adamant-pwn commented Feb 16, 2026

It shouldn't be a problem, in the sense that you already had two passes; I just moved the part that depends on knowing the control codeword max width into the second pass, so that the actual needed width can be computed between them.
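The incremental tracking mentioned in the description can be sketched as follows. This is a hypothetical illustration (buckets_stats_sketch is an illustrative name, not the actual sshash type), showing why updating the maximum with std::max during the first pass adds no extra pass:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: track, per bucket size, how many buckets have
// that size, and keep a running maximum across all sizes.
struct buckets_stats_sketch {
    std::unordered_map<uint64_t, uint64_t> count_per_size;  // size -> #buckets
    uint64_t max_buckets_per_size = 0;

    // Called once per bucket during the first pass; the running maximum
    // is updated in O(1), so max_buckets_per_size is ready before the
    // second (encoding) pass without any additional traversal.
    void add_bucket(uint64_t bucket_size) {
        uint64_t c = ++count_per_size[bucket_size];
        max_buckets_per_size = std::max(max_buckets_per_size, c);
    }
};
```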

@jermp
Owner

jermp commented Feb 16, 2026

Ohh, I see; you're right. Perfect, then.

Comment thread on include/buckets_statistics.hpp (outdated):
- Renamed m_max_buckets_per_size to m_max_sparse_buckets_per_size for clarity
- Only update the counter for bucket_size > 1 (buckets that use list_id encoding)
- Size-1 buckets use direct offset encoding, not the sparse index
@jermp jermp merged commit 946c8c4 into jermp:master Feb 23, 2026
9 checks passed
@jermp
Owner

jermp commented Feb 23, 2026

Thanks again Oleksandr. Just merged now that I'm back from a conference.
