Fix sparse index bug and improve library integration #87
This commit fixes a bug and adds improvements for better
integration as a library dependency:
1. Fix sparse index bit budget calculation:
- The sparse index encoder uses list_id in control codewords for
buckets of size 2-64, but was allocating bits based only on
num_bits_per_offset + 1 (offset size).
- list_id is a counter within each bucket-size category that, in the
worst case, can reach the total number of minimizers (when all
minimizers fall into buckets of the same size).
- This mismatch causes assertion failures on repetitive sequences.
- Fixed by tracking max_buckets_per_size and ensuring sufficient bits:
max(num_bits_per_offset + 1, 2 + 6 + ceil(log2(max_buckets_per_size + 1)))
- No behavior change on typical datasets: when the previous estimate
was already sufficient, the new term is smaller and the max() simply
returns the old value. For datasets where the assert used to fire,
this fix enables support.
- Implementation uses incremental tracking in buckets_statistics
via std::max (no extra pass, negligible time overhead); see the
first sketch after this list.
2. Make verbose output respect build_config flag:
- build_stats.print() was called unconditionally in dictionary_builder.hpp.
- Now respects the build_config.verbose flag for better control.
3. Namespace cityhash to avoid typedef conflicts:
- cityhash defines a uint64 typedef that can conflict with other
libraries (we encountered conflicts with rollinghashcpp's uint64).
- Wrapped cityhash in a 'cityhash' namespace to prevent global-namespace pollution (second sketch below).
- Updated hash_util.hpp to use cityhash::CityHash128WithSeed instead
of the static CityMurmur function, which is mathematically equivalent
for the hash sizes used in sshash.
- Updated tools/query.cpp to use standard uint64_t type.
4. Add dictionary accessor methods:
- Added strings() and strings_offsets() accessors to enable custom
operations on dictionary internals without exposing all members (third sketch below).
5. Add SSHASH_BUILD_EXECUTABLES build option:
- New CMake option (default ON) allows building sshash as a library
without executables, improving integration as a dependency.
- Added CITYHASH_SOURCES and properly linked cityhash into sshash_static.
- Wrapped all executable builds in a conditional for flexibility.
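To make item 1 concrete, here is a minimal, self-contained sketch of the fixed bit-budget computation. The helper names, the toy list_id values, and `num_bits_per_offset = 12` are illustrative assumptions, not the actual sshash identifiers or data:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Bits needed to represent any value in [0, x], i.e. ceil(log2(x + 1)).
static uint64_t ceil_log2_plus1(uint64_t x) {
    uint64_t bits = 0;
    while (x > 0) { ++bits; x >>= 1; }
    return bits;
}

// Control codeword width: wide enough for both the offset encoding
// (num_bits_per_offset + 1 bits) and the list_id encoding
// (2 tag bits + 6 bucket-size bits + bits for the largest list_id).
static uint64_t control_bits(uint64_t num_bits_per_offset, uint64_t max_buckets_per_size) {
    return std::max(num_bits_per_offset + 1,
                    2 + 6 + ceil_log2_plus1(max_buckets_per_size));
}

int main() {
    // Toy statistics pass: track the largest list_id with one std::max each.
    uint64_t max_buckets_per_size = 0;
    for (uint64_t list_id : {0ULL, 3ULL, 512ULL, 70000ULL}) {
        max_buckets_per_size = std::max(max_buckets_per_size, list_id);
    }
    // With num_bits_per_offset = 12, the old budget (13 bits) is too small:
    // the list_id term needs 2 + 6 + 17 = 25 bits, and max() picks it.
    std::printf("control bits = %llu\n",
                (unsigned long long)control_bits(12, max_buckets_per_size));
}
```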
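And a sketch of the namespacing from item 3. The wrapping shown here is illustrative (the real change edits the vendored city.h / city.cc sources, and the typedefs are assumed to match cityhash's usual ones), but the effect is the same: cityhash's `uint64` no longer leaks into the global namespace:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

namespace cityhash {
typedef std::uint64_t uint64;  // stays inside the namespace now
typedef std::pair<uint64, uint64> uint128;
uint128 CityHash128WithSeed(const char* s, std::size_t len, uint128 seed);
}  // namespace cityhash

// Another vendored library is now free to define its own global uint64
// (as rollinghashcpp does) without colliding with cityhash's typedef:
typedef std::uint64_t uint64;

// Call sites qualify explicitly, e.g.:
// cityhash::uint128 h = cityhash::CityHash128WithSeed(ptr, num_bytes, seed);
```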
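Finally, the accessor pattern from item 4, as a hypothetical sketch: all member names and types below are invented placeholders (the real sshash members differ); the point is read-only access without exposing the members themselves:

```cpp
#include <cstdint>
#include <vector>

struct dictionary {
    // Const references let callers run custom operations over the
    // internals without being able to mutate them.
    std::vector<uint8_t> const& strings() const { return m_strings; }
    std::vector<uint64_t> const& strings_offsets() const { return m_strings_offsets; }

private:
    std::vector<uint8_t> m_strings;           // concatenated packed strings
    std::vector<uint64_t> m_strings_offsets;  // where each string begins
};
```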
Hi Oleksandr, and thanks for this. About the "bug" that you reported: I've actually commented on that scenario in Sect. 6 (footnote 4) of the preprint https://jermp.github.io/assets/pdf/papers/2026.01.21.700884v1.full.pdf, where I proposed to adapt … Anyway, will merge it soon.
The particular test that failed is constructed as follows:

```cpp
const std::vector<std::string> sequences{
    std::string(100, 'A') + std::string(k, 'C'),
    std::string(100, 'T') + std::string(2, 'G') + std::string(100, 'N'),
    std::string(100, 'G') + std::string(2, 'N') + std::string(100, 'A')
};
```

I think, just a single sequence like …
Sure, you can obviously build a nasty artificial example that violates the statistical properties of minimizers. |
I understand. Sorry if the term "bug" seems too detached from how unlikely this is to happen on practical datasets 🥲 We have a lot of those "worse and weirder than anything in practice" sequences in our unit / integration tests, both because we want to avoid unexpected failures even on "degenerate" inputs whenever possible, and because it is often easier to manually verify what the "correct" output is. So whenever I need to update a submodule, it typically starts with reconciling the update against any changes in our test results 😅 I hope the change is useful to you in any case: if for whatever reason a "bad" input ever pops up in practice, it will be processed smoothly rather than trigger an assert and require increasing …
Yes, it is absolutely useful :) I wonder, however, if two passes rather than one can have some impact on large collections. |
It shouldn't be a problem, in the sense that you already had two passes; I just moved the part that depends on knowing the control codeword's max width into the second pass, so that the actual needed width can be computed between them.
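For concreteness, a toy sketch of that pass ordering, with all names hypothetical: the statistics pass gains one std::max per bucket, the exact width is fixed between the passes, and the encoding pass then uses it; no extra pass over the data is introduced.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main() {
    std::vector<uint64_t> bucket_sizes = {2, 2, 3, 1, 2, 3};  // toy input
    std::map<uint64_t, uint64_t> num_buckets_of_size;
    uint64_t max_buckets_per_size = 0;

    // Pass 1: the existing statistics pass, extended with one std::max per
    // bucket; size-1 buckets are skipped since they use direct offsets.
    for (uint64_t size : bucket_sizes) {
        if (size > 1) {
            uint64_t list_id = num_buckets_of_size[size]++;
            max_buckets_per_size = std::max(max_buckets_per_size, list_id);
        }
    }

    // Between the passes the exact codeword width is now known (see the
    // control_bits sketch above); pass 2 encodes with that width.
    std::printf("max list_id = %llu\n", (unsigned long long)max_buckets_per_size);
}
```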
Ohh, I see; you're right. Perfect, then.
- Renamed m_max_buckets_per_size to m_max_sparse_buckets_per_size for clarity
- Only update counter for bucket_size > 1 (buckets that use list_id encoding)
- Size-1 buckets use direct offset encoding, not the sparse index
Thanks again Oleksandr. Just merged now that I'm back from a conference.