Conversation
…ueries (for a tiny space increase)
…o_uint_kmer by setting chars rather than appending
swap locate query with string checking in lookup
This commit fixes a bug and adds improvements for better
integration as a library dependency:
1. Fix sparse index bit budget calculation:
- The sparse index encoder uses list_id in control codewords for
buckets of size 2-64, but was allocating bits based only on
num_bits_per_offset + 1 (offset size).
- list_id is a counter within each bucket size category that can
reach the total number of minimizers in worst case (when all
minimizers fall into buckets of the same size).
- This mismatch causes assertion failures on repetitive sequences.
- Fixed by tracking max_buckets_per_size and ensuring sufficient bits:
max(num_bits_per_offset + 1, 2 + 6 + ceil(log2(max_buckets_per_size + 1)))
- No behavior change on typical datasets: if the previous estimate
was sufficient, the new one will be smaller and use the old value.
For datasets where the assert failed, this fix enables support.
- Implementation uses incremental tracking in buckets_statistics
via std::max (zero time overhead, no additional pass).
2. Make verbose output respect build_config flag:
- build_stats.print() was called unconditionally in dictionary_builder.hpp.
- Now respects build_config.verbose flag for better control.
3. Namespace cityhash to avoid typedef conflicts:
- cityhash defines a uint64 typedef that can conflict with other
libraries (we encountered conflicts with rollinghashcpp's uint64).
- Wrapped cityhash in 'cityhash' namespace to prevent pollution.
- Updated hash_util.hpp to use cityhash::CityHash128WithSeed instead
of the static CityMurmur function, which is mathematically equivalent
for the hash sizes used in sshash.
- Updated tools/query.cpp to use standard uint64_t type.
4. Add dictionary accessor methods:
- Added strings() and strings_offsets() accessors to enable custom
operations on dictionary internals without exposing all members.
5. Add SSHASH_BUILD_EXECUTABLES build option:
- New CMake option (default ON) allows building sshash as a library
without executables, improving integration as a dependency.
- Added CITYHASH_SOURCES and properly link cityhash into sshash_static.
- Wrapped all executable builds in conditional for flexibility.
- Renamed m_max_buckets_per_size to m_max_sparse_buckets_per_size for clarity - Only update counter for bucket_size > 1 (buckets that use list_id encoding) - Size-1 buckets use direct offset encoding, not the sparse index
Fix sparse index bug and improve library integration
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.