Skip to content

Bring dev up to date with master#88

Merged
jermp merged 159 commits intodevfrom
master
Mar 9, 2026
Merged

Bring dev up to date with master#88
jermp merged 159 commits intodevfrom
master

Conversation

@jermp
Copy link
Copy Markdown
Owner

@jermp jermp commented Mar 9, 2026

No description provided.

hmusta and others added 30 commits December 9, 2024 12:16
…o_uint_kmer by setting chars rather than appending
jermp and others added 29 commits November 12, 2025 21:11
swap locate query with string checking in lookup
This commit fixes a bug and adds improvements for better
integration as a library dependency:

1. Fix sparse index bit budget calculation:
   - The sparse index encoder uses list_id in control codewords for
     buckets of size 2-64, but was allocating bits based only on
     num_bits_per_offset + 1 (offset size).
   - list_id is a counter within each bucket size category that can
     reach the total number of minimizers in worst case (when all
     minimizers fall into buckets of the same size).
   - This mismatch causes assertion failures on repetitive sequences.
   - Fixed by tracking max_buckets_per_size and ensuring sufficient bits:
     max(num_bits_per_offset + 1, 2 + 6 + ceil(log2(max_buckets_per_size + 1)))
   - No behavior change on typical datasets: if the previous estimate
     was sufficient, the new one will be smaller and use the old value.
     For datasets where the assert failed, this fix enables support.
   - Implementation uses incremental tracking in buckets_statistics
     via std::max (zero time overhead, no additional pass).

2. Make verbose output respect build_config flag:
   - build_stats.print() was called unconditionally in dictionary_builder.hpp.
   - Now respects build_config.verbose flag for better control.

3. Namespace cityhash to avoid typedef conflicts:
   - cityhash defines a uint64 typedef that can conflict with other
     libraries (we encountered conflicts with rollinghashcpp's uint64).
   - Wrapped cityhash in 'cityhash' namespace to prevent pollution.
   - Updated hash_util.hpp to use cityhash::CityHash128WithSeed instead
     of the static CityMurmur function, which is mathematically equivalent
     for the hash sizes used in sshash.
   - Updated tools/query.cpp to use standard uint64_t type.

4. Add dictionary accessor methods:
   - Added strings() and strings_offsets() accessors to enable custom
     operations on dictionary internals without exposing all members.

5. Add SSHASH_BUILD_EXECUTABLES build option:
   - New CMake option (default ON) allows building sshash as a library
     without executables, improving integration as a dependency.
   - Added CITYHASH_SOURCES and properly link cityhash into sshash_static.
   - Wrapped all executable builds in conditional for flexibility.
- Renamed m_max_buckets_per_size to m_max_sparse_buckets_per_size for clarity
- Only update counter for bucket_size > 1 (buckets that use list_id encoding)
- Size-1 buckets use direct offset encoding, not the sparse index
Fix sparse index bug and improve library integration
@jermp jermp merged commit 82d4ed3 into dev Mar 9, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants