Skip to content

Merge upstream master with metagraph compatibility#86

Closed
adamant-pwn wants to merge 13 commits intojermp:masterfrom
ratschlab:merge-upstream-master-with-metagraph-compatibility
Closed

Merge upstream master with metagraph compatibility#86
adamant-pwn wants to merge 13 commits intojermp:masterfrom
ratschlab:merge-upstream-master-with-metagraph-compatibility

Conversation

@adamant-pwn
Copy link
Copy Markdown
Contributor

No description provided.

adamant-pwn and others added 11 commits October 5, 2025 14:38
This merge brings in all upstream changes from jermp/sshash, including:
- Major refactoring with dictionary_builder class
- New parse_file.cpp, compute_minimizer_tuples.cpp, build_sparse_and_skew_index.cpp
- Updated benchmarks and scripts
- cityhash integration
- Many bug fixes and improvements

Notable preserved ratschlab changes:
- SSHASH_BUILD_EXECUTABLES option in CMakeLists.txt (commit 64d8894)

Updated pthash submodule to upstream version.
@adamant-pwn adamant-pwn force-pushed the merge-upstream-master-with-metagraph-compatibility branch 3 times, most recently from 878dc13 to 1e8ba29 Compare February 16, 2026 13:14
…patibility

Changes for metagraph integration:

1. Wrap cityhash in namespace to avoid typedef pollution
   - cityhash.hpp and cityhash.cpp: Wrapped in 'cityhash' namespace
   - Prevents conflicts with rollinghashcpp uint64 typedef
   - hash_util.hpp: Use cityhash::CityHash128WithSeed instead of static CityMurmur
   - tools/query.cpp: Changed uint64 to uint64_t (standard type)

2. Add public accessor methods to dictionary
   - Added kmer_type typedef for easier template parameter extraction
   - Added strings() accessor returning m_spss.strings bit vector
   - Added strings_offsets() accessor returning m_spss.strings_offsets
   - Moved strings_offsets() next to string_offsets() for better organization

3. Build system changes
   - CMakeLists.txt: Added cityhash.cpp to SSHASH_SOURCES
   - CMakeLists.txt: Link sshash executable against sshash_static (includes cityhash)

4. Update pthash submodule
   - Now tracking ratschlab/pthash master with compute_empirical_entropy fix
   - .gitmodules: Updated URL to point to ratschlab/pthash
@adamant-pwn adamant-pwn force-pushed the merge-upstream-master-with-metagraph-compatibility branch from 1e8ba29 to 8815360 Compare February 16, 2026 14:10
Fix two critical issues discovered during metagraph integration:

1. Sparse index encoding bug:
   - The sparse index encoder uses list_id in control codewords for
     buckets of size 2-64, but was allocating bits based on offset
     size (num_bits_per_offset + 1) instead of list_id range.
   - list_id counts buckets within each size category and can reach
     the total number of minimizers in worst case (all same size).
   - Fixed by tracking max_buckets_per_size during statistics
     collection and using: max(num_bits_per_offset + 1,
     2 + 6 + ceil(log2(max_buckets_per_size + 1)))
   - This provides tight bound with minimal overhead (~2-3% worst case).

2. Verbose flag handling:
   - build_stats.print() was called unconditionally in
     dictionary_builder.hpp, ignoring build_config.verbose flag.
   - Fixed by making print() conditional: if (build_config.verbose)
   - This respects the library's own configuration mechanism.

Implementation:
- Added m_max_buckets_per_size member to buckets_statistics class
- Modified add_bucket_size() to track maximum incrementally
- Zero time overhead (uses std::max with existing data)
- Clean encapsulation within statistics class
@adamant-pwn
Copy link
Copy Markdown
Contributor Author

adamant-pwn commented Feb 16, 2026

Sorry, I think this PR was created accidentally, see #87 for the relecant changes.

@adamant-pwn adamant-pwn deleted the merge-upstream-master-with-metagraph-compatibility branch February 16, 2026 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants