Merge upstream master with metagraph compatibility #86
Closed
adamant-pwn wants to merge 13 commits into jermp:master from
Conversation
This merge brings in all upstream changes from jermp/sshash, including:

- Major refactoring with dictionary_builder class
- New parse_file.cpp, compute_minimizer_tuples.cpp, build_sparse_and_skew_index.cpp
- Updated benchmarks and scripts
- cityhash integration
- Many bug fixes and improvements

Notable preserved ratschlab changes:

- SSHASH_BUILD_EXECUTABLES option in CMakeLists.txt (commit 64d8894)

Updated pthash submodule to upstream version.
Force-pushed from 878dc13 to 1e8ba29
…patibility

Changes for metagraph integration:

1. Wrap cityhash in namespace to avoid typedef pollution
   - cityhash.hpp and cityhash.cpp: wrapped in 'cityhash' namespace
   - Prevents conflicts with the rollinghashcpp uint64 typedef
   - hash_util.hpp: use cityhash::CityHash128WithSeed instead of static CityMurmur
   - tools/query.cpp: changed uint64 to uint64_t (standard type)
2. Add public accessor methods to dictionary
   - Added kmer_type typedef for easier template parameter extraction
   - Added strings() accessor returning the m_spss.strings bit vector
   - Added strings_offsets() accessor returning m_spss.strings_offsets
   - Moved strings_offsets() next to string_offsets() for better organization
3. Build system changes
   - CMakeLists.txt: added cityhash.cpp to SSHASH_SOURCES
   - CMakeLists.txt: link the sshash executable against sshash_static (includes cityhash)
4. Update pthash submodule
   - Now tracking ratschlab/pthash master with the compute_empirical_entropy fix
   - .gitmodules: updated URL to point to ratschlab/pthash
Force-pushed from 1e8ba29 to 8815360
Fix two critical issues discovered during metagraph integration:
1. Sparse index encoding bug:
- The sparse index encoder uses list_id in control codewords for
buckets of size 2-64, but was allocating bits based on the offset
size (num_bits_per_offset + 1) instead of the list_id range.
- list_id counts buckets within each size category and, in the worst
case (all buckets the same size), can reach the total number of minimizers.
- Fixed by tracking max_buckets_per_size during statistics
collection and using: max(num_bits_per_offset + 1,
2 + 6 + ceil(log2(max_buckets_per_size + 1)))
- This provides a tight bound with minimal overhead (~2-3% in the worst case).
2. Verbose flag handling:
- build_stats.print() was called unconditionally in
dictionary_builder.hpp, ignoring build_config.verbose flag.
- Fixed by making print() conditional: if (build_config.verbose)
- This respects the library's own configuration mechanism.
Implementation:
- Added m_max_buckets_per_size member to buckets_statistics class
- Modified add_bucket_size() to track maximum incrementally
- Zero time overhead (uses std::max with existing data)
- Clean encapsulation within statistics class
Author (Contributor):

> Sorry, I think this PR was created accidentally; see #87 for the relevant changes.