Hublabel#1
Conversation
… get a wrong answer
…bbdsg that makes labels that can fit
This reverts commit 5436d73.
…indexing test cases
…our generated headers
…e saving the min/zipcodes files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
planned out by Claude Opus 4.7 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
planned by Claude Opus 4.7 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix potentially very large loop when looking for rescue graph
Revise recombination, chain, and alignment scoring
remove unnecessary return info
Make `vg giraffe` error when the minimizers or zipcodes are older than the distance index
Missing transition checker
|
There was a problem hiding this comment.
Pull Request Overview
This PR implements significant Hub Labeling features and refactors the SnarlDistanceIndex. However, the PR is currently not up to Codacy standards. Key concerns include a critical lack of unit testing for several new core features: the log-scaled gap scoring logic, the 'evaluation bonus' heuristic in chaining, and the index validation logic.
Technically, there are several high-severity issues regarding uninitialized POD members in the clustering logic which could lead to non-deterministic behavior. Additionally, the new make_temporary_distance_index function exhibits extreme cyclomatic complexity (CCN 100), making it virtually untestable in its current form. These issues, combined with the empty PR description and removal of legacy Protobuf-handling functions without explicit deprecation, should be addressed before merging.
About this PR
- The PR description is empty. Given the scope of changes—including C++20 migration, Hub Labeling implementation, and scoring refactors—a detailed summary of changes and design decisions is required.
- The Hub Labeling implementation introduces a dependency on 'bdsg/ch.hpp'. Verify that the build environment and CI pipelines are updated to include this header.
1 comment outside of the diff
src/cluster.hpp
line 606🔴 HIGH RISK
Suggestion: Mark this constructor as explicit to prevent implicit conversion fromSnarlDistanceIndextoTipAnchoredMaxDistance.explicit TipAnchoredMaxDistance(SnarlDistanceIndex& distance_index);
Test suggestions
- Verify Hub Labeling query accuracy in oversized snarls against Dijkstra ground truth.
- Verify that snarl regularity checking correctly identifies simple, regular, and irregular snarls.
- Test alignment rescoring logic with logged gap lengths for matches, mismatches, and various indel sizes.
- Verify the 'evaluation bonus' heuristic in chaining preserves haplotype consistency compared to standard DP.
- Ensure SnarlDecompositionFuzzer correctly replays and flips nested chain events.
- Validate the index predates check in Giraffe correctly identifies stale indices.
Prompt proposal for missing tests
Consider implementing these tests if applicable:
1. Test alignment rescoring logic with logged gap lengths for matches, mismatches, and various indel sizes.
2. Verify the 'evaluation bonus' heuristic in chaining preserves haplotype consistency compared to standard DP.
3. Validate the index predates check in Giraffe correctly identifies stale indices.
Low confidence findings
- Several sorting and cleaning functions in graph.cpp/hpp were removed. Ensure these are not part of the public API or that downstream external tools have been migrated to the HandleGraph equivalents.
TIP Improve review quality by adding custom instructions
TIP How was this review? Give us feedback
| class MEMClusterer::HitEdge { | ||
| public: | ||
| HitEdge(size_t to_idx, int32_t weight, int64_t distance) : to_idx(to_idx), weight(weight), distance(distance) {} | ||
| HitEdge() = default; |
There was a problem hiding this comment.
🔴 HIGH RISK
The default constructor does not initialize the primitive members to_idx, weight, and distance. This can lead to garbage values being used during dynamic programming.
| HitEdge() = default; | |
| HitEdge() : to_idx(0), weight(0), distance(0) {} |
| finish_gap(); | ||
| } | ||
|
|
||
| int score_alignment_with_logged_gaps(const size_t& matches, const size_t& mismatches, const std::vector<size_t>& gap_lengths) { |
There was a problem hiding this comment.
🟡 MEDIUM RISK
The new scoring function 'score_alignment_with_logged_gaps' lacks corresponding unit tests in src/unittest/alignment.cpp to verify the log-scaling behavior against expected minimap2 scores for various indel sizes.
| using namespace handlegraph; | ||
| namespace vg { | ||
|
|
||
| SnarlDistanceIndex::TemporaryDistanceIndex make_temporary_distance_index( |
There was a problem hiding this comment.
🟡 MEDIUM RISK
This function is extremely complex (CCN 100). The logic for traversing the snarl decomposition and handling BEGIN/END events should be refactored into smaller, specialized private helper methods to improve testability and readability.
|
|
||
| const handlegraph::HandleGraph* graph_ptr = (const handlegraph::HandleGraph*) &gbz.graph; | ||
|
|
||
| double total_zipcode_time = 0.0, total_decoder_time = 0.0; |
There was a problem hiding this comment.
⚪ LOW RISK
The variables total_zipcode_time and total_decoder_time are assigned a value but never used. Either implement the missing timing logic or remove them to clean up the code.
| REQUIRE(distance_index.minimum_distance(node_id1, rev1, offset1, node_id2, rev2, offset2, false, &graph) == dijkstra_distance); | ||
| } | ||
|
|
||
| TEST_CASE( "Distance index can query out of a SNP with a reversing allele as an oversided snarl", |
There was a problem hiding this comment.
⚪ LOW RISK
Nitpick: Typo in test case name: 'oversided' should be 'oversized'.
| std::filesystem::file_time_type later_time = *std::max_element(later_times.begin(), later_times.end()); | ||
|
|
||
| // Return if the earlier files are touched no later than the later files. | ||
| return earlier_time <= later_time; |
There was a problem hiding this comment.
⚪ LOW RISK
Nitpick: The timestamp comparison logic correctly implements the requirement to ensure downstream indices are newer than their dependencies.
| pair<int64_t, int64_t> aligned_interval(const Alignment& aln); | ||
|
|
||
| /// Count the various types of edits in an Alignment, including individual gap lengths. | ||
| void count_alignment_operations(const Alignment& aln, size_t& matches, size_t& mismatches, std::vector<size_t>& gaps_lengths); |
There was a problem hiding this comment.
⚪ LOW RISK
Nitpick: The parameter name gaps_lengths in the header is inconsistent with its usage (gap_lengths) in the implementation file.
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| UnusedCode | 6 medium |
| ErrorProne | 4 high |
| Security | 1 critical |
| CodeStyle | 21 minor |
| Complexity | 36 medium |
🟢 Metrics 2493 complexity · 1262 duplication
Metric Results Complexity 2493 Duplication 1262
AI Reviewer: first review requested successfully. AI can make mistakes. Always validate suggestions.
TIP This summary will be updated as you push new changes.



Changelog Entry
To be copied to the draft changelog by merger:
Description
Hub labeling pull request test