Skip to content

Hublabel#1

Open
electricEpilith wants to merge 118 commits intomasterfrom
hublabel
Open

Hublabel#1
electricEpilith wants to merge 118 commits intomasterfrom
hublabel

Conversation

@electricEpilith
Copy link
Copy Markdown
Owner

Changelog Entry

To be copied to the draft changelog by merger:

  • Whatsits now frobnicated

Description

Hub labeling pull request test

electricEpilith and others added 30 commits November 12, 2025 14:32
faithokamoto and others added 27 commits April 23, 2026 11:26
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
planned out by Claude Opus 4.7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
planned by Claude Opus 4.7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix potentially very large loop when looking for rescue graph
Revise recombination, chain, and alignment scoring
Make `vg giraffe` error when the minimizers or zipcodes are older than the distance index
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 2, 2026

Copy link
Copy Markdown

@codacy-production codacy-production Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements significant Hub Labeling features and refactors the SnarlDistanceIndex. However, the PR is currently not up to Codacy standards. Key concerns include a critical lack of unit testing for several new core features: the log-scaled gap scoring logic, the 'evaluation bonus' heuristic in chaining, and the index validation logic.

Technically, there are several high-severity issues regarding uninitialized POD members in the clustering logic which could lead to non-deterministic behavior. Additionally, the new make_temporary_distance_index function exhibits extreme cyclomatic complexity (CCN 100), making it virtually untestable in its current form. These issues, combined with the empty PR description and removal of legacy Protobuf-handling functions without explicit deprecation, should be addressed before merging.

About this PR

  • The PR description is empty. Given the scope of changes—including C++20 migration, Hub Labeling implementation, and scoring refactors—a detailed summary of changes and design decisions is required.
  • The Hub Labeling implementation introduces a dependency on 'bdsg/ch.hpp'. Verify that the build environment and CI pipelines are updated to include this header.
1 comment outside of the diff
src/cluster.hpp

line 606 🔴 HIGH RISK
Suggestion: Mark this constructor as explicit to prevent implicit conversion from SnarlDistanceIndex to TipAnchoredMaxDistance.

    explicit TipAnchoredMaxDistance(SnarlDistanceIndex& distance_index);

Test suggestions

  • Verify Hub Labeling query accuracy in oversized snarls against Dijkstra ground truth.
  • Verify that snarl regularity checking correctly identifies simple, regular, and irregular snarls.
  • Test alignment rescoring logic with logged gap lengths for matches, mismatches, and various indel sizes.
  • Verify the 'evaluation bonus' heuristic in chaining preserves haplotype consistency compared to standard DP.
  • Ensure SnarlDecompositionFuzzer correctly replays and flips nested chain events.
  • Validate the index predates check in Giraffe correctly identifies stale indices.
Prompt proposal for missing tests
Consider implementing these tests if applicable:
1. Test alignment rescoring logic with logged gap lengths for matches, mismatches, and various indel sizes.
2. Verify the 'evaluation bonus' heuristic in chaining preserves haplotype consistency compared to standard DP.
3. Validate the index predates check in Giraffe correctly identifies stale indices.
Low confidence findings
  • Several sorting and cleaning functions in graph.cpp/hpp were removed. Ensure these are not part of the public API or that downstream external tools have been migrated to the HandleGraph equivalents.

TIP Improve review quality by adding custom instructions
TIP How was this review? Give us feedback

Comment thread src/cluster.hpp
class MEMClusterer::HitEdge {
public:
HitEdge(size_t to_idx, int32_t weight, int64_t distance) : to_idx(to_idx), weight(weight), distance(distance) {}
HitEdge() = default;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 HIGH RISK

The default constructor does not initialize the primitive members to_idx, weight, and distance. This can lead to garbage values being used during dynamic programming.

Suggested change
HitEdge() = default;
HitEdge() : to_idx(0), weight(0), distance(0) {}

See Issue in Codacy

Comment thread src/alignment.cpp
finish_gap();
}

int score_alignment_with_logged_gaps(const size_t& matches, const size_t& mismatches, const std::vector<size_t>& gap_lengths) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM RISK

The new scoring function 'score_alignment_with_logged_gaps' lacks corresponding unit tests in src/unittest/alignment.cpp to verify the log-scaling behavior against expected minimap2 scores for various indel sizes.

using namespace handlegraph;
namespace vg {

SnarlDistanceIndex::TemporaryDistanceIndex make_temporary_distance_index(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM RISK

This function is extremely complex (CCN 100). The logic for traversing the snarl decomposition and handling BEGIN/END events should be refactored into smaller, specialized private helper methods to improve testability and readability.

See Issue in Codacy

Comment thread src/gbwtgraph_helper.cpp

const handlegraph::HandleGraph* graph_ptr = (const handlegraph::HandleGraph*) &gbz.graph;

double total_zipcode_time = 0.0, total_decoder_time = 0.0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚪ LOW RISK

The variables total_zipcode_time and total_decoder_time are assigned a value but never used. Either implement the missing timing logic or remove them to clean up the code.

See Issue in Codacy

REQUIRE(distance_index.minimum_distance(node_id1, rev1, offset1, node_id2, rev2, offset2, false, &graph) == dijkstra_distance);
}

TEST_CASE( "Distance index can query out of a SNP with a reversing allele as an oversided snarl",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚪ LOW RISK

Nitpick: Typo in test case name: 'oversided' should be 'oversized'.

Comment thread src/index_registry.cpp
std::filesystem::file_time_type later_time = *std::max_element(later_times.begin(), later_times.end());

// Return if the earlier files are touched no later than the later files.
return earlier_time <= later_time;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚪ LOW RISK

Nitpick: The timestamp comparison logic correctly implements the requirement to ensure downstream indices are newer than their dependencies.

Comment thread src/alignment.hpp
pair<int64_t, int64_t> aligned_interval(const Alignment& aln);

/// Count the various types of edits in an Alignment, including individual gap lengths.
void count_alignment_operations(const Alignment& aln, size_t& matches, size_t& mismatches, std::vector<size_t>& gaps_lengths);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚪ LOW RISK

Nitpick: The parameter name gaps_lengths in the header is inconsistent with its usage (gap_lengths) in the implementation file.

@codacy-production
Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 1 critical · 4 high · 42 medium · 21 minor

Alerts:
⚠ 68 issues (≤ 0 issues of at least minor severity)

Results:
68 new issues

Category Results
UnusedCode 6 medium
ErrorProne 4 high
Security 1 critical
CodeStyle 21 minor
Complexity 36 medium

View in Codacy

🟢 Metrics 2493 complexity · 1262 duplication

Metric Results
Complexity 2493
Duplication 1262

View in Codacy

AI Reviewer: first review requested successfully. AI can make mistakes. Always validate suggestions.

Run reviewer

TIP This summary will be updated as you push new changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants