Hublabel by electricEpilith · Pull Request #4870 · vgteam/vg

electricEpilith · 2026-04-07T00:15:52Z

Changelog Entry

To be copied to the draft changelog by merger:

Add hub labeling to distance index, which allows efficient exact shortest distance queries even in "oversized" snarls
Bug fix for minimizer, significant speed improvement

Description

Adds hub labeling functionality to the snarl distance index.
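For readers unfamiliar with the technique, here is a minimal sketch of how a hub labeling (2-hop labeling) distance query works in general. It is not the libbdsg or vg implementation; `HubEntry`, `Label`, `NO_PATH`, and `hub_label_distance` are hypothetical names chosen for illustration. It assumes each vertex stores a forward label (distances from it to its hubs) and a backward label (distances from hubs to it), both sorted by hub ID, and that every shortest path passes through a hub shared by both labels.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// One label entry: a hub vertex and the distance to it (forward label)
// or from it (backward label). Hypothetical layout, for illustration only.
struct HubEntry {
    std::size_t hub;
    std::uint64_t dist;
};

// A label is a list of entries sorted by ascending hub ID.
using Label = std::vector<HubEntry>;

// Sentinel returned when the two labels share no hub.
constexpr std::uint64_t NO_PATH = std::numeric_limits<std::uint64_t>::max();

// Exact distance from u to v: the minimum of d(u, h) + d(h, v) over every
// hub h that appears in both u's forward label and v's backward label.
// Because both labels are sorted, one merge-style pass finds the shared hubs.
std::uint64_t hub_label_distance(const Label& forward_u, const Label& backward_v) {
    std::uint64_t best = NO_PATH;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < forward_u.size() && j < backward_v.size()) {
        if (forward_u[i].hub < backward_v[j].hub) {
            ++i;
        } else if (forward_u[i].hub > backward_v[j].hub) {
            ++j;
        } else {
            // Shared hub: combine the two halves of the path.
            // Assumes stored distances are small enough not to overflow.
            std::uint64_t total = forward_u[i].dist + backward_v[j].dist;
            if (total < best) {
                best = total;
            }
            ++i;
            ++j;
        }
    }
    return best;
}
```

With a forward label {(0, 5), (2, 1), (7, 3)} and a backward label {(2, 4), (3, 1), (7, 1)}, the shared hubs are 2 and 7, so the query returns min(1 + 4, 3 + 1) = 4. The query cost depends only on the label sizes, not on the size of the snarl, which is presumably why this helps with "oversized" snarls.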

adamnovak

Here's my review, including a bunch of stuff I now want to change about code I committed.

The changes to libbdsg also need to be reviewed.

adamnovak · 2026-04-07T21:39:43Z

+        // re-preload right before cache_payloads. The double-preload is
+        // necessary: a single preload just before cache_payloads isn't enough
+        // to keep the index resident under the memory pressure of 32 parallel
+        // threads and the remaining in-memory data structures.


I don't think that's how page caching works; if it's paged in it's paged in, right? It can't possibly get more paged in if you add a second, earlier copy of the step where it got paged in.

We shouldn't be doing magic; if somehow this genuinely does improve performance we need to be able to explain why, in terms of something like a page eviction algorithm we can link to that's trying to be clever and really does care how long something has been cached.

electricEpilith · 2026-04-08

This actually improves speed (at the cost of more memory usage). I don't know why yet.
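One way to move this from speculation to measurement would be to ask the kernel how much of the mapped index is actually resident at each point of interest (after the first preload, just before cache_payloads, and so on). The following is a rough sketch using the Linux `mincore(2)` call, not code from this PR; `count_resident_pages` is a hypothetical helper, and it assumes the index is available as a page-aligned mapped region.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

// Count how many pages of [addr, addr + length) the kernel reports as
// resident in memory. addr must be page-aligned, as mincore(2) requires.
// Uses the Linux signature of mincore (unsigned char* result vector).
std::size_t count_resident_pages(void* addr, std::size_t length) {
    const std::size_t page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t page_count = (length + page_size - 1) / page_size;
    std::vector<unsigned char> residency(page_count);
    if (mincore(addr, length, residency.data()) != 0) {
        throw std::runtime_error("mincore failed");
    }
    std::size_t resident = 0;
    for (unsigned char flags : residency) {
        // Only the least significant bit is defined: 1 means resident.
        resident += flags & 1u;
    }
    return resident;
}
```

Logging this count with and without the earlier preload would show whether the second preload finds pages already evicted, which is the behavior the code comment claims but does not demonstrate.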

adamnovak · 2026-04-07T21:46:14Z

+
+    // Step 2: Build pairing vector mapping each begin to its matching end
+    // and vice versa, using separate stacks for chains and snarls.
+    std::vector<size_t> pair_of(events.size());


pair_of is a bad name for this, it ought to be something like other_end.

(I'm pretty sure I put it in though.)
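For context, the pairing step in the diff comment can be sketched as below, using the suggested `other_end` name. This is an illustrative reconstruction from the comment alone, not the code in the PR: the `EventKind` enum and `build_other_end` function are hypothetical, and the sketch assumes the events form a properly nested sequence of chain and snarl begin/end markers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical event kinds for a flattened snarl decomposition traversal.
enum class EventKind { ChainBegin, ChainEnd, SnarlBegin, SnarlEnd };

// For each event index, compute the index of its partner: each begin maps
// to its matching end and vice versa. Chains and snarls use separate stacks,
// as described in the diff comment.
std::vector<std::size_t> build_other_end(const std::vector<EventKind>& events) {
    std::vector<std::size_t> other_end(events.size());
    std::vector<std::size_t> open_chains;
    std::vector<std::size_t> open_snarls;
    for (std::size_t i = 0; i < events.size(); ++i) {
        switch (events[i]) {
        case EventKind::ChainBegin:
            open_chains.push_back(i);
            break;
        case EventKind::SnarlBegin:
            open_snarls.push_back(i);
            break;
        case EventKind::ChainEnd:
        case EventKind::SnarlEnd: {
            auto& open = (events[i] == EventKind::ChainEnd) ? open_chains : open_snarls;
            if (open.empty()) {
                throw std::runtime_error("end event without a matching begin");
            }
            // The most recently opened structure of this kind closes here.
            const std::size_t begin = open.back();
            open.pop_back();
            other_end[begin] = i;
            other_end[i] = begin;
            break;
        }
        }
    }
    if (!open_chains.empty() || !open_snarls.empty()) {
        throw std::runtime_error("begin event without a matching end");
    }
    return other_end;
}
```

For the event sequence chain begin, snarl begin, chain begin, chain end, snarl end, chain end, the result is {5, 4, 3, 2, 1, 0}: `other_end[i]` reads naturally as "the index of the other end of the structure event i belongs to", which is the point of the rename.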

adamnovak · 2026-04-07T22:21:12Z

+        // page cache. We also preload eagerly right after loading the index (in
+        // minimizer_main.cpp) so the kernel treats those pages as recently-used;
+        // together the two preloads prevent cache_payloads from page-faulting on
+        // every node under the memory pressure of 32 parallel threads.


This doesn't really explain how two passes of preloading could possibly help, either.

adamnovak · 2026-04-07T22:22:55Z

+ * - Normal snarl: all rows
+ * - Oversized snarl: boundaries and tips
+ * - size_limit == 0: no distances in index, so no rows
+ * - Top-level chain distances only: ??? 


It would be good to figure this out and fill this in.

adamnovak · 2026-04-07T22:25:58Z

+#ifdef debug_hub_label_build
+  // Dump CHOverlay graph to stderr for debugging
+  std::cerr << "=== CHOverlay Graph Dump ===" << std::endl;
+  std::cerr << "Vertices: " << num_vertices(ov) << ", Edges: " << num_edges(ov) << std::endl;
+  std::cerr << "--- Nodes ---" << std::endl;
+  for (auto v : boost::make_iterator_range(vertices(ov))) {
+    const NodeProp& np = ov[v];
+    std::cerr << "Node " << v << ": seqlen=" << np.seqlen
+              << " max_out=" << np.max_out
+              << " contracted_neighbors=" << np.contracted_neighbors
+              << " level=" << np.level
+              << " arc_cover=" << np.arc_cover
+              << " contracted=" << (np.contracted ? "true" : "false")
+              // Skip new_id since it is not initialized until make_contraction_hierarchy is run.
+              << std::endl;
+  }
+  std::cerr << "--- Edges ---" << std::endl;
+  for (auto e : boost::make_iterator_range(edges(ov))) {
+    const EdgeProp& ep = ov[e];
+    std::cerr << "Edge " << source(e, ov) << " -> " << target(e, ov)
+              << ": contracted=" << (ep.contracted ? "true" : "false")
+              << " weight=" << ep.weight
+              << " arc_cover=" << ep.arc_cover
+              << " ori=" << (ep.ori ? "true" : "false") << std::endl;
+  }
+  std::cerr << "=== End CHOverlay Dump ===" << std::endl;
+#endif


I think I put it here, but this could stand to become a CHOverlay method or debug function.

adamnovak · 2026-04-07T22:29:53Z

+        if ( (temp_snarl_record.node_count > size_limit || size_limit == 0 || only_top_level_chain_distances) && (temp_snarl_record.is_root_snarl || start_normal_child)) {
            //If we don't care about internal distances, and we also are not at a boundary or tip
            //TODO: Why do we care about tips specifically?
            continue;
        }
+        //getting here means snarl is not oversized


We only even get into this function now if we're not oversized, or if size_limit is 0 and we're not including distances at all. This should be reworked to understand that.

adamnovak · 2026-05-07T17:10:09Z

This goes with vgteam/libbdsg#239, for reference.

adamnovak

I like the idea of breaking up all the stuff in snarl_distance_index.cpp into more distinct units. But I think the design and especially the documentation of the units need work, and we also need some overarching organization of those units that isn't just several cpp files that all belong to one header.

I think some of these pieces might make sense as a few different algorithms in vg::algorithms. There each algorithm could have its own header defining the interface, and then all the helper stuff to power each could be nicely tucked away in the cpp file for just that algorithm.

If instead we want to keep them organized under the theme of building a snarl distance index, we might give them a folder and a namespace that reflects that, and then again give each piece its own header. Then that whole namespace would have something that sort of constitutes its external interface (populate_snarl_index()?) and around there we would have some documentation on how the different responsibilities are spread across these modules.

adamnovak · 2026-05-07T19:09:55Z

I think there's also still outstanding stuff to do in vgteam/libbdsg#239 around dropping code from the oracle algorithms that aren't being used, and making sure we have a fresh file format version number, which I think is covered by the current review there.

It looks like CI is failing here because the hash-constraining tests weren't actually dropped yet.

electricEpilith · 2026-05-08T19:31:22Z

Dropped the hash tests. Moved refactor commits to hublabel-refactor.

electricEpilith and others added 30 commits November 12, 2025 14:32

some progress on hub label integration?

37848a6

hub labeling in (debugging not finished), also changes to deal with C…

297a11f

…++20 upgrade

Point at compatible libbdsg and get build working on Mac

9d4c2e2

Use the new indexing types and accessors to avoid fetching nodes by t…

8c13cf3

…he wrong thing

Use accessors so we can build the Tiny oversized snarl test index and…

788224d

… get a wrong answer

Try dumping hub label data for debugging

4985468

Add synthetic Boost graph dumping code, and missing semicolon, and li…

86e4e31

…bbdsg that makes labels that can fit

Merge remote-tracking branch 'origin/master' into hublabel

ee5bd54

Use libbdsg with slightly more implemented hub labeling integration

e84e657

Make sure NodeProp fields are not used before initialization

4f31496

Stop trying to look up removed trivial snarls

232a589

Add the debugging to subgraph finding that I needed to fix ChainRecor…

30e392a

…d asserts

Stop trying to interpret the root as a chain in debug prints

9639b68

Turn off debugging after passing existing snarl distance index tests

c0db406

Merge remote-tracking branch 'origin/master' into hublabel

ce1027f

Merge remote-tracking branch 'origin/hublabel' into hublabel

77c2ec2

Make randomized graph test actually exercise oversized snarls sometimes

163764f

Add function for loading a handlegraph from JSON

ddce5f4

Allow cactus-ifying all handle graphs

e56353f

Add synthetic fix for actually populating the unique_ptr right

4f66c25

Commit partial synthetic refactor to use new JSON load method

5436d73

Revert "Commit partial synthetic refactor to use new JSON load method"

2c3721d

This reverts commit 5436d73.

Replace string_to_graph with json2graph

695cff5

Remove a bunch of mostly unused functions for working with Protobuf G…

8ede8df

…raph objects

Mostly-automatically convert tests to use vg::io::json2graph

e472799

Remove duplicative JSON to graph function

904f445

Set up tiny test that breaks oversized snarl logic

809a766

Remove unused cases

f2d4f08

Fill in the dustances through oversized snarls to pass more distance …

caaa512

…indexing test cases

Add exhaustive test for small snarls

f15a5f9

electricEpilith and others added 11 commits April 2, 2026 11:21

move up libbdsg to upgrade snarl distance index version number

7569cc1

add (a substantial amount of) instrumentation for vg giraffe

b81e331

planned out by Claude Opus 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix abs() errors on Mac

528ec4e

Co-Authored-By: GitHub Copilot <noreply@github.com>

additional abs() fix

a3760d1

Co-Authored-By: GitHub Copilot <noreply@github.com>

minor print changes

7920c76

Co-Authored-By: GitHub Copilot <noreply@github.com>

Revert "minor print changes"

190469a

This reverts commit 7920c76.

Revert "add (a substantial amount of) instrumentation for vg giraffe"

3e573bd

This reverts commit b81e331.

second try at instrumentation

6cf266c

initial plan by Claude Opus 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: GitHub Copilot <noreply@github.com>

Revert "second try at instrumentation"

e5513aa

This reverts commit 6cf266c.

Merge pull request #4868 from electricEpilith/hublabel

bea2589

Hublabel

snarl distance index version number update

e9c3d40

adamnovak reviewed Apr 7, 2026

electricEpilith and others added 7 commits April 7, 2026 17:47

Fix typo in src/snarl_distance_index.cpp

e286bf5

Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>

Merge remote-tracking branch 'origin/master' into hublabel

283abfc

Refactor snarl decomposition capturing and flipping to make tests eas…

a8fc246

…ier to write

Move CHOverlay output to libbdsg

6c32359

Apply my own review suggestions

63f88d5

Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>

Use libbdsg with some review comments addressed

e12c989

Merge remote-tracking branch 'origin/hublabel' into hublabel

2493339

adamnovak requested changes May 7, 2026

electricEpilith closed this May 8, 2026

electricEpilith force-pushed the hublabel branch from 7761798 to 8a79278 Compare May 8, 2026 18:32

electricEpilith reopened this May 8, 2026

electricEpilith force-pushed the hublabel branch from ba9e2c6 to 2493339 Compare May 8, 2026 21:36

electricEpilith marked this pull request as draft May 8, 2026 21:37

electricEpilith added 2 commits May 8, 2026 15:27

Merge commits from master into hublabel

026f153

use correct libbdsg

c5ed39a
