Hublabel#4870
Conversation
… get a wrong answer
…bbdsg that makes labels that can fit
This reverts commit 5436d73.
…indexing test cases
planned out by Claude Opus 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
This reverts commit 7920c76.
This reverts commit b81e331.
initial plan by Claude Opus 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: GitHub Copilot <noreply@github.com>
This reverts commit 6cf266c.
adamnovak
left a comment
Here's my review, including a bunch of stuff I now want to change about code I committed.
The changes to libbdsg also need to be reviewed.
// re-preload right before cache_payloads. The double-preload is
// necessary: a single preload just before cache_payloads isn't enough
// to keep the index resident under the memory pressure of 32 parallel
// threads and the remaining in-memory data structures.
I don't think that's how page caching works; if it's paged in it's paged in, right? It can't possibly get more paged in if you add a second, earlier copy of the step where it got paged in.
We shouldn't be doing magic; if somehow this genuinely does improve performance we need to be able to explain why, in terms of something like a page eviction algorithm we can link to that's trying to be clever and really does care how long something has been cached.
This actually improves speed (at the cost of more memory usage). I don't know why yet.
// Step 2: Build pairing vector mapping each begin to its matching end
// and vice versa, using separate stacks for chains and snarls.
std::vector<size_t> pair_of(events.size());
pair_of is a bad name for this, it ought to be something like other_end.
(I'm pretty sure I put it in though.)
// page cache. We also preload eagerly right after loading the index (in
// minimizer_main.cpp) so the kernel treats those pages as recently-used;
// together the two preloads prevent cache_payloads from page-faulting on
// every node under the memory pressure of 32 parallel threads.
This doesn't really explain how two passes of preloading could possibly help, either.
 * - Normal snarl: all rows
 * - Oversized snarl: boundaries and tips
 * - size_limit == 0: no distances in index, so no rows
 * - Top-level chain distances only: ???
It would be good to figure this out and fill this in.
#ifdef debug_hub_label_build
// Dump CHOverlay graph to stderr for debugging
std::cerr << "=== CHOverlay Graph Dump ===" << std::endl;
std::cerr << "Vertices: " << num_vertices(ov) << ", Edges: " << num_edges(ov) << std::endl;
std::cerr << "--- Nodes ---" << std::endl;
for (auto v : boost::make_iterator_range(vertices(ov))) {
    const NodeProp& np = ov[v];
    std::cerr << "Node " << v << ": seqlen=" << np.seqlen
              << " max_out=" << np.max_out
              << " contracted_neighbors=" << np.contracted_neighbors
              << " level=" << np.level
              << " arc_cover=" << np.arc_cover
              << " contracted=" << (np.contracted ? "true" : "false")
              // Skip new_id since it is not initialized until make_contraction_hierarchy is run.
              << std::endl;
}
std::cerr << "--- Edges ---" << std::endl;
for (auto e : boost::make_iterator_range(edges(ov))) {
    const EdgeProp& ep = ov[e];
    std::cerr << "Edge " << source(e, ov) << " -> " << target(e, ov)
              << ": contracted=" << (ep.contracted ? "true" : "false")
              << " weight=" << ep.weight
              << " arc_cover=" << ep.arc_cover
              << " ori=" << (ep.ori ? "true" : "false") << std::endl;
}
std::cerr << "=== End CHOverlay Dump ===" << std::endl;
#endif
I think I put it here, but this could stand to become a CHOverlay method or debug function.
if ((temp_snarl_record.node_count > size_limit || size_limit == 0 || only_top_level_chain_distances) && (temp_snarl_record.is_root_snarl || start_normal_child)) {
    //If we don't care about internal distances, and we also are not at a boundary or tip
    //TODO: Why do we care about tips specifically?
    continue;
}
//getting here means snarl is not oversized
We only even get into this function now if we're not oversized, or if size_limit is 0 and we're not including distances at all. This should be reworked to understand that.
Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>
This goes with vgteam/libbdsg#239, for reference.
adamnovak
left a comment
I like the idea of breaking up all the stuff in snarl_distance_index.cpp into more distinct units. But I think the design and especially documentation of the units needs work, and we also need some overarching organization of those units that isn't just several cpp files that all belong to one header.
I think some of these pieces might make sense as a few different algorithms in vg::algorithms. There each algorithm could have its own header defining the interface, and then all the helper stuff to power each could be nicely tucked away in the cpp file for just that algorithm.
If instead we want to keep them organized under the theme of building a snarl distance index, we might give them a folder and a namespace that reflects that, and then again give each piece its own header. Then that whole namespace would have something that sort of constitutes its external interface (populate_snarl_index()?) and around there we would have some documentation on how the different responsibilities are spread across these modules.
I think there's also still outstanding stuff to do in vgteam/libbdsg#239 around dropping code from the oracle algorithms that aren't being used, and making sure we have a fresh file format version number, which I think is covered by the current review there. It looks like CI is failing here because the hash-constraining tests weren't actually dropped yet.
Dropped the hash tests. Moved refactor commits to
Changelog Entry
To be copied to the draft changelog by merger:
Description
Adds hub labeling functionality to the snarl distance index.