Question about graph complexity and downstream suitability of Minigraph-Cactus graph for a haploid fungal species

Hello Minigraph-Cactus developers,

I am working on pangenome graph construction for a haploid filamentous fungal species. I constructed the graph using Minigraph-Cactus with one reference genome assembly and 12 additional high-quality long-read assemblies. These additional assemblies represent isolates from different genetic populations within the species.

The reference genome has four main chromosomes. Minigraph-Cactus produced both a whole-genome graph and chromosome-specific graph files. I inspected the graph statistics using Bandage, vg stats, and direct GFA line counting. The whole-genome GBZ and GFA files gave identical values:

Whole graph:
- Nodes: 1,488,318
- Edges: 2,008,574
- Total graph sequence length: 42,189,450 bp
- Self-loops: 0
- Graph topology reported by vg stats: cyclic
- GFA W-walks: 57
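
For reference, the direct GFA line counting was along the lines of the sketch below. It is a minimal illustration, not the exact script: the function name `count_gfa_records` and the toy records are made up for this example, and it assumes a tab-separated GFA 1.1 file in which `S` lines carry the sequence in the third column and `L` lines carry the source and target segment names in columns 2 and 4.

```python
from collections import Counter


def count_gfa_records(lines):
    """Tally GFA record types, total segment length, and self-loop links."""
    counts = Counter()
    total_bp = 0
    self_loops = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        record = fields[0]
        if not record:
            continue  # skip blank lines
        counts[record] += 1
        if record == "S" and len(fields) >= 3 and fields[2] != "*":
            # S <name> <sequence>: add the segment's sequence length
            total_bp += len(fields[2])
        elif record == "L" and len(fields) >= 4 and fields[1] == fields[3]:
            # L <from> <orient> <to> <orient> ...: same segment on both ends
            self_loops += 1
    return counts, total_bp, self_loops


# Toy records (invented for illustration, not taken from the real graph).
example = [
    "H\tVN:Z:1.1\n",
    "S\t1\tACGT\n",
    "S\t2\tGG\n",
    "L\t1\t+\t2\t+\t0M\n",
    "W\tsample\t0\tchr1\t0\t6\t>1>2\n",
]
counts, total_bp, self_loops = count_gfa_records(example)
print(counts["S"], counts["L"], counts["W"], total_bp, self_loops)
```

On a real file the same function can be given an open file handle, since it only iterates over lines.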

The chromosome-specific GFA and VG files also gave matching node, edge, and length values:

| Chromosome | Nodes | Edges | Total graph length | Topology from vg stats |
| --- | --- | --- | --- | --- |
| Chromosome 1 | 405,066 | 544,626 | 14,050,215 bp | acyclic |
| Chromosome 2 | 482,467 | 649,895 | 11,901,753 bp | cyclic |
| Chromosome 3 | 327,048 | 443,005 | 10,760,110 bp | acyclic |
| Chromosome 4 | 293,690 | 394,827 | 12,662,612 bp | acyclic |
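
A quick arithmetic cross-check of the numbers above may also be useful. The sketch below only uses the values reported in this post; the variable names are arbitrary. It sums the per-chromosome statistics, compares them with the whole-graph statistics, and derives two rough complexity indicators (edges per node and mean node length).

```python
# Whole-graph statistics as reported above.
whole = {"nodes": 1_488_318, "edges": 2_008_574, "length_bp": 42_189_450}

# Per-chromosome statistics as reported above: (nodes, edges, length in bp).
per_chrom = {
    "Chromosome 1": (405_066, 544_626, 14_050_215),
    "Chromosome 2": (482_467, 649_895, 11_901_753),
    "Chromosome 3": (327_048, 443_005, 10_760_110),
    "Chromosome 4": (293_690, 394_827, 12_662_612),
}

keys = ("nodes", "edges", "length_bp")

# Sum each column across the four chromosome graphs.
totals = {k: sum(v[i] for v in per_chrom.values()) for i, k in enumerate(keys)}

# Positive values mean the chromosome graphs together exceed the whole graph.
diff = {k: totals[k] - whole[k] for k in keys}

# Rough complexity indicators for the whole graph.
edges_per_node = whole["edges"] / whole["nodes"]
mean_node_bp = whole["length_bp"] / whole["nodes"]

for k in keys:
    print(f"{k}: chromosome sum {totals[k]:,} vs whole graph {whole[k]:,} (diff {diff[k]:,})")
print(f"edges per node: {edges_per_node:.3f}")
print(f"mean node length: {mean_node_bp:.2f} bp")
```

With these values, the chromosome graphs sum to 1,508,271 nodes, 2,032,353 edges, and 49,374,690 bp, which is 19,953 nodes, 23,779 edges, and 7,185,240 bp more than the whole graph. I am not sure whether this difference is expected, for example if the whole-genome and chromosome-specific files come from different processing stages, which is only a guess on my part.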

I also checked chrom-subproblems/minigraph.split.log for possible chromosome-splitting issues using terms such as drop, failed, ambiguous, filtered, unassigned, reject, and rejected. I did not find any such problem lines. The log showed 48 assigned query contigs and 4 assigned reference contigs.
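
The log check was essentially a case-insensitive keyword scan, which can be sketched as follows. The function name `find_problem_lines` is made up for this example, and the sample lines are invented placeholders rather than real `minigraph.split.log` output; only the search terms come from the description above.

```python
import re

# Search terms listed above; "reject" also matches "rejected".
TERMS = ("drop", "failed", "ambiguous", "filtered", "unassigned", "reject", "rejected")
PATTERN = re.compile("|".join(re.escape(t) for t in TERMS), re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) for every line containing one of TERMS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Invented placeholder lines, NOT actual log output.
sample_log = [
    "placeholder line about an assigned contig\n",
    "placeholder line saying a contig was Filtered\n",
    "placeholder line with nothing notable\n",
]
hits = find_problem_lines(sample_log)
print(hits)
```

An empty result on the real log corresponds to what I observed, although a keyword scan can miss problems that the log words differently.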

I wanted to ask whether these node and edge counts look reasonable for a Minigraph-Cactus pangenome graph built from one reference and 12 additional long-read assemblies of a haploid fungal species. Given the number of assemblies and the fact that they come from different genetic populations, are these levels of graph complexity expected, or could they suggest that the graph is overly complex or contains problematic regions?

My main downstream goals are to use this graph for:

- short-read mapping with vg giraffe using 400+ additional isolates,
- graph-based variant and structural variant discovery,
- comparison of core, accessory, and private genomic regions,
- and downstream population genomic analyses.
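
To make the core/accessory/private comparison concrete, this is roughly the classification I have in mind. It is a sketch under my own assumptions: a node is "core" if all assemblies traverse it, "private" if exactly one does, and "accessory" otherwise. The function name, the input mapping (node ID to the set of assemblies whose walks pass through it), and the toy data are all hypothetical.

```python
def classify_nodes(node_to_samples, n_samples):
    """Label each node as core, accessory, or private by sample support."""
    labels = {}
    for node, samples in node_to_samples.items():
        support = len(samples)
        if support == n_samples:
            labels[node] = "core"  # present in every assembly
        elif support == 1:
            labels[node] = "private"  # present in exactly one assembly
        elif support == 0:
            labels[node] = "unsupported"  # no walk touches this node
        else:
            labels[node] = "accessory"  # present in some but not all
    return labels


# Toy example with three hypothetical assemblies (the real graph has 13).
toy = {
    "n1": {"ref", "iso1", "iso2"},
    "n2": {"iso1"},
    "n3": {"ref", "iso2"},
}
labels = classify_nodes(toy, n_samples=3)
print(labels)
```

Other thresholds (for example, a "soft core" present in most assemblies) would be equally defensible; the strict definitions here are only for illustration.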

Before proceeding further, I would be grateful for your advice on whether this graph seems suitable for downstream analysis as constructed. Would you recommend any additional graph QC, clipping, filtering, simplification, or validation steps before using it for vg giraffe mapping and structural variant analysis?

Thank you very much for your time and for developing and maintaining these tools. Any guidance from your experience would be greatly appreciated.

Question about graph complexity and downstream suitability of Minigraph-Cactus graph for a haploid fungal species #1934
