Skip to content

davide-colombo/eveDepth2

Repository files navigation

eveDepth2

eveDepth2 genotypes known non-retroviral endogenous viral element (nrEVE) loci in Aedes albopictus from short-read NGS data aligned to the AalbF5 reference genome. It takes population-organized host BAM files plus a YAML configuration, builds locus-specific competitive references, assigns per-specimen present or absent calls with explicit ambiguity states, summarizes results at population scale, and produces a global locus-by-population presence matrix together with downstream cross-population analyses, figures, similarity summaries, and a supplementary Excel table.

Installation

Use the packaged Conda environment and install the repository in editable mode so the evedepth2 console command is available locally. The current package metadata and environment target Python 3.10+, and Python 3.11 or newer is a sensible choice for new environments.

conda env create -f environment.yaml
conda activate eveDepth2
pip install -e .

After installation, confirm that the entry point is available:

which evedepth2

Configuration

eveDepth2 uses two YAML configuration files. config.yaml is the gitignored production pipeline configuration used by genotype, backfill-coverage, and finalize-panel; create it by copying config.sample.yaml and editing the real paths and thresholds for your project. export_config.yaml is the gitignored supplementary-export configuration used only by evedepth2 export; create it by copying export_config.sample.yaml and editing the pipeline-output and Excel-output paths it references. The inline comments in both sample files are the authoritative parameter-level documentation and should be read directly before a production run.

The top-level sections in config.sample.yaml are project, which defines the output root and external tool executables; logging, which controls terminal and file logging; resources, which sets worker counts and thread budgets; globals, which defines missing-value and coordinate conventions; inputs, which points to the reference genome, Stage_04 table, BUSCO table, unique-anchor BED, and population BAM hierarchy; outputs, which names the global wide summary file; params, which controls allele construction, extraction, genotypability, evidence filtering, calling thresholds, coverage annotation, minimap2 options, and BUSCO panel selection; and blocks, which optionally restricts the populations processed in a run. export_config.sample.yaml controls the pipeline output directory to read, the global wide file to merge, the Excel workbook to write, the coverage thresholds used for the supplementary-table X value, the sheet name, and the metadata and call columns carried into the workbook.

Commands

All workflows run through the unified evedepth2 dispatcher. The commands below are ordered from core pipeline execution through optional post-processing and downstream analysis.

genotype

evedepth2 genotype -c config.yaml

genotype is the main pipeline run. It reads the production pipeline configuration, validates tool availability and input paths, discovers populations under inputs.populations_root, builds per-locus competitive references, computes BUSCO depth intermediates, genotypes each specimen, and writes specimen-, population-, and global-level outputs under project.out_dir. During execution it reads the reference FASTA, the Stage_04 locus table, the BUSCO full table, the unique-anchor BED, and each specimen BAM matched by inputs.population_bam_glob.

Flag Type Default Meaning
-c, --config PATH required Path to config.yaml.

Example:

evedepth2 genotype -c /projects/nreve/config.yaml

Expected outputs:

  • global/loci_meta.jsonl.gz
  • global/locus_refs/<locus_id>/competitive.fasta
  • global/locus_refs/<locus_id>/ref_meta.json
  • global/busco_intermediates/<population>/<specimen>.busco_depth.tsv.gz
  • populations/<population>/specimens/<specimen_id>/calls.raw.tsv
  • populations/<population>/specimens/<specimen_id>/calls.tsv
  • populations/<population>/population_calls.tsv
  • populations/<population>/population_locus_summary.tsv
  • populations/<population>/metadata.updated.tsv
  • global/global_presence.wide.tsv
  • run_summary.json

The most important result columns are call_state, call_reason, specimen_depth, present_union_junction_reads, absent_junction_reads, left_flank_unique_cov_frac, right_flank_unique_cov_frac, insert_cov_frac, insert_mean_depth, insert_breadth, and insert_norm_depth. Common pitfalls are expecting a nonexistent --population CLI flag, forgetting to copy config.sample.yaml to config.yaml, or using a BAM layout that does not match inputs.populations_root, inputs.population_bam_glob, and inputs.specimen_id_regex.

finalize-panel

evedepth2 finalize-panel -c config.yaml

finalize-panel finalizes the BUSCO depth-normalization panel from BUSCO intermediates that genotype has already written. It does not rebuild global_presence.wide.tsv; it only creates the persisted BUSCO panel used for stable specimen-depth normalization.

Flag Type Default Meaning
-c, --config PATH required Path to config.yaml.

Example:

evedepth2 finalize-panel -c /projects/nreve/config.yaml

Expected outputs:

  • global/busco_panel.tsv
  • global/busco_panel.summary.json

The panel TSV reports gene_id, n_specimens_with_depth, completeness_fraction, mean_depth, and cv_depth. Run this only after genotype has produced global/busco_intermediates/; if that directory is empty, there is no panel to finalize.

backfill-coverage

evedepth2 backfill-coverage \
  --output-dir /path/to/output_root \
  --populations-root /path/to/population_bams \
  --config config.yaml \
  [--populations pop1 pop2 ...] \
  [--dry-run] \
  [--workers 1]

backfill-coverage retroactively annotates existing call tables with insert-coverage metrics. It reuses the pipeline config and the host BAMs to compute insert_mean_depth, insert_breadth, and insert_norm_depth, then rewrites calls.tsv and calls.raw.tsv in place inside an existing output directory.

Flag Type Default Meaning
--output-dir PATH required Pipeline output root containing populations/ and global/.
--populations-root PATH required Root directory containing per-population host BAM folders; normally the same location as inputs.populations_root.
--config PATH required Path to the pipeline config YAML.
--populations STR ... all populations in --output-dir Optional population allowlist.
--dry-run flag False Resolve, validate, and process without writing files.
--workers INT 1 Specimen-level parallelism.

Example:

evedepth2 backfill-coverage \
  --output-dir /projects/nreve/out \
  --populations-root /data/aalb/populations \
  --config /projects/nreve/config.yaml \
  --populations cali tokyo \
  --workers 4

Expected outputs:

  • Updated populations/<population>/specimens/<specimen_id>/calls.tsv
  • Updated populations/<population>/specimens/<specimen_id>/calls.raw.tsv

This command logs how many rows were updated and how many NA values were emitted per coverage column. It reads global/loci_meta.jsonl.gz and, when present, global/busco_panel.tsv. Run it only after a successful genotype run has created the output tree. A common failure mode is pointing --output-dir at populations/ instead of the real output root.

calibrate

evedepth2 calibrate --output-dir /path/to/output_root --out /path/to/calibration.tsv

calibrate summarizes the insert-coverage distributions already present in calls.raw.tsv. It scans the existing pipeline output root, extracts the three coverage columns for supported call categories, writes a flat TSV, and prints median, p10, and p90 summaries to stdout.

Flag Type Default Meaning
--output-dir PATH required Pipeline output root directory, specifically the directory containing populations/.
--out PATH required TSV path for the flattened calibration table.

Example:

evedepth2 calibrate \
  --output-dir /projects/nreve/out \
  --out /projects/nreve/analysis/coverage_calibration.tsv

Expected outputs:

  • A flat TSV with population, specimen_id, locus_id, call, insert_mean_depth, insert_breadth, and insert_norm_depth
  • Per-category percentile summaries printed to stdout

The command reads calls.raw.tsv, not calls.tsv, because calibration is intended to use the archival evidence surface. Run it after genotype or backfill-coverage has populated the coverage columns. The usual mistake is pointing --output-dir at a single population directory instead of the output root.

cross-pop

evedepth2 cross-pop --data-root /path/to/output_root [--output-dir /path/to/cross_pop_run]

cross-pop aggregates the per-population outputs into locus-level cross-population statistics. It calculates present fractions, Wilson confidence intervals, classification labels, differentiation tests, evidence-quality summaries, and specimen-level QC summaries, then writes a timestamped analysis run directory unless --output-dir is explicitly provided.

Flag Type Default Meaning
--data-root PATH /Users/davidecolombo/Desktop/eveDepth4_output_DEFINITIVE Root directory containing populations/ and global/.
--output-dir PATH None Explicit output directory for this analysis run; if omitted, a timestamped run is created under <data_root>/cross_population_analysis/.

Example:

evedepth2 cross-pop \
  --data-root /projects/nreve/out \
  --output-dir /projects/nreve/out/cross_population_analysis/run_paper_v1

Expected outputs:

  • tables/locus_annotations.tsv
  • tables/population_call_state_summary.tsv
  • tables/global_presence_matrix.tsv
  • tables/locus_global_classification.tsv
  • tables/family_summary.tsv
  • tables/accession_summary.tsv
  • tables/cluster_summary.tsv
  • tables/specimen_summary.tsv
  • tables/ambiguous_call_reasons.tsv
  • tables/evidence_quality_summary.tsv
  • tables/polymorphic_loci_detail.tsv
  • summary_report.txt

The most useful columns in global_presence_matrix.tsv are locus_id, the per-population *.present_fraction_over_decided columns, and the matching *.decided_denominator columns. locus_global_classification.tsv adds global_classification, diff_test_pvalue, and diff_test_pvalue_fdr. Run this after the populations of interest have been genotyped. If you omit --output-dir, every invocation creates a new timestamped run directory.

figures

evedepth2 figures --data-root /path/to/output_root [--input-dir /path/to/cross_pop_run] [--output-dir /path/to/figures_dir]

figures renders publication-style figures from a cross-pop run. If --input-dir is omitted, it auto-discovers the latest run_* directory under <data_root>/cross_population_analysis/; if --output-dir is omitted, it writes the figures into that run's figures/ subdirectory.

Flag Type Default Meaning
--data-root PATH /Users/davidecolombo/Desktop/eveDepth4_output_DEFINITIVE Root directory used when auto-discovering the latest cross-population run.
--input-dir PATH None Explicit cross-population run directory containing a tables/ subfolder.
--output-dir PATH None Directory to write the generated figures; defaults to <input_dir>/figures.

Example:

evedepth2 figures \
  --data-root /projects/nreve/out \
  --input-dir /projects/nreve/out/cross_population_analysis/run_paper_v1 \
  --output-dir /projects/nreve/out/cross_population_analysis/run_paper_v1/figures_final

Expected outputs:

  • Paired PDF and PNG figure files
  • figures_manifest.tsv

figures_manifest.tsv records figure_id, filename_pdf, filename_png, source_tables, and description. Run cross-pop first. If you rely on auto-discovery, make sure the latest run_* directory is the one you want.

similarity

evedepth2 similarity --data-root /path/to/output_root [--output-dir /path/to/similarity_dir]

similarity performs the population-similarity analysis based on the latest cross-population run. It resolves the newest run_* directory under <data-root>/cross_population_analysis/ and, if none exists there, falls back to a legacy cross_pop_runs/ directory.

Flag Type Default Meaning
--data-root PATH required Pipeline output root containing cross-population analysis outputs.
--output-dir PATH <data-root>/cross_population_analysis Directory to write the similarity results.

Example:

evedepth2 similarity \
  --data-root /projects/nreve/out \
  --output-dir /projects/nreve/out/cross_population_analysis/similarity_paper_v1

Expected outputs:

  • subset_A_all_informative.tsv
  • subset_B_polymorphic.tsv
  • subset_C_high_confidence.tsv
  • braycurtis_subsetA.tsv
  • braycurtis_subsetB.tsv
  • braycurtis_subsetC.tsv
  • jaccard_subsetA.tsv
  • jaccard_subsetB.tsv
  • jaccard_subsetC.tsv
  • Figure files
  • similarity_analysis_report.txt

At startup the command prints the legacy hardcoded input and output paths and their replacements. The subset tables contain locus_id, global_classification, flag_high_decoy, status, and reason. Run cross-pop first. If you accept the default output directory, the similarity results are written directly under <data-root>/cross_population_analysis, alongside the run_* directories.

export

evedepth2 export --config export_config.yaml

export creates the supplementary Excel workbook described by export_config.yaml. The config file tells it where the pipeline output lives, which global_presence.wide.tsv file to merge, which worksheet name to use, which metadata columns to include, and which coverage thresholds define the X value in the final table.

Flag Type Default Meaning
--config PATH required Path to export_config.yaml.

Example:

evedepth2 export --config /projects/nreve/export_config.yaml

Expected outputs:

  • A formatted Excel workbook written to output_excel

The command reads the configured global_presence_file plus every parseable calls.tsv under the configured output_dir, computes the population-level X and Y values, and writes the workbook to output_excel. The exported sheet combines locus metadata with one column per population. The production threshold block in export_config.yaml is coverage_presence_threshold, containing insert_norm_depth and insert_breadth. Run this after genotype has produced global_presence.wide.tsv and the specimen-level calls.tsv files referenced by the export config.

Output Structure

The exact output tree depends on which optional commands you run, but a typical project that has completed genotyping, BUSCO-panel finalization, cross-population analysis, figure generation, and similarity analysis looks like this:

<out_root>/
├── global/
│   ├── busco/
│   │   ├── busco_genes.json.gz
│   │   └── busco_retained.bed
│   ├── busco_intermediates/<population>/<specimen>.busco_depth.tsv.gz
│   ├── busco_panel.tsv
│   ├── busco_panel.summary.json
│   ├── loci_meta.jsonl.gz
│   ├── locus_refs/<locus_id>/
│   │   ├── competitive.fasta
│   │   └── ref_meta.json
│   └── global_presence.wide.tsv
├── populations/<population>/
│   ├── metadata.updated.tsv
│   ├── population_calls.tsv
│   ├── population_locus_summary.tsv
│   ├── specimen_failures.tsv
│   └── specimens/<specimen_id>/
│       ├── calls.raw.tsv
│       ├── calls.tsv
│       └── specimen.log
├── cross_population_analysis/
│   ├── run_YYYYMMDD_HHMMSS/
│   │   ├── tables/
│   │   │   ├── accession_summary.tsv
│   │   │   ├── ambiguous_call_reasons.tsv
│   │   │   ├── cluster_summary.tsv
│   │   │   ├── evidence_quality_summary.tsv
│   │   │   ├── family_summary.tsv
│   │   │   ├── global_presence_matrix.tsv
│   │   │   ├── locus_annotations.tsv
│   │   │   ├── locus_global_classification.tsv
│   │   │   ├── polymorphic_loci_detail.tsv
│   │   │   ├── population_call_state_summary.tsv
│   │   │   └── specimen_summary.tsv
│   │   ├── figures/
│   │   │   ├── *.pdf
│   │   │   ├── *.png
│   │   │   └── figures_manifest.tsv
│   │   └── summary_report.txt
│   ├── braycurtis_subsetA.tsv
│   ├── braycurtis_subsetB.tsv
│   ├── braycurtis_subsetC.tsv
│   ├── jaccard_subsetA.tsv
│   ├── jaccard_subsetB.tsv
│   ├── jaccard_subsetC.tsv
│   ├── subset_A_all_informative.tsv
│   ├── subset_B_polymorphic.tsv
│   ├── subset_C_high_confidence.tsv
│   └── similarity_analysis_report.txt
└── run_summary.json

calls.raw.tsv preserves the full evidence vector for each specimen and locus combination, while calls.tsv is the primary specimen-facing call table. population_calls.tsv concatenates all successful specimen call tables in a population, population_locus_summary.tsv collapses those calls into per-locus counts and present fractions, and metadata.updated.tsv joins the locus metadata back onto that summary. global_presence.wide.tsv is the project-wide matrix used by downstream analyses. busco_panel.tsv and busco_panel.summary.json are written by finalize-panel. cross_population_analysis/run_*/tables is written by cross-pop, figures/ is written by figures, and the similarity matrices and reports at the top level of cross_population_analysis/ are written by similarity. The supplementary Excel workbook written by export goes wherever output_excel points in export_config.yaml, so it may live outside <out_root>.

Genotyping Algorithm

The pipeline is locus-centric. For each known insertion site, eveDepth2 builds a competitive reference containing a present allele, an absent allele, and any relevant decoy sequences from related loci. Host-genome BAM reads are extracted from the locus window, remapped against that competitive reference, and converted into fragment-level evidence. The core biological question is whether the specimen shows reads that support the viral insert joining the flanking host genome, or reads that support the empty host junction in the absence of the insert.

Before any specimen-level decision is made, each locus is assessed for genotypability. eveDepth2 requires enough unique host-flank sequence to anchor junction-spanning reads unambiguously. Loci that do not satisfy the configured unique-anchor rule are labeled uncallable for all specimens. This is a property of the locus context, not a property of the specimen.

Evidence is then assembled from competitively remapped fragments. Present evidence comes from reads or read pairs that span the left or right insert junction. Absent evidence comes from reads spanning the empty-site junction. Internal insert coverage and unique-flank coverage are also tracked. The pipeline uses weighted fragment allocation when a fragment maps to more than one target, so decoy competition and multi-mapping are explicit parts of the evidence model rather than hidden artifacts.

Five specimen-level call states are possible. present means the specimen has sufficient present-junction support, acceptable flank support, and low enough decoy competition to support the insertion. absent means the specimen has sufficient empty-site support, low enough insert signal, and acceptable decoy competition. ambiguous means evidence exists but is internally inconsistent, such as asymmetric junction support, competing present and absent signals, or high decoy involvement. low_coverage means the specimen or locus does not have enough usable evidence to support a decision. uncallable means the locus itself lacks enough unique anchor sequence for short-read genotyping.

Decoy handling is central to interpretation. Related loci, duplicate loci, or loci sharing high sequence similarity can attract the same fragments. eveDepth2 measures how much insertion-informative evidence is allocated to decoy targets and carries those fractions into the decision process. High decoy burden does not imply biological absence, but it does reduce confidence that the evidence can be assigned to one locus cleanly.

Specimen depth is normalized with BUSCO genes. For each specimen, eveDepth2 computes a depth proxy from a panel of single-copy BUSCO loci and uses that scalar both for call gating and for normalized insert-coverage metrics. This makes depth-dependent thresholds interpretable across specimens sequenced at different coverage.

The primary outputs preserve the full evidence surface needed for interpretation. calls.tsv and calls.raw.tsv record the final call together with junction counts, flank-coverage fractions, insert-coverage fractions, decoy fractions, specimen depth, and the insert-coverage annotation columns. Population summaries then aggregate those specimen-level decisions into per-locus present, absent, ambiguous, low_coverage, and uncallable counts.

Coverage Annotation

Junction-based genotyping asks how the viral insert connects to its flanking host genome. Insert-coverage annotation asks a complementary question: does the original host-genome BAM contain reads mapping across the insert body itself? These evidence streams answer different biological questions. A locus can fail junction-based genotyping because the flanks do not contain enough unique anchor sequence, yet the insert interval can still be robustly covered by reads and therefore support biological presence. Conversely, weak insert coverage is not evidence of absence, because low values can result from low specimen depth, poor local mappability, or strict read filtering.

Three columns encode this annotation. insert_mean_depth is the mean per-base depth across the insert interval only. insert_breadth is the fraction of insert bases covered by at least one read. insert_norm_depth is the mean insert depth normalized by the specimen BUSCO depth proxy. Flanks are excluded deliberately because host flanks are expected to be covered whether the insert is present or absent, and including them would erase the biological signal. Duplicates, secondary alignments, and supplementary alignments are excluded for consistency with the rest of the pipeline.

Normalization uses BUSCO genes because specimens are sequenced at different depths. The BUSCO depth proxy is the median depth across a panel of complete, single-copy BUSCO genes. Under that scaling, a normalized insert depth near 1.0 is consistent with diploid presence, substantially higher values can reflect higher-than-diploid copy number or multi-locus mapping, and lower values can reflect partial coverage, hemizygosity, or absence. A hemizygous insertion is expected to produce insert_norm_depth near 0.5 because only one homolog contributes insert-derived reads relative to a diploid BUSCO baseline.

Coverage annotation is useful precisely where junction evidence becomes conservative. uncallable loci often reflect a failure of flank uniqueness rather than a biological absence of the insertion, and ambiguous loci can reflect real insert signal mixed with mapping competition or asymmetric junction support. The coverage columns therefore provide an orthogonal way to prioritize likely-present loci among non-definitive call states without changing the primary decision tree.

Calibration aggregates these metrics across call categories to show how confirmed present calls compare with ambiguous, low_coverage, and uncallable loci. The current global summary across 11 populations and 179 specimens is:

call insert_mean_depth_median insert_mean_depth_p10 insert_mean_depth_p90 insert_breadth_median insert_breadth_p10 insert_breadth_p90 insert_norm_depth_median insert_norm_depth_p10 insert_norm_depth_p90
present 24.0000 12.1800 42.0700 1.0000 0.9821 1.0000 1.1732 0.5985 1.9610
ambiguous 13.6550 2.3730 29.9040 0.9709 0.3979 1.0000 0.6731 0.1190 1.4033
low_coverage 1.9400 0.0000 20.8250 0.3130 0.0000 1.0000 0.1108 0.0000 1.1511
uncallable 20.5100 3.1800 42.5910 1.0000 0.5859 1.0000 1.0503 0.1556 1.9983

The dominant pattern is that uncallable loci remain close to present on central tendency, supporting the interpretation that many uncallable loci are biologically present but fail the flank-uniqueness requirement. ambiguous loci remain intermediate and heterogeneous, while low_coverage remains weakly informative. The earlier Rome pilot showed the same qualitative pattern:

call insert_mean_depth_median insert_mean_depth_p10 insert_mean_depth_p90 insert_breadth_median insert_breadth_p10 insert_breadth_p90 insert_norm_depth_median insert_norm_depth_p10 insert_norm_depth_p90
present 21.1700 10.7420 37.1960 1.0000 0.9866 1.0000 1.2328 0.6347 2.1257
ambiguous 11.3550 2.1970 23.6530 0.9697 0.3753 1.0000 0.6768 0.1394 1.3605
low_coverage 1.2800 0.0000 19.7020 0.1864 0.0000 1.0000 0.0715 0.0000 1.1549
uncallable 19.8500 3.9630 39.5040 1.0000 0.7118 1.0000 1.1696 0.2293 2.2306

Two threshold modes are justified by the global calibration. The specificity-oriented mode is insert_norm_depth >= 1.0 and insert_breadth >= 0.95. The sensitivity-oriented mode is insert_norm_depth >= 0.5 and insert_breadth >= 0.95. The breadth threshold remains conservative because the lower tail of confirmed-present breadth is still high, and the lower tail of confirmed-present normalized depth is biologically consistent with hemizygous insertions. These are alternative operating points rather than universal replacement rules.

The supplementary table uses a more permissive coverage criterion for its X value: insert_norm_depth >= 0.3 and insert_breadth >= 0.90. Those thresholds were selected empirically from a grid search on specimens with unambiguous present or absent genotypes. Lowering breadth from 0.95 to 0.90 recovered most false negatives and increased global sensitivity to approximately 0.977, while lowering depth from 0.5 to 0.3 retained genuine low-depth insertions without introducing observed false positives in the calibration set.

Cross-Population Analysis

For each locus and population, eveDepth2 tracks specimen-level call states present, absent, ambiguous, low_coverage, and uncallable. Cross-population summaries treat only present and absent as decided calls. The decided denominator is therefore present + absent, and the point estimate reported for a locus in a population is present_fraction_over_decided = present / (present + absent). This is explicitly a prevalence estimate conditional on resolved genotypes, not on all observations.

For each locus and population with a nonzero decided denominator, the analysis computes a 95% Wilson score interval. If k is the present count, n is the decided denominator, p̂ = k / n, and z = 1.96, then:

$$ \begin{aligned} \text{center} &= \frac{\hat{p}+z^2/(2n)}{1+z^2/n}, \\ \text{halfwidth} &= \frac{z}{1+z^2/n}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}+\frac{z^2}{4n^2}} \end{aligned} $$

$$ \mathrm{CI}_{95%}=[\text{center}-\text{halfwidth},\ \text{center}+\text{halfwidth}] $$

Wilson intervals are used because they behave better than Wald intervals near 0 and 1 and are less conservative than Clopper-Pearson intervals in this setting.

Loci are then classified across populations. fixed_present requires the lower Wilson bound to exceed 0.95 in every population with decided data. near_fixed_present requires a present fraction of at least 0.90 in every such population but fails the stricter Wilson criterion in at least one. fixed_absent requires the upper Wilson bound to remain below 0.05 in every population with decided data. near_fixed_absent requires a present fraction of at most 0.10 in every population but fails the stricter Wilson criterion in at least one. polymorphic means at least one population has a present fraction strictly between 0 and 1. population_differentiated means the between-population differentiation test remains significant after false-discovery-rate correction. low_data means no population has decided data or every population with decided data has fewer than three decided specimens. These labels are intentionally non-exclusive.

Population differentiation is tested on a 2 × P table of present and absent counts across populations. If P = 2, the analysis uses a two-sided Fisher exact test. If P > 2, it uses a permutation test on the Pearson chi-square statistic with no Yates correction. Under the null, present and absent labels are exchangeable across populations. A pooled label vector preserving the total number of present calls is shuffled repeatedly, repartitioned into population blocks of sizes n_i, and converted into a permuted chi-square statistic. The p-value is the fraction of permutations with statistic greater than or equal to the observed value, using Laplace smoothing:

$$ p = \frac{c+1}{B+1} $$

where c is the number of permutations meeting or exceeding the observed statistic and B is the number of permutations. Raw p-values are stored in diff_test_pvalue. Benjamini-Hochberg correction is then applied across all tested loci, and the adjusted value is stored in diff_test_pvalue_fdr. A locus is labeled population_differentiated if diff_test_pvalue_fdr < 0.05.

Missingness is summarized rather than corrected. Per locus and population, ambiguous_rate = ambiguous / total_observations over the full call-state total, and undecided_rate = (ambiguous + low_coverage + uncallable) / total_observations. At locus scale, the median ambiguous rate and median undecided rate across populations are reported, and flag_high_missingness is set when the median undecided rate reaches 0.40 or higher. This flag identifies loci where the decided subset may be non-random with respect to latent true presence state.

Population similarity analyses operate on three locus subsets: all informative loci, polymorphic loci, and high-confidence loci. For each subset, the similarity workflow computes Bray-Curtis dissimilarity and Jaccard similarity matrices between populations, then performs hierarchical clustering and ordination-based visualization. Bray-Curtis emphasizes abundance-like differences in present-fraction profiles, while Jaccard emphasizes overlap in binary presence patterns. The resulting matrices are written as square population-by-population tables and are intended to complement, not replace, the locus-wise differentiation framework.

The figure outputs should be interpreted with the same denominators in mind. The global heatmap displays present_fraction_over_decided. Population call-state composition plots are quality-control views over all locus-specimen observations, not prevalence estimators. Polymorphic strip plots use marker size to encode decided denominator. Family-level summaries and cluster-level plots are descriptive views of patterned variation, not dependence-corrected inferential models.

Supplementary Table

The supplementary Excel table is a locus-level summary across 11 global populations of Aedes albopictus. Each population cell is reported as X (Y), combining two independent evidence streams: junction-based genotyping and insert-body coverage.

The static metadata columns describe locus annotation and genomic context. They include locus identifier, EVE identifier, copy index, viral family, GenBank accession, genomic coordinates, and self-BLAST cluster metadata such as cluster ID, cluster size, representative locus, cluster-representative status, and duplicate status.

Y is the fraction of specimens in a population for which the locus was genotyped as present by eveDepth2. The denominator is all specimens observed for that locus in that population, including unresolved calls. The numerator is the number of specimens with call_state == present in the corresponding calls.tsv output. Junction-based presence remains the primary and authoritative evidence stream because it is tied directly to reads crossing the host-flank to viral-insert boundaries.

X is the fraction of specimens in a population for which the locus shows sufficient insert-body coverage to support presence independently of the junction call state. The denominator is the same as for Y. The numerator is the number of specimens satisfying both insert_norm_depth >= 0.3 and insert_breadth >= 0.90. insert_norm_depth is the mean insert-body depth normalized by specimen-level BUSCO depth, and insert_breadth is the fraction of insert-body positions covered by at least one read. These metrics are computed from primary alignments only, with secondary, supplementary, and duplicate alignments excluded.

The X thresholds were selected empirically by treating junction-based present and absent calls as ground truth and sweeping depth thresholds from 0.10 to 1.50 in 0.05 increments and breadth thresholds from 0.50 to 1.00 in 0.05 increments. Optimization prioritized recall with the F2 score because false negatives directly depress estimated insertion prevalence. Initial thresholds of depth >= 0.5 and breadth >= 0.95 yielded a global sensitivity of 0.913 against junction-confirmed present calls. Failure analysis showed both a threshold-boundary effect around breadth 0.94 to 0.95 and locus-specific breadth ceilings at some loci such as aalbf5_EVE006|0 and aalbf5_EVE086|0. Lowering breadth to 0.90 and depth to 0.3 recovered most false negatives and increased global sensitivity to approximately 0.977 while preserving perfect observed precision across the calibration grid.

Divergence between X and Y is informative rather than anomalous. Y > X indicates specimens where junction evidence supports presence but insert-body coverage falls below the X threshold; this is biologically plausible for short inserts, repetitive insert interiors, or loci near the coverage threshold boundary. X > Y is less common and can arise when insert-body reads are present without clean junction support, which may reflect mapping artifact or structural complexity at the insertion boundaries. Persistent Y > X across all populations for one locus is a useful indicator of locus-specific mappability constraints.

The table is generated with:

evedepth2 export --config export_config.yaml

Repository Structure

The repository layout after documentation consolidation is:

eveDepth2/
├── analysis/                  # Installed analysis package and downstream CLI subcommands
│   ├── cross_pop_analysis.py
│   ├── export_supplementary.py
│   ├── make_figures.py
│   └── similarity_analysis.py
├── assets/                    # Static project assets
│   └── graphviz/
├── pipeline/                  # Installed pipeline package and core CLI subcommands
│   ├── backfill_coverage_annotation.py
│   ├── cli.py
│   ├── coverage_calibration.py
│   ├── finalize_global_panel.py
│   └── run_pipeline.py
├── scripts/                   # Helper shell scripts used outside the Python CLI
├── tests/                     # Automated tests for pipeline logic and export behavior
├── config.sample.yaml         # Annotated pipeline configuration template
├── environment.yaml           # Conda environment with runtime and analysis dependencies
├── export_config.sample.yaml  # Annotated supplementary-export configuration template
├── pyproject.toml             # Packaging metadata and console entry point
└── README.md                  # Consolidated user and methods documentation

pipeline/ contains the genotyping engine and CLI dispatcher, analysis/ contains the installed downstream analysis commands, scripts/ keeps standalone shell helpers, and tests/ covers the core calling logic and supplementary export workflow.

About

nrEVE genotyping pipeline for Aedes albopictus — junction-based presence/absence calling across populations from short-read NGS data

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors