bsaseq

Bulk Segregant Analysis for QTL mapping from pooled whole-genome sequencing data.

bsaseq identifies genomic loci controlling traits by comparing allele frequencies between phenotypically distinct bulked DNA pools. It provides a complete workflow from VCF input to annotated candidate genes.

Features

Multi-sample bulk support (pool technical replicates)
Sliding window analysis with tricube smoothing
Z-score and G-statistic for candidate region detection
Publication-quality genome-wide and regional plots
Optional snpEff integration for variant annotation
Gene-level candidate ranking

Installation

From PyPI (recommended)

pip install bsaseq

From Bioconda

conda install -c bioconda bsaseq

From source

git clone https://github.com/rcac-bioinformatics/bsaseq.git
cd bsaseq
pip install -e .

Dependencies

Required:

Python 3.9-3.13
cyvcf2, numpy, scipy, matplotlib, click, rich

Optional:

snpEff (for variant annotation)

Quick start

# Basic analysis
bsaseq run \
    --vcf joint_calls.vcf.gz \
    --high-bulk mutant_pool \
    --low-bulk wildtype_pool \
    --out results/my_analysis

# With multiple samples per bulk
bsaseq run \
    --vcf joint_calls.vcf.gz \
    --high-bulk "mut_rep1,mut_rep2" \
    --low-bulk "wt_rep1,wt_rep2" \
    --out results/my_analysis

# With annotation
bsaseq run \
    --vcf joint_calls.vcf.gz \
    --high-bulk mutant_pool \
    --low-bulk wildtype_pool \
    --out results/my_analysis \
    --annotate \
    --snpeff-db Sorghum_bicolor

Commands

Command	Description
`bsaseq run`	Complete BSA analysis pipeline
`bsaseq samples`	List sample names in VCF
`bsaseq plot`	Regenerate plots from existing output
`bsaseq annotate`	Annotate candidates with snpEff
`bsaseq check-snpeff`	Verify snpEff installation

Output files

File	Description
`*_variants.tsv`	Per-variant allele frequencies
`*_windows.tsv`	Sliding window statistics
`*_regions.tsv`	Candidate genomic regions
`*_regions.bed`	Candidate regions in BED format
`*_candidates.tsv`	Filtered candidate variants
`*_annotated_candidates.tsv`	Candidates with snpEff annotation
`*_candidate_genes.tsv`	Gene-level summary
`*_summary.txt`	Analysis summary report
`*_genome_wide.png/pdf`	Genome-wide Manhattan plot
`_region_.png/pdf`	Zoomed candidate region plots
`*_af_distribution.png`	Allele frequency diagnostics
`*_depth_distribution.png`	Read depth diagnostics

Parameters

Filtering

Parameter	Default	Description
`--min-dp`	10	Minimum read depth per sample
`--max-dp`	200	Maximum read depth (excludes repeats)
`--min-gq`	20	Minimum genotype quality
`--min-qual`	30	Minimum variant QUAL score

Window analysis

Parameter	Default	Description
`--window-size`	1000000	Window width in bp
`--step-size`	250000	Step between windows in bp
`--min-variants`	5	Minimum variants per window

Candidate detection

Parameter	Default	Description
`--z-threshold`	3.0	Z-score cutoff for significance
`--mode`	recessive	Inheritance mode (recessive/dominant)

Methodology

Delta allele frequency (delta-AF)

For each SNP, bsaseq calculates the difference in alternate allele frequency between the high bulk (mutant pool) and low bulk (wild-type pool):

delta_AF = AF_high - AF_low

For a recessive causal mutation:

In the mutant bulk: individuals are homozygous mutant, so AF ~ 1.0
In the wild-type bulk: individuals are homozygous reference, so AF ~ 0.0
Expected delta_AF ~ 1.0 at the causal locus

Sliding window analysis

Variants are analyzed in overlapping genomic windows to reduce noise:

Tricube-weighted smoothing: Variants near window center contribute more
G-statistic: Tests for significant allele frequency differences
Z-score normalization: Genome-wide standardization for peak calling

Candidate region detection

Regions are identified where consecutive windows exceed the Z-score threshold. Adjacent significant regions within 500 kb are merged.

Candidate variant filtering

Within candidate regions, variants are filtered based on inheritance mode:

Mode	min_delta_AF	min_AF_high	max_AF_low
Recessive	0.8	0.9	0.1
Dominant	0.3	0.4	0.1

Input requirements

The VCF file must contain:

Biallelic SNPs (multiallelic and indels are skipped)
AD (allelic depth) FORMAT field for allele frequency calculation
GQ (genotype quality) FORMAT field for filtering (optional but recommended)

Recommended variant calling:

# GATK HaplotypeCaller (recommended)
gatk HaplotypeCaller -R ref.fa -I mutant_pool.bam -I wildtype_pool.bam -O calls.vcf.gz

# bcftools mpileup (alternative)
bcftools mpileup -f ref.fa mutant_pool.bam wildtype_pool.bam | bcftools call -mv -O z -o calls.vcf.gz

Examples

Example 1: Basic recessive mutation mapping

bsaseq run \
    --vcf pooled_calls.vcf.gz \
    --high-bulk mutant \
    --low-bulk wildtype \
    --out results/analysis \
    --mode recessive

Example 2: Dominant trait with relaxed thresholds

bsaseq run \
    --vcf pooled_calls.vcf.gz \
    --high-bulk affected \
    --low-bulk unaffected \
    --out results/dominant \
    --mode dominant \
    --z-threshold 2.5 \
    --min-dp 5

Example 3: High-coverage data with annotation

bsaseq run \
    --vcf deep_seq.vcf.gz \
    --high-bulk mut1,mut2 \
    --low-bulk wt1,wt2 \
    --out results/annotated \
    --min-dp 30 \
    --max-dp 500 \
    --annotate \
    --snpeff-db Arabidopsis_thaliana

Example 4: Regenerate plots with different format

bsaseq plot \
    --windows results/analysis_windows.tsv \
    --regions results/analysis_regions.tsv \
    --variants results/analysis_variants.tsv \
    --out new_plots \
    --format pdf

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=bsaseq --cov-report=term-missing

# Type checking
mypy src/bsaseq

Citation

If you use bsaseq in your research, please cite:

bsaseq: A tool for Bulk Segregant Analysis of pooled sequencing data (in preparation)

License

MIT License - see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github/workflows		.github/workflows
docs		docs
paper		paper
recipes/bioconda		recipes/bioconda
src/bsaseq		src/bsaseq
tests		tests
workflows		workflows
.dockerignore		.dockerignore
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bsaseq

Features

Installation

From PyPI (recommended)

From Bioconda

From source

Dependencies

Quick start

Commands

Output files

Parameters

Filtering

Window analysis

Candidate detection

Methodology

Delta allele frequency (delta-AF)

Sliding window analysis

Candidate region detection

Candidate variant filtering

Input requirements

Examples

Example 1: Basic recessive mutation mapping

Example 2: Dominant trait with relaxed thresholds

Example 3: High-coverage data with annotation

Example 4: Regenerate plots with different format

Development

Citation

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

bsaseq

Features

Installation

From PyPI (recommended)

From Bioconda

From source

Dependencies

Quick start

Commands

Output files

Parameters

Filtering

Window analysis

Candidate detection

Methodology

Delta allele frequency (delta-AF)

Sliding window analysis

Candidate region detection

Candidate variant filtering

Input requirements

Examples

Example 1: Basic recessive mutation mapping

Example 2: Dominant trait with relaxed thresholds

Example 3: High-coverage data with annotation

Example 4: Regenerate plots with different format

Development

Citation

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages