statgen

statgen provides Python, MATLAB/Octave, and R tools for working with reference-aligned statistical genetics data: reference variants, genotypes, LD, annotations, and GWAS summary statistics.

All panel objects use reference coordinates. Genome build, contig naming, and allele orientation should be resolved upstream by genomatch or an equivalent pipeline.

Where to Start

Preparing inputs and caches from PLINK, summary statistics, and BED files: Tutorial 1 builds an LD distribution and saves Python and MATLAB caches from source files.
Running Python analyses: Tutorial 2 runs LD operations, prunes variants, and fetches genotypes.
Running MATLAB analyses: Tutorial 3 runs the same analysis workflow in MATLAB.
Running R analyses: Tutorial 4 runs the same workflow with the R package.
Looking up MATLAB API help interactively: help statgen.
Finding installation commands, core concepts, or the API map: Input Layouts, Concepts That Matter, Setup, and Key APIs.
Following changes or getting help: Changelog and issue tracker.
Developer docs for implementing or changing statgen behavior: spec/SPEC.md.

Object Overview

Object	Meaning	Typical input
`ReferencePanel`	Ordered reference variant table	External PLINK `.bim`, either one file or an `@`-sharded path
`LDPanel`	Sparse LD distribution tied to a matching reference	Built LD distribution directory produced by `statgen` from PLINK bfiles
`AnnotationPanel`	Binary annotation matrix painted onto the reference	External BED files
`GenotypePanel`	Reference-aligned PLINK genotype metadata with on-demand hardcall access	External PLINK 1 bfile prefix, either one bfile or an `@`-sharded prefix
`Sumstats`	One aligned summary-statistics trait or source	External `.tsv.gz` with `chr`, `bp`, `a1`, `a2`, `p`, and optional columns such as `beta`, `se`, `z`, and `n`

Input Layouts

Reference input may be one .bim file or an @-sharded .bim template.
Genotype input may be one PLINK bfile prefix or an @-sharded bfile prefix.
LD input is a distribution directory with ld_manifest.json, not a single shard file or an @ template.
Sumstats input is one TSV/TSV.GZ file per trait or source, not a sharded input.
Annotation input is one BED file per annotation column; multiple BED files are annotations, not chromosome shards.
Cache loaders read one cache file and may optionally return a logical shard subset.
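
The @-sharded layouts above can be pictured as a template expansion. The sketch below is a hypothetical helper, not part of statgen. It assumes that `@` is a placeholder for the contig label, which this README does not spell out, and it uses the supported labels 1-22 and X. The path `ref/chr@.bim` is made up for illustration.

```python
# Hypothetical helper for illustration; not a statgen API.
# Assumption: "@" in a sharded path stands for the contig label.
SUPPORTED_CONTIGS = [str(i) for i in range(1, 23)] + ["X"]


def expand_sharded_path(template, labels=SUPPORTED_CONTIGS):
    """Return {label: path} for an @-template, or {None: path} for a single file."""
    if "@" not in template:
        # A plain path is treated as one unsharded input.
        return {None: template}
    if template.count("@") != 1:
        raise ValueError("expected exactly one '@' placeholder")
    return {label: template.replace("@", label) for label in labels}


paths = expand_sharded_path("ref/chr@.bim")
print(paths["1"])   # ref/chr1.bim
print(paths["X"])   # ref/chrX.bim
print(len(paths))   # 23
```

Only reference and genotype inputs accept this form; LD, sumstats, and annotation inputs are not @-sharded.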

Concepts That Matter

All analysis objects are aligned to a ReferencePanel.
Supported contig labels are 1-22 and X; labels are not normalized, so chr1/chrX inputs should be fixed upstream.
LD is distributed as a directory with an ld_manifest.json file, not as a single-file object cache like the other panels.
LD should be built from an unrelated, single-ancestry genotype reference panel appropriate for the downstream analyses.
Python builds the canonical ld_npz/ distribution. R prepares that same .npz distribution in place with an R reference-cache sidecar and optional extracted shard caches. MATLAB/Octave uses a separate ld_mat/ distribution converted from the Python-built LD shards because it needs native sparse .mat files.
LD panels preserve the full reference SNP axis. If a forced LD build contains monomorphic SNPs, LD involving those SNPs is undefined and represented by omitted sparse entries.
Genotype caches store metadata and source BED paths, not dense genotypes.
Sex-specific chrX LD shards use the sex labels from the PLINK .fam file.
Genotype .ploidy sidecars belong to genotype access, not LD reference inputs.
Missing genotype .ploidy sidecars imply (2, 2) ploidy for autosomes and (1, 2) for chrX, with a warning for chrX. Provide .ploidy for PAR/non-PAR mixtures or nonstandard chrX encodings.
Shard objects are internal; users work with panel objects and select_shards.
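
Because contig labels are not normalized, it can help to check them before loading anything. The following is a minimal pre-flight sketch, not a statgen function. The helper name is hypothetical, and it only encodes the rule stated above that supported labels are exactly 1-22 and X.

```python
# Illustrative pre-flight check; not a statgen API.
SUPPORTED_CONTIGS = {str(i) for i in range(1, 23)} | {"X"}


def unsupported_contig_labels(labels):
    """Return the sorted set of labels that are not exactly 1-22 or X."""
    return sorted(set(labels) - SUPPORTED_CONTIGS)


bad = unsupported_contig_labels(["1", "22", "X", "chr1", "chrX", "23", "MT"])
print(bad)  # ['23', 'MT', 'chr1', 'chrX']
```

A non-empty result means the input still needs to be fixed upstream, for example by genomatch.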

Setup

Python

Install from a cloned repository:

pip install -e python/

Then import APIs from the relevant submodule, for example:

from statgen.reference import load_reference
from statgen.sumstats import load_sumstats

MATLAB

From a cloned repository, add the MATLAB package folder to the path:

addpath('/path/to/statgen/matlab')

Alternatively, download the MATLAB bundle from a GitHub release and add that bundle's matlab folder to the path.

MATLAB help text is available for public functions, public panel classes, and their methods:

help statgen                           % package overview
help statgen.load_reference            % function help
help statgen.ReferencePanel            % class help
help statgen.ReferencePanel.select_shards
help statgen.LDPanel.multiply_r2       % method help

Octave is supported for small-scale testing and CI compatibility checks. Use MATLAB for production-scale MATLAB workflows, especially large LD .mat distributions. In Octave, the package overview is available with help statgen.Contents.

Key APIs

ReferencePanel

load_reference(path) loads a .bim file or an @-sharded .bim pattern.
panel.chr, panel.snp, panel.bp, panel.a1, and panel.a2 expose reference-coordinate variant fields.
panel.select_shards(shards) subsets by shard label.
panel.is_object_compatible(other) checks reference compatibility and returns false on mismatch rather than raising.
panel.save_cache(path) and load_reference_cache(...) save and reload reference caches.

LDPanel

load_ld(path, shards=None, default_chrX_sex=None) loads a sparse LD distribution and its manifest-declared reference cache.
load_ld_reference(path, shards=None) loads only the manifest-declared reference cache from an LD distribution root.
validate_ld_distribution(path) checks manifest, checksums, bundled references, and optionally sparse payload structure.
panel.a1freq(chrX_sex=None) returns reference-aligned allele frequencies.
panel.multiply_r2(M, chrX_sex=None) multiplies by LD r².
panel.select_shards(shards) subsets by shard label.
fast_prune(logpvec, ld_panel, r2_threshold=0.2, chrX_sex=None) performs LD pruning using aligned scores.
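
To make the idea of LD pruning with aligned scores concrete, here is a generic greedy pruning sketch on toy data. It is not the fast_prune implementation, whose algorithm this README does not describe. It assumes that logpvec holds -log10(p), that pairs with r² at or above the threshold count as linked, and that omitted sparse entries count as zero. The function name and data layout are hypothetical.

```python
# Generic greedy LD pruning on toy data; NOT statgen's fast_prune implementation.
def greedy_prune(logpvec, r2_pairs, r2_threshold=0.2):
    """Keep the strongest variant in each LD neighbourhood.

    logpvec  : list of -log10(p) scores, one per variant (None = missing).
    r2_pairs : {(i, j): r2} sparse symmetric LD; omitted pairs count as 0.
    Returns a list of booleans marking the variants that survive pruning.
    """
    n = len(logpvec)
    neighbours = {i: [] for i in range(n)}
    for (i, j), r2 in r2_pairs.items():
        if i != j and r2 >= r2_threshold:
            neighbours[i].append(j)
            neighbours[j].append(i)

    scored = [i for i in range(n) if logpvec[i] is not None]
    # Visit variants from most to least significant; ties keep index order.
    order = sorted(scored, key=lambda i: -logpvec[i])

    removed = set()
    keep = [False] * n
    for i in order:
        if i in removed:
            continue
        keep[i] = True
        removed.update(neighbours[i])
    return keep


logp = [8.0, 3.0, 5.0, 1.0, None]
r2 = {(0, 1): 0.9, (1, 2): 0.5, (2, 3): 0.1}
print(greedy_prune(logp, r2))  # [True, False, True, True, False]
```

Variant 1 is dropped because it is linked to the stronger variant 0. Variant 2 survives because its only link above the threshold is to variant 1, which was already removed.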

Building LD Distributions

The LD workflow intentionally differs slightly by runtime:

Python builds ld_npz/
R prepares ld_npz/ in place
MATLAB/Octave converts ld_npz/ to ld_mat/

ld_npz/ is the canonical build and interchange format. R reads its .npz shards directly after adding reference_cache.rds; with extract_npz = TRUE, R can also create sibling *.npz.d extracted caches for faster repeated loads. MATLAB/Octave uses a separate ld_mat/ directory because its runtime sparse matrix representation is stored in .mat files.

python script/statgen_build_ld.py ... --shard SHARD builds one LD shard.
python script/statgen_create_ld_manifest.py --ld PATH finalizes a Python LD distribution.
statgen::prepare_ld_npz_for_r(npz_root) prepares a Python .npz LD distribution for R in place; pass extract_npz = TRUE to also create the sibling *.npz.d extracted caches.
statgen.convert_ld_npz_to_mat(input_root, output_root, shard) converts a Python LD shard for MATLAB.
statgen.create_ld_mat_manifest(input_root, output_root, shards) finalizes a MATLAB LD distribution after conversion.
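
The order of the Python steps above (one build per shard, then one manifest step) can be sketched as a command plan. This is a sketch under stated assumptions: the helper is hypothetical, the shard labels and the ld_npz path are examples, and the arguments elided as "..." in this README are not filled in. They must be passed in as build_args, so the plan below is printed rather than executed.

```python
# Sketch of the build order only; the "..." arguments elided in this README
# are NOT filled in here and must be supplied by the caller as build_args.
def plan_python_ld_build(build_args, shards, ld_path):
    """Return argv lists: one build per shard, then one manifest step."""
    commands = [
        ["python", "script/statgen_build_ld.py", *build_args, "--shard", shard]
        for shard in shards
    ]
    commands.append(
        ["python", "script/statgen_create_ld_manifest.py", "--ld", ld_path]
    )
    return commands


# build_args is left empty on purpose, so these lines are incomplete commands.
plan = plan_python_ld_build(build_args=[], shards=["21", "22", "X"], ld_path="ld_npz")
for argv in plan:
    print(" ".join(argv))
```

Keeping the manifest step last reflects that it finalizes the distribution after all shards exist.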

AnnotationPanel

load_annotations(bed_paths, reference) paints BED intervals onto a ReferencePanel; BED basenames become unique annotation names.
create_annotations(reference, annomat, annonames) creates an annotation panel from an already aligned matrix, not raw BED input.
panel.annomat exposes the reference-aligned annotation matrix.
panel.select_shards(shards) and panel.select_annotations(names) subset an annotation panel.
panel.union_annotations(other, mode="by_name") combines annotation panels.
panel.save_cache(path) and load_annotations_cache(...) save and reload painted annotation caches.
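
Painting BED intervals onto a reference amounts to marking which variants fall inside each file's intervals. The toy sketch below is not the load_annotations implementation. It assumes standard BED coordinates (0-based, half-open) against 1-based .bim positions, and it assumes that the annotation name is the basename with the .bed suffix removed. The function name and file paths are hypothetical.

```python
# Toy interval painting; not statgen's load_annotations implementation.
import os


def paint_annotations(variants, bed_intervals_by_file):
    """Build a binary variant-by-annotation matrix.

    variants              : list of (chr, bp) with 1-based bp, as in a .bim file.
    bed_intervals_by_file : {bed_path: [(chr, start, end), ...]} using BED's
                            0-based, half-open [start, end) coordinates.
    """
    # Assumed naming rule: "anno/coding.bed" -> "coding".
    names = [os.path.basename(p).rsplit(".bed", 1)[0] for p in bed_intervals_by_file]
    if len(set(names)) != len(names):
        raise ValueError("annotation names must be unique")

    annomat = [[0] * len(names) for _ in variants]
    for col, intervals in enumerate(bed_intervals_by_file.values()):
        for row, (chrom, bp) in enumerate(variants):
            # A 1-based position bp lies in [start, end) when start < bp <= end.
            if any(c == chrom and s < bp <= e for c, s, e in intervals):
                annomat[row][col] = 1
    return names, annomat


variants = [("1", 100), ("1", 101), ("1", 200), ("2", 150)]
beds = {
    "anno/coding.bed": [("1", 100, 200)],
    "anno/enhancer.bed": [("2", 0, 150), ("1", 99, 100)],
}
names, annomat = paint_annotations(variants, beds)
print(names)    # ['coding', 'enhancer']
print(annomat)  # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

Each BED file contributes one column, which matches the rule that multiple BED files are separate annotations rather than chromosome shards.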

GenotypePanel

load_genotype(bfile_prefix, reference) loads PLINK 1 genotype metadata and aligns it to a ReferencePanel. bfile_prefix may be a single prefix or an @-sharded prefix.
panel.fetch_genotypes(snp_indices) returns selected genotype calls without loading the full BED file. It defaults to raw PLINK-decoded calls; use the ploidy_scaled haploid mode to map male chrX ploidy 1 calls from 0/2 to 0/1. In MATLAB, snp_indices are one-based panel SNP indices; requested SNPs must be present in the genotype source.
panel.fetch_genotypes_int8(snp_indices) returns selected hardcalls as compact integer calls.
panel.is_present is a boolean mask in reference coordinates for variants present after alignment.
panel.fid, panel.iid, panel.sex, panel.is_male, and panel.is_female expose sample metadata.
panel.save_cache(path) and load_genotype_cache(...) save and reload genotype metadata caches for faster repeated loading.
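
The 0/2 to 0/1 mapping mentioned for fetch_genotypes can be illustrated on plain lists. This is a hypothetical sketch, not the statgen implementation. In particular, treating a heterozygous call on a ploidy-1 sample as missing is an assumption made here, not documented behavior.

```python
# Illustrative remapping only; not statgen's fetch_genotypes implementation.
def ploidy_scale(calls, ploidy):
    """Map haploid hardcalls from the diploid-style 0/2 coding to 0/1.

    calls  : per-sample hardcalls for one SNP (0, 1, 2, or None for missing).
    ploidy : per-sample ploidy for that SNP (1 or 2).
    """
    scaled = []
    for call, p in zip(calls, ploidy):
        if call is None or p == 2:
            scaled.append(call)       # diploid or missing: unchanged
        elif call in (0, 2):
            scaled.append(call // 2)  # haploid: 0 -> 0, 2 -> 1
        else:
            scaled.append(None)       # haploid heterozygote: missing (assumption)
    return scaled


calls = [0, 2, 1, 2, None, 1]
ploidy = [1, 1, 2, 2, 1, 1]
print(ploidy_scale(calls, ploidy))  # [0, 1, 1, 2, None, None]
```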

Sumstats

load_sumstats(path, reference) loads a .tsv.gz summary-statistics file and aligns rows by chr:bp:a1:a2.
Input files require p in addition to the chr, bp, a1, and a2 alignment columns; z and n are optional, and the loader warns when they are absent or incomplete. Common aliases such as POS, EffectAllele, and OtherAllele are accepted.
create_sumstats(reference, pvec, zvec=None, nvec=None, ...) creates a Sumstats object from already aligned vectors, not raw TSV input.
sumstats.logpvec, sumstats.zvec, and sumstats.nvec expose aligned vectors. Optional vectors are None in Python or empty in MATLAB when absent.
sumstats.is_present is a boolean mask in reference coordinates for variants present after alignment.
sumstats.select_shards(shards) subsets by shard label.
sumstats.save_cache(path) and load_sumstats_cache(...) save and reload aligned summary-statistics caches.
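
Alignment by chr:bp:a1:a2 and the role of is_present can be shown with a toy example. The sketch below is not the load_sumstats implementation. It assumes that logpvec holds -log10(p) and that only exact key matches align, since allele orientation is expected to be resolved upstream. The function name and data are hypothetical.

```python
# Toy alignment by chr:bp:a1:a2; not statgen's load_sumstats implementation.
import math


def align_pvalues(reference, rows):
    """Align p-values to reference order.

    reference : list of (chr, bp, a1, a2) tuples in reference order.
    rows      : list of dicts with keys chr, bp, a1, a2, p.
    Returns (logpvec, is_present), both the same length as the reference.
    """
    by_key = {(r["chr"], r["bp"], r["a1"], r["a2"]): r["p"] for r in rows}
    logpvec, is_present = [], []
    for key in reference:
        p = by_key.get(key)
        is_present.append(p is not None)
        # Assumption: logpvec holds -log10(p); absent variants stay None.
        logpvec.append(-math.log10(p) if p is not None else None)
    return logpvec, is_present


reference = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("X", 50, "G", "A")]
rows = [
    {"chr": "1", "bp": 200, "a1": "C", "a2": "T", "p": 1e-3},
    {"chr": "X", "bp": 50, "a1": "G", "a2": "A", "p": 0.5},
    {"chr": "2", "bp": 5, "a1": "A", "a2": "C", "p": 0.01},  # not in reference
]
logpvec, is_present = align_pvalues(reference, rows)
print(is_present)  # [False, True, True]
```

The output vectors always have the length of the reference, so a row that is not in the reference is dropped and a reference variant without a row is marked absent.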

License

statgen is licensed under the MIT License. Contributions are accepted under the same license. See LICENSE and CONTRIBUTING.md.

