statgen provides Python, MATLAB/Octave, and R tools for working with
reference-aligned statistical genetics data: reference variants, genotypes, LD,
annotations, and GWAS summary statistics.
All panel objects use reference coordinates. Genome build, contig naming, and allele orientation should be resolved upstream by genomatch or an equivalent pipeline.
- Preparing inputs and caches from PLINK, summary statistics, and BED files: Tutorial 1 builds an LD distribution and saves Python and MATLAB caches from source files.
- Running Python analyses: Tutorial 2 runs LD operations, prunes variants, and fetches genotypes.
- Running MATLAB analyses: Tutorial 3 runs the same analysis workflow in MATLAB.
- Running R analyses: Tutorial 4 runs the same workflow with the R package.
- Looking up MATLAB API help interactively: `help statgen`.
- Finding installation commands, core concepts, or the API map: Input Layouts, Concepts That Matter, Setup, and Key APIs.
- Following changes or getting help: Changelog and issue tracker.
- Developer docs for implementing or changing `statgen` behavior: spec/SPEC.md.
| Object | Meaning | Typical input |
|---|---|---|
| `ReferencePanel` | Ordered reference variant table | External PLINK `.bim`, either one file or an `@`-sharded path |
| `LDPanel` | Sparse LD distribution tied to a matching reference | Built LD distribution directory produced by statgen from PLINK bfiles |
| `AnnotationPanel` | Binary annotation matrix painted onto the reference | External BED files |
| `GenotypePanel` | Reference-aligned PLINK genotype metadata with on-demand hardcall access | External PLINK 1 bfile prefix, either one bfile or an `@`-sharded prefix |
| `Sumstats` | One aligned summary-statistics trait or source | External `.tsv.gz` with `chr`, `bp`, `a1`, `a2`, `p`, and optional columns such as `beta`, `se`, `z`, and `n` |

- Reference input may be one `.bim` file or an `@`-sharded `.bim` template.
- Genotype input may be one PLINK bfile prefix or an `@`-sharded bfile prefix.
- LD input is a distribution directory with `ld_manifest.json`, not a single shard file or an `@` template.
- Sumstats input is one TSV/TSV.GZ file per trait or source, not a sharded input.
- Annotation input is one BED file per annotation column; multiple BED files are annotations, not chromosome shards.
- Cache loaders read one cache file and may optionally return a logical shard subset.
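
The `@`-sharded layout above can be illustrated with a small path-expansion sketch. It assumes that `@` is a placeholder for the contig label (`1` to `22` and `X`); that reading is an assumption, and `expand_sharded_path` is a hypothetical helper, not part of the statgen API.

```python
# Hypothetical helper: expand an "@"-sharded template into per-contig paths.
# Assumption: "@" stands for the contig label; statgen's own rule may differ.
CONTIG_LABELS = [str(i) for i in range(1, 23)] + ["X"]


def expand_sharded_path(template, labels=CONTIG_LABELS):
    """Return a list of (label, path) pairs.

    A template without "@" is treated as one unsharded file with label None.
    """
    if "@" not in template:
        return [(None, template)]
    if template.count("@") != 1:
        raise ValueError(f"expected exactly one '@' in {template!r}")
    return [(label, template.replace("@", label)) for label in labels]


shards = expand_sharded_path("ref/panel.chr@.bim")
print(shards[0])   # first shard: label "1"
print(shards[-1])  # last shard: label "X"
```

A single-file input passes through unchanged, so the same call site can handle both layouts.
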
- All analysis objects are aligned to a `ReferencePanel`.
- Supported contig labels are `1`-`22` and `X`; labels are not normalized, so `chr1`/`chrX` inputs should be fixed upstream.
- LD is a directory distribution with an `ld_manifest.json` file, not a normal object cache.
- LD should be built from an unrelated, single-ancestry genotype reference panel appropriate for the downstream analyses.
- Python builds the canonical `ld_npz/` distribution. R prepares that same `.npz` distribution in place with an R reference-cache sidecar and optional extracted shard caches. MATLAB/Octave uses a separate `ld_mat/` distribution converted from the Python-built LD shards because it needs native sparse `.mat` files.
- LD panels preserve the full reference SNP axis. If a forced LD build contains monomorphic SNPs, LD involving those SNPs is undefined and represented by omitted sparse entries.
- Genotype caches store metadata and source BED paths, not dense genotypes.
- chrX LD sex-specific shards use FAM sex labels.
- Genotype `.ploidy` sidecars belong to genotype access, not LD reference inputs.
- Missing genotype `.ploidy` sidecars imply `(2, 2)` ploidy for autosomes and `(1, 2)` for chrX, with a warning for chrX. Provide `.ploidy` for PAR/non-PAR mixtures or nonstandard chrX encodings.
- Shard objects are internal; users work with panel objects and `select_shards`.
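
Two of the rules above, strict contig labels and the default ploidy when no `.ploidy` sidecar exists, can be sketched in a few lines. This is a minimal illustration, not statgen code: `check_contig` and `default_ploidy` are hypothetical names, and reading the pair as `(male, female)` ploidy is an assumption.

```python
import warnings

SUPPORTED_CONTIGS = {str(i) for i in range(1, 23)} | {"X"}


def check_contig(label):
    """Reject labels such as 'chr1'; nothing is normalized here."""
    if label not in SUPPORTED_CONTIGS:
        raise ValueError(f"unsupported contig label {label!r}; fix it upstream")
    return label


def default_ploidy(label):
    """Ploidy assumed when no .ploidy sidecar is available.

    Assumption: the pair is (male, female) ploidy.
    """
    if check_contig(label) == "X":
        warnings.warn("no .ploidy sidecar: assuming (1, 2) ploidy for chrX")
        return (1, 2)
    return (2, 2)


print(default_ploidy("7"))
```

The warning on chrX mirrors the documented behavior: the default is only safe when the data contain no PAR/non-PAR mixture and use a standard chrX encoding.
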
Install from a cloned repository:

    pip install -e python/

Then import APIs from the relevant submodule, for example:

    from statgen.reference import load_reference
    from statgen.sumstats import load_sumstats

From a cloned repository, add the MATLAB package folder to the path:

    addpath('/path/to/statgen/matlab')

Alternatively, download the MATLAB bundle from a GitHub release and add that bundle's `matlab` folder to the path.

MATLAB help text is available for public functions, public panel classes, and their methods:

    help statgen                              % package overview
    help statgen.load_reference               % function help
    help statgen.ReferencePanel               % class help
    help statgen.ReferencePanel.select_shards % method help
    help statgen.LDPanel.multiply_r2          % method help

Octave is supported for small-scale testing and CI compatibility checks. Use MATLAB for production-scale MATLAB workflows, especially large LD `.mat` distributions. In Octave, the package overview is available with `help statgen.Contents`.

- `load_reference(path)` loads a `.bim` file or an `@`-sharded `.bim` pattern.
- `panel.chr`, `panel.snp`, `panel.bp`, `panel.a1`, and `panel.a2` expose reference-coordinate variant fields.
- `panel.select_shards(shards)` subsets by shard label.
- `panel.is_object_compatible(other)` checks reference compatibility and returns `false` on mismatch rather than raising.
- `panel.save_cache(path)` and `load_reference_cache(...)` save and reload reference caches.
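
To make the reference fields concrete, the sketch below parses PLINK `.bim` text (six whitespace-delimited columns: chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2) into the five fields listed above. It also shows a compatibility check that returns a boolean instead of raising. `read_bim` and `is_compatible` are hypothetical helpers, and exact field-by-field equality is an assumed compatibility rule; the statgen check may compare something else.

```python
import io

FIELDS = ("chr", "snp", "bp", "a1", "a2")


def read_bim(handle):
    """Parse .bim lines into column lists, keeping file order."""
    panel = {name: [] for name in FIELDS}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        chrom, snp, _cm, bp, a1, a2 = line.split()
        panel["chr"].append(chrom)
        panel["snp"].append(snp)
        panel["bp"].append(int(bp))
        panel["a1"].append(a1)
        panel["a2"].append(a2)
    return panel


def is_compatible(panel, other):
    """Return False on any mismatch rather than raising (assumed rule)."""
    return all(panel[name] == other[name] for name in FIELDS)


BIM_TEXT = "1\trs1\t0\t100\tA\tG\n1\trs2\t0\t250\tC\tT\n"
ref = read_bim(io.StringIO(BIM_TEXT))
print(ref["snp"], ref["bp"])
```

Returning `False` lets callers branch on the result, which suits validation loops over many objects better than exception handling.
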
- `load_ld(path, shards=None, default_chrX_sex=None)` loads a sparse LD distribution and its manifest-declared reference cache.
- `load_ld_reference(path, shards=None)` loads only the manifest-declared reference cache from an LD distribution root.
- `validate_ld_distribution(path)` checks manifest, checksums, bundled references, and optionally sparse payload structure.
- `panel.a1freq(chrX_sex=None)` returns reference-aligned allele frequencies.
- `panel.multiply_r2(M, chrX_sex=None)` multiplies by LD r².
- `panel.select_shards(shards)` subsets by shard label.
- `fast_prune(logpvec, ld_panel, r2_threshold=0.2, chrX_sex=None)` performs LD pruning using aligned scores.
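
The two numerical operations above, multiplying by sparse r² and pruning by score, can be sketched with plain Python containers. This is an illustration under stated assumptions, not the statgen implementation. Omitted sparse entries are treated as zero, the diagonal is set to 1, and pruning uses a common greedy scheme: keep the best-scoring variant, then drop every neighbor with r² at or above the threshold. The actual `fast_prune` algorithm, its tie-breaking, and its handling of missing scores may differ. All function names here are hypothetical.

```python
def symmetric_r2(n, pairs):
    """Build sparse rows {column: r2} from (i, j, r2) pairs, diagonal = 1."""
    rows = [{i: 1.0} for i in range(n)]
    for i, j, value in pairs:
        rows[i][j] = value
        rows[j][i] = value
    return rows


def multiply_r2(rows, vec):
    """Sparse matrix-vector product; absent entries contribute nothing."""
    return [sum(value * vec[j] for j, value in row.items()) for row in rows]


def greedy_prune(logpvec, rows, r2_threshold=0.2):
    """Return sorted indices of kept variants (assumes no missing scores)."""
    order = sorted(range(len(logpvec)), key=lambda i: -logpvec[i])
    kept, removed = [], set()
    for i in order:
        if i in removed:
            continue
        kept.append(i)
        for j, value in rows[i].items():
            if j != i and value >= r2_threshold:
                removed.add(j)  # in LD with a better-scoring kept variant
    return sorted(kept)


rows = symmetric_r2(4, [(0, 1, 0.8), (1, 2, 0.3), (2, 3, 0.1)])
print(multiply_r2(rows, [1.0, 1.0, 1.0, 1.0]))
print(greedy_prune([5.0, 7.0, 6.0, 1.0], rows))
```

In the example, variant 1 has the highest score and removes variants 0 and 2; variant 3 survives because its only LD entry (0.1) is below the threshold.
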
The LD workflow intentionally differs slightly by runtime:

    Python        builds ld_npz/
    R             prepares ld_npz/ in place
    MATLAB/Octave converts ld_npz/ to ld_mat/

ld_npz/ is the canonical build and interchange format. R reads its .npz
shards directly after adding reference_cache.rds; with extract_npz = TRUE,
R can also create sibling *.npz.d extracted caches for faster repeated loads.
MATLAB/Octave uses a separate ld_mat/ directory because its runtime sparse
matrix representation is stored in .mat files.
- `python script/statgen_build_ld.py ... --shard SHARD` builds one LD shard.
- `python script/statgen_create_ld_manifest.py --ld PATH` finalizes a Python LD distribution.
- R loads the Python `.npz` LD distribution directly after one R-specific preparation step: `statgen::prepare_ld_npz_for_r(npz_root)`. Use `extract_npz = TRUE` to create sibling `*.npz.d` extracted caches for faster repeated R loads.
- `statgen.convert_ld_npz_to_mat(input_root, output_root, shard)` converts a Python LD shard for MATLAB.
- `statgen.create_ld_mat_manifest(input_root, output_root, shards)` finalizes a MATLAB LD distribution after conversion.
- `load_annotations(bed_paths, reference)` paints BED intervals onto a `ReferencePanel`; BED basenames become unique annotation names.
- `create_annotations(reference, annomat, annonames)` creates an annotation panel from an already aligned matrix, not raw BED input.
- `panel.annomat` exposes the reference-aligned annotation matrix.
- `panel.select_shards(shards)` and `panel.select_annotations(names)` subset an annotation panel.
- `panel.union_annotations(other, mode="by_name")` combines annotation panels.
- `panel.save_cache(path)` and `load_annotations_cache(...)` save and reload painted annotation caches.
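
"Painting" a BED file onto the reference amounts to marking each reference variant that falls inside any interval. The sketch below shows one annotation column. It relies on the standard BED convention of 0-based, half-open intervals against 1-based `.bim` positions, so a variant at `bp` is covered when `start < bp <= end`. `paint_annotation` is a hypothetical name, and the linear scan is chosen for clarity rather than speed.

```python
def paint_annotation(ref_chr, ref_bp, intervals):
    """Return one boolean per reference variant.

    intervals: iterable of (chrom, start, end) in BED coordinates
    (0-based start, exclusive end). Contig labels must already match
    the reference labels; nothing is normalized here.
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    column = []
    for chrom, bp in zip(ref_chr, ref_bp):
        # 1-based position bp corresponds to 0-based base bp - 1
        covered = any(start < bp <= end for start, end in by_chrom.get(chrom, ()))
        column.append(covered)
    return column


ref_chr = ["1", "1", "1", "2"]
ref_bp = [100, 101, 200, 150]
bed = [("1", 100, 200), ("2", 0, 100)]
print(paint_annotation(ref_chr, ref_bp, bed))
```

One such column per BED file, stacked side by side, gives a binary matrix with one row per reference variant.
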
- `load_genotype(bfile_prefix, reference)` loads PLINK 1 genotype metadata and aligns it to a `ReferencePanel`. `bfile_prefix` may be a single prefix or an `@`-sharded prefix.
- `panel.fetch_genotypes(snp_indices)` returns selected genotype calls without loading the full BED file. It defaults to raw PLINK-decoded calls; use the `ploidy_scaled` haploid mode to map male chrX ploidy `1` calls from `0/2` to `0/1`. In MATLAB, `snp_indices` are one-based panel SNP indices; requested SNPs must be present in the genotype source.
- `panel.fetch_genotypes_int8(snp_indices)` returns selected hardcalls as compact integer calls.
- `panel.is_present` is a boolean mask in reference coordinates for variants present after alignment.
- `panel.fid`, `panel.iid`, `panel.sex`, `panel.is_male`, and `panel.is_female` expose sample metadata.
- `panel.save_cache(path)` and `load_genotype_cache(...)` save and reload genotype metadata caches for faster repeated loading.
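
On-demand access works because a SNP-major PLINK 1 `.bed` file stores each variant in a fixed-size block: three magic bytes, then `ceil(n_samples / 4)` bytes per variant, two bits per sample with the lowest-order bits first. The sketch below seeks straight to one variant and decodes it. Decoding to allele-1 counts with `-1` for missing, the zero-based index, and the helper names `fetch_calls` and `scale_haploid` are illustrative assumptions; they are not statgen's encoding or API.

```python
import io

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 SNP-major header
# 2-bit code -> allele-1 count (assumed output convention), -1 = missing
CODE_TO_A1_COUNT = {0b00: 2, 0b01: -1, 0b10: 1, 0b11: 0}


def fetch_calls(bed, n_samples, snp_index):
    """Decode one variant (zero-based index) without reading the whole file."""
    bytes_per_snp = (n_samples + 3) // 4
    bed.seek(0)
    if bed.read(3) != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK 1 .bed file")
    bed.seek(3 + snp_index * bytes_per_snp)
    block = bed.read(bytes_per_snp)
    if len(block) != bytes_per_snp:
        raise IndexError("snp_index is past the end of the file")
    return [
        CODE_TO_A1_COUNT[(block[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(n_samples)
    ]


def scale_haploid(calls, is_male):
    """Map male haploid calls from 0/2 to 0/1; other calls are unchanged."""
    return [1 if (male and c == 2) else c for c, male in zip(calls, is_male)]


# Two variants, five samples, built by hand for the example.
bed = io.BytesIO(BED_MAGIC + bytes([0x78, 0x00, 0x8F, 0x02]))
calls = fetch_calls(bed, n_samples=5, snp_index=0)
print(calls)
print(scale_haploid(calls, [True, False, False, False, True]))
```

Because the block offset is a simple multiple of the per-variant size, fetching a handful of variants costs a few seeks regardless of file size.
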
- `load_sumstats(path, reference)` loads a `.tsv.gz` summary-statistics file and aligns rows by `chr:bp:a1:a2`.
- Input files require `p`; `z` and `n` are optional and warn when absent or incomplete. Common aliases such as `POS`, `EffectAllele`, and `OtherAllele` are accepted.
- `create_sumstats(reference, pvec, zvec=None, nvec=None, ...)` creates a `Sumstats` object from already aligned vectors, not raw TSV input.
- `sumstats.logpvec`, `sumstats.zvec`, and `sumstats.nvec` expose aligned vectors. Optional vectors are `None` in Python or empty in MATLAB when absent.
- `sumstats.is_present` is a boolean mask in reference coordinates for variants present after alignment.
- `sumstats.select_shards(shards)` subsets by shard label.
- `sumstats.save_cache(path)` and `load_sumstats_cache(...)` save and reload aligned summary-statistics caches.
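
Alignment by `chr:bp:a1:a2` can be pictured as a keyed lookup into the reference order. The sketch below fills a reference-length vector and a presence mask from unordered rows. It makes several assumptions: keys must match exactly (no allele flipping, in line with orientation being resolved upstream), `logpvec` holds `-log10(p)`, and absent variants are stored as NaN. `align_sumstats` is a hypothetical helper, and rows with `p == 0` would need separate handling.

```python
import math


def align_sumstats(reference_keys, rows):
    """Return (logpvec, is_present), both in reference order.

    reference_keys: list of "chr:bp:a1:a2" strings, one per reference variant.
    rows: iterable of dicts with chr, bp, a1, a2, and p.
    """
    index = {key: i for i, key in enumerate(reference_keys)}
    logpvec = [math.nan] * len(reference_keys)
    is_present = [False] * len(reference_keys)
    for row in rows:
        key = f"{row['chr']}:{row['bp']}:{row['a1']}:{row['a2']}"
        i = index.get(key)
        if i is None:
            continue  # variant not in the reference: dropped
        logpvec[i] = -math.log10(row["p"])
        is_present[i] = True
    return logpvec, is_present


reference_keys = ["1:100:A:G", "1:250:C:T", "2:75:G:A"]
rows = [
    {"chr": "2", "bp": 75, "a1": "G", "a2": "A", "p": 0.001},
    {"chr": "1", "bp": 100, "a1": "A", "a2": "G", "p": 0.01},
    {"chr": "3", "bp": 9, "a1": "T", "a2": "C", "p": 0.5},
]
logpvec, is_present = align_sumstats(reference_keys, rows)
print(is_present)
```

Keeping every vector at reference length is what lets sumstats, annotations, and LD be combined by position without further joins.
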
statgen is licensed under the MIT License. Contributions are accepted under
the same license. See LICENSE and CONTRIBUTING.md.