# -*- coding: utf-8 -*-
from setuptools import setup
packages = \
['admixfrog',
'admixfrog.frog',
'admixfrog.gll',
'admixfrog.slug',
'admixfrog.utils',
'admixfrog.utils.pgdirect']
package_data = \
{'': ['*']}
install_requires = \
['Cython>=3.0',
'numba>=0.53',
'numpy>=1.23',
'pandas>=2.0',
'pybedtools>=0.10',
'pysam>=0.15.0',
'pyyaml>=5.0',
'scipy>=1.1',
'setuptools>=75']
entry_points = \
{'console_scripts': ['admixfrog = admixfrog:run_frog',
'admixfrog-bam = admixfrog:bam',
'admixfrog-bam2 = admixfrog:bam2',
'admixfrog-profile = admixfrog:profile_frog',
'admixfrog-ref = admixfrog:do_ref',
'admixfrog-rle = admixfrog:do_rle',
'admixslug = admixfrog:run_sfs',
'admixslug-profile = admixfrog:profile_slug']}
setup_kwargs = {
'name': 'admixfrog',
'version': '0.7.4.post1.dev0+8e81ee7',
'description': 'Local Ancestry Inference for low-coverage ancient DNA data',
'long_description': '# Admixfrog\nAdmixfrog is an HMM to infer ancestry frogments (fragments) from low-coverage, contaminated data. \n\nBriefly, we try to fit the allele frequency at each genomic position in a _target_ by\ncomparing it with a number of _sources_. In the motivating example, the target would be a\nmodern human, and the sources would be modern humans (AFR), Neandertals (NEA) or\nDenisovans (DEN).\n\nWe fit a hidden Markov model across the genome, with the hidden states being all possible\ncombinations of ancestry between one or two sources.\n\n## Installation\nRequires `python3.8+`\nInstall dependencies:\n```\npip install cython scipy --upgrade\n```\nInstall `admixfrog` (from github):\n```\npip install git+https://github.com/benjaminpeter/admixfrog@0.7.3\n```\n\nInstall `admixfrog` (from source directory):\n```\npip install .\n```\n\n## Data\nAdmixfrog works with (binary-only) eigenstrat, vcf and bam files. Supplementary files are typically in yaml-format. The bam-file is used for the _target_ individual if genotypes are unknown. If genotypes are known, they can be specified in either the eigenstrat or vcf format. In addition, a set of references is required. These too are specified in the reference.\n\n## Quickstart\nTo get things started, consider an analysis where we would like to learn the local Human, Neandertal and Denisovan ancestry of the Oase1 specimen:\n\n```admixfrog --gfile data/oase --target Oase1_d --states NEA=Vindija.DG+Altai.DG YRI=Yoruba.DG Denisova.DG --cont YRI --out quickstart```\nThis will do the following:\n\n1. `--gfile data/oase`: read the file data/oase.geno|snp|ind (eigenstrat-format)\n2. `--target Oase1_d`: use the sample named `Oase1_d` as the target\n3. 
with `--states NEA=Vindija.DG+Altai.DG YRI=Yoruba.DG Denisova.DG` we declare the three sources: a) combine the Vindija and Altai populations from the file (third column in the `.ind` file) into a population named NEA, b) use the population Yoruba.DG, but rename it to YRI and c) Denisova.DG is the third possible source\n4. `--cont YRI` designates YRI as a proxy for the contaminant. If there is no contamination, estimating it can be disabled using the `--c0 0 --dont-est-contamination` flags.\n5. `--out quickstart`: a prefix for all output files\n\n## Running the program\nFor most analyses, it is useful to generate the reference-file and target-file before running the main analysis. This is because parsing these files is quite time-consuming, and is not needed for replicate analyses. However, this is not required, and the program will perform all steps automatically if needed.\n\n\nThus, we might run these three commands:\n```\nmkdir res/\nadmixfrog-ref --out res/ref_example.xz --vcf-ref data/oase.vcf.gz \\\n --state-file data/pops.yaml \\\n --rec-file data/maps_chr.9 \\\n --states AFR NEA=Altai_snpAD.DG \\\n --map-id AA_Map deCODE COMBINED_LD \\\n --default-map AA_Map \\\n --chroms 9 \n```\nTo create the target file, we might run\n```\nadmixfrog-bam --bam data/oase_chr9.bam --ref ref_example.xz --out oase_example.in.xz \n```\nand finally, the analysis can be run using\n```\nadmixfrog --infile oase_example.in.xz --ref ref_example.xz --out example1 -b 10000 \\\n --states AFR NEA --contamination AFR\n```\n\nThe most useful command is `admixfrog --help`, which gives an up-to-date summary of all the parameters.\n\n\n\nThere are a few optional parameters; the most important are\n - `-b` the bin size (in 10^-6 cM), when using a recombination map, or in bp when running without (using `-P`)\n - `--ancestral`: a taxon that specifies the ancestral allele (must be in the\n reference file)\n - `--states`: the potential admixture sources. 
(Must be in the reference)\n - `--contamination`: the source of contamination. (Must also be in the reference)\n\nFor other parameters, see below or type `admixfrog --help`\n\nThere are also utilities to create the input file (from a bam file) and the reference file (from a vcf file) \nfrom standard formats. These can be called using `admixfrog-bam` or\n`admixfrog-ref`, respectively. Their arguments are also accepted by the main \n`admixfrog` program. However, as parsing and creating these files typically takes much\nlonger than running admixfrog, I recommend generating them first.\n\n\nThe input file is optionally generated from a bam-file:\n```\nadmixfrog --bamfile {x}.bam --ref {y}.ref.xz --out {z} -b 10000 --ancestral PAN --states AFR NEA DEN\n```\nbut this takes quite long for high-coverage genomes.\n\n### Creating the Reference File:\nThe reference file for `admixfrog` can be created from an (indexed) vcf-file using the \n`admixfrog-ref` subprogram:\n```bash\n admixfrog-ref --vcf x_{CHROM}.vcf.gz --out x.ref.xz \\\n --states AFR VIN=Vindija33.19 DEN=Denisova \\\n --pop-file data.yaml \\\n --rec-file rec.{CHROM}\n```\n\nThe options are:\n - `--vcf` : an indexed file in vcf format. Non-biallelic variants are\n skipped, but everything else is used. Hence, filtering should be done on this file. Use the wildcard `{CHROM}` if files are split by chromosome\n - `--out` : the name of the output file\n - `--states` : the names of the states, which will be used as sources of\n admixture, contamination and ancestral alleles. By convention I use\n all-caps, 3-4 letter abbreviations. There are three possibilities:\n\n 1. a population defined in the `pop file`\n 2. a sample name from the vcf file. This will create a single-sample\n reference with the same name as the sample\n 3. a string of the form `NEA=Altai,Vindija33.19`. 
This will create a \n reference named NEA from the samples `Altai` and `Vindija33.19`\n\n - `--pop-file`: A `yaml`-format file that defines\n which samples are in which population, and which samples are\n (pseudo)-haploid\n\n - `--rec-file` A file specifying the recombination map. I use the file from here: [https://www.well.ox.ac.uk/~anjali/AAmap/](https://www.well.ox.ac.uk/~anjali/AAmap/)\n\n#### File Format Specification\nThe reference file has the following columns:\n- `chrom` is the chromosome (or contig) id\n- `pos` is the physical position of this chromosome\n- `ref`, `alt` are the two alleles present at this locus\n- `map`, is the genetic position (in cM)\n- a number of pairs of `{ID}_alt`, `{ID}_ref` that give the number of \n non-reference and reference alleles observed for reference `{ID}`,\nrespectively.\n\n```\n chrom,pos,ref,alt,map,AFK_alt,AFR_alt,ALT_alt,CHA_alt,DEN_alt,EAS_alt,EUR_alt,NEA_alt,PAN_alt,UST_alt,VIN_alt,AFK_ref,AFR_ref,ALT_ref,CHA_ref,DEN_ref,EAS_ref,EUR_ref,NEA_ref,PAN_ref,UST_ref,VIN_ref\n 1,570094,G,A,0,20,0,0,0,2,0,0,0,2,0,0,394,2,2,2,0,10,36,6,0,2,2\n 1,714019,A,G,0,168,11,2,2,2,0,0,6,2,0,2,246,27,0,0,0,54,118,0,0,2,0\n 1,724289,C,A,0,0,0,1,0,0,0,0,1,0,0,0,414,80,1,2,2,94,148,5,2,2,2\n 1,724290,A,C,0,0,0,1,0,0,0,0,1,0,0,0,414,80,1,2,2,94,148,5,2,2,2\n 1,725389,C,T,0,5,0,2,2,1,0,0,6,2,0,2,409,0,0,0,1,0,0,0,0,2,0\n```\n\n#### Population file format\nI use `yaml`-formatted files to define populations, as they are an easily\nreadable data storage format. The format specification is as follows:\nThe `sampleset`-section defines sources. For example, below we make a source\npanel containing the two Neandertals (AltaiNeandertal and Vindija33.19), and a \nsource named `EUR` containing three individuals from the SGDP data set. 
Finally,\nI create a panel named `ANC` which contains the aligned chimp (`panTro4`)\nsequence.\n\nIn addition, I designate two samples (`panTro4` and `Denisova11`) as\npseudo-haploid by listing them under `pseudo_haploid`. For the outgroup\n`panTro4`, this is because we do not care about within-chimp variation, and for\n`Denisova11`, because it is a low-coverage genome and we cannot get confident\ngenotype calls.\n\n \n```yaml\n sampleset: \n NEA: \n - AltaiNeandertal\n - Vindija33.19 \n EUR: \n - "B_Crete-1" \n - "B_Crete-2" \n - "B_French-3" \n ANC: \n - panTro4 \n \n pseudo_haploid: \n - Denisova11\n - panTro4 \n```\n\n\n### Creating the Input File:\nThe input file for `admixfrog` can be created from a bam-file using the \n`admixfrog-bam` subprogram:\n\n```\n admixfrog-bam --bam {x}.bam --ref {y}.ref.xz --deam-cutoff 3 --length-bin-size 35 --out {x}.in.xz\n```\nThis will create a file named `{x}.in.xz` in admixfrog input format from\n`{x}.bam`. The data will be ascertained on the sites in `{y}.ref.xz`. Reads with\na deamination (C-\\>T) in strand direction in the first 3 bases will be considered\nseparately for purposes of contamination estimation. Reads will also be binned\nin bins of size 35bp for contamination estimation.\n\n\n##### File Format\nThe infile has 4 mandatory columns, called `chrom`, `pos`, `tref` and `talt`. `lib` is optional.\n\nThe columns are\n\n - `chrom` is the chromosome (or contig) id\n - `pos` is the physical position on this chromosome\n - `lib` is a library/read group id. 
Reads are split by `lib` for contamination\n estimates\n - `tref`, `talt` are the number of reference and non-reference reads observed for\n this position.\n\n```\n chrom,pos,lib,tref,talt\n 1,570094,L5733_0_deam,1,0\n 1,570094,R9873_0_deam,1,0\n 1,714019,R9880_2_nodeam,0,1\n 1,724289,L5736_0_nodeam,1,0\n 1,724289,L5736_1_nodeam,1,0\n 1,724289,L5734_0_nodeam,1,0\n 1,724290,L5736_0_nodeam,1,0\n 1,724290,L5736_1_nodeam,1,0\n 1,724290,L5734_0_nodeam,1,0\n```\n\n\n\n#### Visualization\nA simple visualization is \n```R\n library(tidyverse)\n a = read_csv("admixfrog/5000/AFR_VIN_DEN/Papuan_archaicadmixture.bin.xz")\n a %>% gather(k, v, -chrom:-n_snps) %>% \n filter(k!="AFR", v>.1) %>%\n ggplot(aes(x=map, y=v, fill=k)) + geom_col() + \n facet_wrap(~chrom, ncol=1, strip=\'l\')\n```\n\n## Output\nThere are currently six output files. All except `*.pars.yaml` are compressed with LZMA.\n - `*.cont.xz` : contamination estimates for each read group\n - `*.bin.xz` : posterior decoding for each bin along the genome\n - `*.snp.xz` : posterior genotype likelihoods for each SNP, taking contamination into\n account\n - `*.pars.yaml` : parameter estimates\n - `*.rle.xz` : called runs of ancestry\n - `*.res.xz` : simulated runs of ancestry\n\n### Contamination estimates (admixfrog.cont.xz)\nThe contamination and error estimates are in an xz-compressed csv format and\nwill look like this:\n```\nlib,cont,error,rg,len_bin,deam,n_snps \nSR_nodeam,0.356971,0.010000,SR_nodeam,0,NA,467 \nSR_deam,0.000554,0.010000,SR_deam,0,NA,76 \n```\n\nEach row represents a subset of reads for which error and contamination rates are estimated\nindependently. The columns are\n\n- `lib` : a unique string used to group reads. 
This can be any value, but the\n program tries to split the string according to the format `{rg}_{len_bin}_{deam}`.\nIf present in this way, the corresponding columns will be filled\n- `rg` : Read group\n- `len_bin` : Length-bin\n- `deam` : whether reads have a terminal deamination\n- `cont` : contamination estimate\n- `error` : sequencing error estimate\n- `n_snps `: how many reads are in this class\n\n\n### Posterior decoding (admixfrog.bin.xz)\nThe posterior decoding is in xz-compressed csv format and will look like this\n```\nchrom,map,pos,id,haploid,viterbi,n_snps,AFK,ARC,AFKARC \n9,200000.000000,281845,0,False,AFK,1,0.541009,0.208888,0.250103\n9,300000.000000,300000,1,False,AFK,0,0.540493,0.205282,0.254225\n9,400000.000000,400000,2,False,AFK,0,0.539910,0.200351,0.259739\n```\nEach row represents a bin used in the HMM-algorithm, and the columns are\n\n - `chrom`: chromosome of bin\n - `map` : map (genetic) coordinate of lower bin boundary\n - `pos` : physical coordinate of lower bin boundary\n - `id` : id of bin (unique number, starting from 0, ordered along chromosome)\n - `haploid` : flag set to True if bin is haploid\n - `viterbi` : Viterbi (Maximum-likelihood) decoding of bin state\n - `n_snps` : number of observed SNP present in bin\n\nthe remaining columns (`AFK`, `ARC`, `AFKARC` in the example) give the posterior probability\nfor the bin being in a given state. The number of columns will vary according to the references\nused, and their values sum up to 1. In the example, there are two homozygous states (`AFK`, `ARC`)\nand a heterozygous state `AFKARC`, designated by a concatenation of the two\nstrings.\n\n### Posterior genotype likelihood (admixfrog.snp.xz)\nResults by SNP. 
xz-compressed csv format.\n\n```\nchrom,pos,map,tref,talt,G0,G1,G2,p,bin \n9,281845,281845,1,0,-0.409389,-2.114613,-2.795105,0.005443,0 \n9,635998,635998,0,1,-1.744065,-0.723350,-0.713062,0.288156,4 \n9,660473,660473,1,0,-0.401219,-2.487784,-3.318218,0.002107,4 \n9,1004958,1004958,0,1,-3.356600,-0.726711,-0.530971,0.388274,8 \n9,1463080,1463080,1,0,-0.361344,-2.485787,-4.174832,0.001701,12\n```\n\nEach row is a SNP\n\n - `chrom`: chromosome SNP is on\n - `pos`: physical position of SNP\n - `map`: genetic position of SNP\n - `tref`: number of reference reads at SNP\n - `talt`: number of alt reads at SNP\n - `G0,G1,G2`: log10-likelihood of SNP state 0, 1, 2\n - `p`: estimated allele frequency of derived allele\n - `bin`: bin-id this SNP is in\n\n### Posterior samples (admixfrog.res.xz)\nSamples of the posterior given the learned parameters and data are given in xz-compressed csv format and will look like this\n```\nlen,start,end,state,it,chrom \n1,0,1,AFK,0,9 \n1,0,1,AFK,0,9 \n16,6,22,AFK,0,9 \n8,26,34,AFK,0,9 \n7,36,43,AFK,0,9 \n5,1,6,ARC,0,9 \n25,1,26,ARC,0,9 \n2,34,36,ARC,0,9 \n1,43,44,ARC,0,9 \n\n```\nEach row represents a segment in the same state, and the columns are:\n- `len` : Length (in bins) of segment\n- `start` : Start(id) of segment\n- `end` : End(id) of segment\n- `state` : State of segment\n- `it` : iteration / sample number of posterior sample\n- `chrom` : chromosome sampled\n\nFor example, the above snippet shows the 0th iteration for chromosome 9: \nthe first bin is homozygous for the `AFK` state, then two segments, one 5 bins\nlong and one 25 bins long, start in the `ARC` state.\n\n### Estimated introgressed fragments (admixfrog.rle.xz)\n\nCalled introgressed tracts. Calls are done in two formats: \n1. `state` refers to calls where tracts are continued regardless of whether they\n are homozygous or heterozygous\n2. 
`het` and `homo` designate runs that are strictly heterozygous or homozygous\n\n```\nchrom,start,end,score,target,type,map,pos,id,map_end,pos_end,id_end,len,map_len,pos_len,nscore\n9,154,156,0.145364,AFKARC,het,15600000.000000,15600000,154,15800000.000000,15800000,156,2,200000.000000,200000,0.072682\n9,1216,1404,28.729700,AFK,state,121800000.000000,121800000,1216,140600000.000000,140600000,1404,188,18800000.000000,18800000,0.152818\n9,1187,1193,0.223771,AFK,state,118900000.000000,118900000,1187,119500000.000000,119500000,1193,6,600000.000000,600000,0.037295\n9,250,919,78.011711,AFK,state,25200000.000000,25200000,250,92100000.000000,92100000,919,669,66900000.000000,66900000,0.116609\n```\nEach row represents a segment in the same state, and the columns are:\n- `chrom` : Chromosome the segment is on\n- `score`, `nscore` : Numerical score giving certainty of fragment,\n unnormalized or normalized by bin size\n- `map`, `map_end`, `map_len` : start, end and length in genetic map\n- `pos`, `pos_end`, `pos_len` : start, end and length in physical map\n- `start`, `end`, `len` : start, end and length in Bin id\n- `type` : type of segment call (zygosity vs simple state)\n- `target` : target state for the segment\n\n\n### Other parameters (admixfrog.pars.yaml)\nIn yaml format\n\n- `gamma_names`: names of states. 
All other parameters are given in this order\n- `F`, `tau`: estimates of drift parameters per homozygous state\n- `alpha0, alpha0_hap`: stationary probabilities for diploid and haploid states,\n respectively\n- `trans`, `trans_hap`: diploid and haploid transition probabilities\n- `error` : error estimates\n- `cont`: contamination estimates\n- `sex` : assumed sex of individual\n\n\n\n## Documentation\nFull documentation is not yet available; this is a dump of the help file for now.\nChances are that `admixfrog --help` will give more up-to-date info\n\nFor the detailed description of the algorithm, see [docs/admixfrog.pdf](docs/admixfrog.pdf)\n\n\n\n\n## Contact\nBenjamin Peter [benjamin_peter@eva.mpg.de](mailto:benjamin_peter@eva.mpg.de)\n\n```\nusage: admixfrog [-h] [-v] [--target-file TARGET_FILE] [--ref REF_FILES]\n [--filter-delta FILTER_DELTA] [--filter-pos FILTER_POS]\n [--filter-map FILTER_MAP] [--male] [--female]\n [--bamfile BAMFILE] [--force-target-file]\n [--deam-cutoff DEAM_CUTOFF] [--minmapq MINMAPQ]\n [--length-bin-size LENGTH_BIN_SIZE] [--vcfgt VCFGT]\n [--target TARGET] [--geno-file GENO_FILE] [--guess-ploidy]\n [--dont-est-contamination] [--est-error]\n [--freq-contamination FREQ_CONTAMINATION] [--est-F]\n [--est-tau] [--freq-F FREQ_F] [--est-inbreeding]\n [--F0 [F0 [F0 ...]]] [--tau0 [TAU0 [TAU0 ...]]] [--e0 E0]\n [--c0 C0] [--gt-mode] [-b BIN_SIZE] [--prior PRIOR] [-P]\n [--max-iter MAX_ITER] [--ll-tol LL_TOL] [--dont-split-lib]\n [--autosomes-only] [--downsample DOWNSAMPLE]\n [--init-guess [INIT_GUESS [INIT_GUESS ...]]]\n [--vcf-ref VCF_REF] [--rec-file REC_FILE]\n [--rec-rate REC_RATE] [--pos-id POS_ID] [--map-id MAP_ID]\n [--chroms CHROMS] [--force-ref] [--run-penalty RUN_PENALTY]\n [--n-post-replicates N_POST_REPLICATES] [--outname OUTNAME]\n [--no-rle] [--no-snp] [--no-bin] [--no-cont] [--no-rsim]\n [--no-pars] [--states [STATES [STATES ...]]]\n [--state-file STATE_FILE] [--cont-id CONT_ID]\n [--ancestral ANCESTRAL]\n\nInfer admixture frogments from 
low-coverage and contaminated genomes\n\noptional arguments:\n -h, --help show this help message and exit\n -v, --version show program\'s version number and exit\n --target-file TARGET_FILE, --infile TARGET_FILE, --in TARGET_FILE\n Sample input file (csv). Contains individual specific\n data, obtained from a bam file. - Fields are chrom,\n pos, map, lib, tref, talt" - chrom: chromosome - pos :\n physical position (int) - map : rec position (float) -\n lib : read group. Any string, same string assumes same\n contamination - tref : number of reference reads\n observed - talt: number of alt reads observed\n --ref REF_FILES, --ref-file REF_FILES\n refernce input file (csv). - Fields are chrom, pos,\n ref, alt, map, X_alt, X_ref - chrom: chromosome - pos\n : physical position (int) - ref : refrence allele -\n alt : alternative allele - map : rec position (float)\n - X_alt, X_ref : alt/ref alleles from any number of\n sources / contaminant populations. these are used\n later in --cont-id and --state-id flags\n --filter-delta FILTER_DELTA\n only use sites with allele frequency difference bigger\n than DELTA (default off)\n --filter-pos FILTER_POS\n greedily prune sites to be at least POS positions\n apart\n --filter-map FILTER_MAP\n greedily prune sites to be at least MAP recombination\n distance apart\n --male Assumes haploid X chromosome. Default is guess from\n coverage. currently broken\n --female Assumes diploid X chromosome. Default is guess from\n coverage\n --vcfgt VCFGT, --vcf-gt VCFGT, --vcf-target_file VCFGT\n VCF input file. To generate input format for admixfrog\n in genotype mode, use this.\n --target TARGET, --sample-id TARGET\n sample id if target is read from vcf or geno file. No\n effect for bam-file\n --chroms CHROMS, --chromosome-files CHROMS\n The chromosomes to be used in vcf-mode.\n --states [STATES [STATES ...]], --state-ids [STATES [STATES ...]]\n the allowed sources. 
The target will be made of a mix\n of all homozygous and heterozygous combinations of\n states. More than 4 or 5 sources have not been tested\n and are not recommended. Must be present in the ref\n file\n --state-file STATE_FILE, --pop-file STATE_FILE\n Population assignments (yaml format)\n --cont-id CONT_ID, --cont CONT_ID\n the source of contamination. Must be specified in ref\n file\n --ancestral ANCESTRAL, -a ANCESTRAL\n Outgroup population with the ancestral allele. By\n default, assume ancestral allele is unknown\n\nbam parsing:\n --bamfile BAMFILE, --bam BAMFILE\n Bam File to process. Choose this or target_file. The\n resulting input file will be writen in {out}.in.xz, so\n it doesn\'t need to be regenerated. If the input file\n exists, an error is generated unless --force-target-\n file is set\n --force-target-file, --force-bam, --force-infile\n --deam-cutoff DEAM_CUTOFF\n reads with deamination in positions < deam-cutoff are\n considered separately\n --minmapq MINMAPQ reads with mapq < MINMAPQ are removed\n --length-bin-size LENGTH_BIN_SIZE\n if set, reads are binned by length for contamination\n estimation\n\ngeno (Eigenstrat/Admixtools/Reich) format\n parser options:\n --geno-file GENO_FILE, --gfile GENO_FILE\n geno file name (without extension, expects\n .snp/.ind/.geno files). Only reads binary format for\n now\n --guess-ploidy guess ploidy of individuals (use if e.g. random read\n sample inds are present)\n\noptions that control estimation of model\n parameters:\n --dont-est-contamination\n Don\'t estimate contamination (default do)\n --est-error estimate sequencing error per rg\n --freq-contamination FREQ_CONTAMINATION, --fc FREQ_CONTAMINATION\n update frequency for contamination/error (default 1)\n --est-F, -f Estimate F (distance from ref, default False)\n --est-tau, -tau Estimate tau (population structure in references)\n --freq-F FREQ_F, --f FREQ_F\n update frequency for F (default 1)\n --est-inbreeding, -I allow haploid (i.e. 
inbreed) stretches. Experimental\n --F0 [F0 [F0 ...]] initial F (should be in [0;1]) (default 0)\n --tau0 [TAU0 [TAU0 ...]]\n initial log-tau (default 0), at most 1 per source\n --e0 E0, -e E0 initial error rate\n --c0 C0, -c C0 initial contamination rate\n\noptions that control the algorithm behavior:\n --gt-mode, --gt Assume genotypes are known.\n -b BIN_SIZE, --bin-size BIN_SIZE\n Size of bins. By default, this is given in 1e-8 cM, so\n that the unit is approximately the same for runs on\n physical / map positions\n --prior PRIOR, -p PRIOR\n Prior of reference allele frequencies. If None\n (default, recommended), this is estimated from the\n data This number is added to both the ref and alt\n allele count for each reference, to reflect the\n uncertainty in allele frequencies from a sample. If\n references are stationary with size 2N, this is\n approximately [\\sum_i^{2N}(1/i) 2N]^{-1}.\n -P, --pos-mode Instad of recombination distances, use physical\n distances for binning\n --max-iter MAX_ITER, -m MAX_ITER\n maximum number of iterations\n --ll-tol LL_TOL stop EM when DeltaLL < ll-tol\n --dont-split-lib estimate one global contamination parameter (default:\n one per read group)\n --autosomes-only Only run autosomes\n --downsample DOWNSAMPLE\n downsample coverage to a proportion of reads\n --init-guess [INIT_GUESS [INIT_GUESS ...]]\n init transition so that one state is favored. should\n be a state in --state-ids\n\ncreating reference file:\n --vcf-ref VCF_REF, --vcf VCF_REF\n VCF File to process. Choose this or reffile. The\n resulting ref file will be writen as {out}.ref.xz, so\n it doesn\'t need to be regenerated. If the input file\n exists, an error is generated unless --force-ref is\n set\n --rec-file REC_FILE, --rec REC_FILE\n Recombination rate file. 
Modelled after\n https://www.well.ox.ac.uk/~anjali/AAmap/ If file is\n split by chromosome, use {CHROM} as wildcards where\n the chromosome id will be included\n --rec-rate REC_RATE Constant recombination rate (per generation per base-\n pair)\n --pos-id POS_ID column name for position (default: Physical_Pos)\n --map-id MAP_ID column name for genetic map (default: AA_Map)\n --force-ref, --force-vcf\n\ncall introgressed fragments:\n --run-penalty RUN_PENALTY\n penalty for runs. Lower value means runs are called\n more stringently (default 0.2)\n --n-post-replicates N_POST_REPLICATES\n Number of replicates that are sampled from posterior.\n Useful for parameter estimation and bootstrapping\n\noutput name and files to be generated:\n By default, all files are generated. However, if any of the --no-\\* options\n are used to disable specific files\n\n --outname OUTNAME, --out OUTNAME, -o OUTNAME\n Output file path (without extensions)\n --no-rle Disabble Estimating runs and writeing to file with\n extension .rle.xz\n --no-snp Disable writing posterior genotype likelihood to file\n with extension .snp.xz\n --no-bin Disable writing posterior states to file with\n extension .bin.xz\n --no-cont Disable writing contamination estimates to file with\n extension .bin.xz\n --no-rsim Disable writing posterior simulations of runs to file\n with extension .res.xz\n --no-pars Disable writing parameters to file with extension\n .pars.yaml\n```\n\n# Admixslug\n\nAdmixslug is a genotype likelihood method for contaminated low-coverage data. It\nworks by computing a conditional site-frequency spectrum. 
It uses mostly the same file\nformats as admixfrog and is therefore for now in the same repository.\n\nDocumentation is still under construction; a typical command is given in the Quickstart section.\n\n\n## Input files\nLike admixfrog, admixslug requires two input files:\n\n - a reference file with information from high-quality samples\n - a sample file that stores read information for a sample in compact format\n\n```\nadmixfrog-bam2 --ref ref/ref_bigsteffi.csv.xz --bamfile bams/bigsteffi/Broion.bam --out samples2/Broion_bigsteffi.in.xz --length-bin-size 1 \n```\nThe reference file is created exactly the same way as in admixfrog. The bamfile\ncontains the reads to be analyzed, and the `--out` flag designates where the\ninput file will be stored. See `admixfrog-bam2 --help` for details.\n\n\n## Quickstart\nThe following command runs admixslug on a single sample stored in\n`samples2/Broion_bigsteffi.in.xz` using the sites from `ref/ref_bigsteffi.csv.xz` \nand saving the output files in `admixslug/jk10/ALT_VIN_CHA_DEN/Broion_bigsteffi` \n\n```\nadmixslug --infile samples2/Broion_bigsteffi.in.xz \\\n --ref ref/ref_bigsteffi.csv.xz \\\n -o admixslug/jk10/ALT_VIN_CHA_DEN/Broion_bigsteffi \\\n --states ALT VIN CHA DEN \\\n --cont-id EUR \\\n --ancestral PAN \\\n --ll-tol 0.01 \\\n --ptol 0.001 \\\n --max-iter 100 \\\n --filter-pos 50 \\\n --filter-ancestral \\\n --len-bin 2000 \\\n --jk-resamples 10\n```\n\nThe remaining arguments are\n\n - `--states` : The reference samples or populations to condition the SFS on\n - `--cont-id` : The putative contaminant panel\n - `--ancestral` : The ancestral state (these three need to be defined in the\n reference file)\n - `--ll-tol`, `--ptol`: Convergence criteria in terms of log-likelihood and\n changes in parameter values, respectively\n - `--max-iter` : The maximum number of iterations\n - `--filter-pos` : filter position to be at least x bases apart\n - `--filter-ancestral` : only retain sites with ancestral allele info\n - `--len-bin k `: attempt to bin reads into bins with 
around k sites. Higher\n numbers of k will result in fewer length-bins for contamination estimation,\n and lower numbers will result in many uncertain estimates\n - `--jk-resamples` : the number of jackknife resamples for standard error\n estimation\n\n\n## Output\nThe main outputs are:\n\n#### Contamination file\nThis file, named {out}.cont.xz, contains contamination info.\n\n#### SFS file\nThis file, named {out}.sfs.xz, contains info on the estimated SFS\n\n#### vcf-file\nThis file, named {out}.vcf, contains a vcf file with i) random read samples, ii)\ngenotype likelihoods and iii) genotype probabilities for all sites with coverage\n\n#### snp-file\nThis file, named {out}.snp.xz, contains similar info as the VCF file, but more\neasily readable in R\n\n\nFull command here:\n```\n usage: admixslug [-h] [-v] [--target-file TARGET_FILE] [--ref REF_FILES]\n [--filter-delta FILTER_DELTA] [--filter-pos FILTER_POS]\n [--filter-map FILTER_MAP] [--filter-high-cov FILTER_HIGH_COV]\n [--filter-ancestral] [--male] [--female] [--chroms CHROMS]\n [--vcf-sample-name VCF_SAMPLE_NAME] [--force-ref]\n [--seed SEED] [--bamfile BAMFILE] [--force-target-file]\n [--deam-cutoff DEAM_CUTOFF] [--minmapq MINMAPQ]\n [--min-length MIN_LENGTH] [--length-bin-size LENGTH_BIN_SIZE]\n [--report-alleles] [--vcfgt VCFGT] [--target TARGET]\n [--dont-est-contamination] [--dont-est-error]\n [--dont-est-bias] [--dont-est-F] [--est-tau]\n [--F0 [F0 [F0 ...]]] [--tau0 [TAU0 [TAU0 ...]]] [--e0 E0]\n [--b0 B0] [--c0 C0] [--max-iter MAX_ITER] [--ll-tol LL_TOL]\n [--ptol PTOL] [--dont-split-lib] [--autosomes-only]\n [--downsample DOWNSAMPLE]\n [--fake-contamination FAKE_CONTAMINATION]\n [--deam-bin-size DEAM_BIN_SIZE] [--len-bin-size LEN_BIN_SIZE]\n [--jk-resamples JK_RESAMPLES] [--outname OUTNAME] [--no-snp]\n [--no-cont] [--no-pars] [--no-sfs] [--no-vcf]\n [--states [STATES [STATES ...]]] [--state-file STATE_FILE]\n [--cont-id CONT_ID] [--ancestral ANCESTRAL]\n [--random-read-samples [RANDOM_READ_SAMPLES 
[RANDOM_READ_SAMPLES ...]]]\n\n Infer sfs and contamination from low-coverage and contaminated genomes\n\n optional arguments:\n -h, --help show this help message and exit\n -v, --version show program\'s version number and exit\n --target-file TARGET_FILE, --infile TARGET_FILE, --in TARGET_FILE\n Sample input file (csv). Contains individual specific\n data, obtained from a bam file. - Fields are chrom,\n pos, map, lib, tref, talt" - chrom: chromosome - pos :\n physical position (int) - map : rec position (float) -\n lib : read group. Any string, same string assumes same\n contamination - tref : number of reference reads\n observed - talt: number of alt reads observed\n --ref REF_FILES, --ref-file REF_FILES\n refernce input file (csv). - Fields are chrom, pos,\n ref, alt, map, X_alt, X_ref - chrom: chromosome - pos\n : physical position (int) - ref : refrence allele -\n alt : alternative allele - map : rec position (float)\n - X_alt, X_ref : alt/ref alleles from any number of\n sources / contaminant populations. these are used\n later in --cont-id and --state-id flags\n --filter-delta FILTER_DELTA\n only use sites with allele frequency difference bigger\n than DELTA (default off)\n --filter-pos FILTER_POS\n greedily prune sites to be at least POS positions\n apart\n --filter-map FILTER_MAP\n greedily prune sites to be at least MAP recombination\n distance apart\n --filter-high-cov FILTER_HIGH_COV, --filter-highcov FILTER_HIGH_COV\n remove SNP with highest coverage (default 0.001, i.e.\n 0.1% of SNP are removed)\n --filter-ancestral remove sites with no ancestral allele information\n --male Assumes haploid X chromosome. Default is guess from\n coverage. currently broken\n --female Assumes diploid X chromosome. 
Default is guess from\n coverage\n --chroms CHROMS, --chromosome-files CHROMS\n The chromosomes to be used in vcf-mode.\n --vcf-sample-name VCF_SAMPLE_NAME\n sample name to be used in admixslug\n --force-ref, --force-vcf\n --seed SEED random number generator seed for resampling\n --vcfgt VCFGT, --vcf-gt VCFGT, --vcf-target_file VCFGT\n VCF input file. To generate input format for admixfrog\n in genotype mode, use this.\n --target TARGET, --sample-id TARGET\n sample id if target is read from vcf or geno file. No\n effect for bam-file\n --no-sfs Disable output of sfs\n --no-vcf Disable output of vcf\n --states [STATES [STATES ...]], --state-ids [STATES [STATES ...]]\n the allowed sources. The target will be made of a mix\n of all homozygous and heterozygous combinations of\n states. More than 4 or 5 sources have not been tested\n and are not recommended. Must be present in the ref\n file\n --state-file STATE_FILE, --pop-file STATE_FILE\n Population assignments (yaml format)\n --cont-id CONT_ID, --cont CONT_ID\n the source of contamination. Must be specified in ref\n file\n --ancestral ANCESTRAL, -a ANCESTRAL\n Outgroup population with the ancestral allele. By\n default, assume ancestral allele is unknown\n --random-read-samples [RANDOM_READ_SAMPLES [RANDOM_READ_SAMPLES ...]], --pseudo-haploid [RANDOM_READ_SAMPLES [RANDOM_READ_SAMPLES ...]]\n Set a sample as a pseudo-haploid random-read sample\n for the reference. This means when creating a\n reference, only one allele is taken.\n\n bam parsing:\n --bamfile BAMFILE, --bam BAMFILE\n Bam File to process. Choose this or target_file. The\n resulting input file will be writen in {out}.in.xz, so\n it doesn\'t need to be regenerated. 
If the input file\n exists, an error is generated unless --force-target-\n file is set\n --force-target-file, --force-bam, --force-infile\n --deam-cutoff DEAM_CUTOFF\n reads with deamination in positions < deam-cutoff are\n considered separately\n --minmapq MINMAPQ reads with mapq < MINMAPQ are removed\n --min-length MIN_LENGTH\n reads with length < MIN_LENGTH are removed\n --length-bin-size LENGTH_BIN_SIZE\n if set, reads are binned by length for contamination\n estimation\n --report-alleles whether contamination/error rates should be\n conditioned on alleles present at locus\n\n options that control estimation of model\n parameters:\n --dont-est-contamination\n Don\'t estimate contamination (default do)\n --dont-est-error estimate sequencing error per rg\n --dont-est-bias merge error rates ref -> alt and alt -> ref\n --dont-est-F Estimate F (distance from ref, default False)\n --est-tau, -tau Estimate tau (population structure in references)\n --F0 [F0 [F0 ...]] initial F (should be in [0;1]) (default 0)\n --tau0 [TAU0 [TAU0 ...]]\n initial log-tau (default 0), at most 1 per source\n --e0 E0, -e E0 initial error rate\n --b0 B0, -b B0 initial ref bias rate\n --c0 C0, -c C0 initial contamination rate\n\n options that control the algorithm behavior:\n --max-iter MAX_ITER, -m MAX_ITER\n maximum number of iterations\n --ll-tol LL_TOL stop EM when DeltaLL < ll-tol\n --ptol PTOL stop EM when parameters change by less than ptol\n --dont-split-lib estimate one global contamination parameter (default:\n one per read group)\n --autosomes-only Only run autosomes\n --downsample DOWNSAMPLE\n downsample coverage to a proportion of reads\n --fake-contamination FAKE_CONTAMINATION\n Adds fake-contamination from the contamination panel\n --deam-bin-size DEAM_BIN_SIZE, --deam-bin DEAM_BIN_SIZE\n bin size for deamination\n --len-bin-size LEN_BIN_SIZE, --len-bin LEN_BIN_SIZE\n bin size for deamination\n --jk-resamples JK_RESAMPLES, --n-resamples JK_RESAMPLES\n number of resamples for 
Jackknife standard error\n estimation\n\n output name and files to be generated:\n By default, all files are generated. However, if any of the --no-* options\n are used to disable specific files\n\n --outname OUTNAME, --out OUTNAME, -o OUTNAME\n Output file path (without extensions)\n --no-snp Disable writing posterior genotype likelihood to file\n with extension .snp.xz\n --no-cont Disable writing contamination estimates to file with\n extension .bin.xz\n --no-pars Disable writing parameters to file with extension\n .pars.yaml\n```\n\n\n',
'author': 'benjamin_peter',
'author_email': 'benjamin_peter@eva.mpg.de',
'maintainer': None,
'maintainer_email': None,
'url': None,
'packages': packages,
'package_data': package_data,
'install_requires': install_requires,
'entry_points': entry_points,
'python_requires': '>=3.8',
}
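# Pass setup_kwargs to the local build script (build.py) first,
# so it can amend them before setup() runs.
from build import build

build(setup_kwargs)

setup(**setup_kwargs)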