This workflow makes it easy to run a number of population structure analysis tools, with the goal of streamlining comparisons between methods and approaches. The original motivation was to compare EEMS with other methods.
The ultimate goal is to be able to use a command of the form
`snakemake figures/pca/2d/europe0_pc1.png`
to automatically and reproducibly create a figure of a 2D PCA plot for a subset of the data called "europe0", with sensible choices for display; the command
`snakemake all_subsets_pca`
to generate PCA plots for all defined subsets of the data; and e.g.
`snakemake figures/pca/2d/poster/europe0_pc1.png`
to automatically generate a version appropriate for a poster.
The first two commands are currently implemented (and will automatically create the subset, do some basic QC, run PCA and plot the result).
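To illustrate how a target such as `figures/pca/2d/europe0_pc1.png` can carry the subset name and the principal component, here is a minimal Python sketch that matches a target path against an output pattern. The pattern string and the wildcard names `subset` and `pc` are assumptions made for illustration; they are not taken from the actual rules in `sfiles/pca.snake`, and Snakemake's own matching is more permissive (its wildcards match `.+` by default).

```python
import re

# Hypothetical output pattern, written the way a rule might declare it.
PATTERN = "figures/pca/2d/{subset}_pc{pc}.png"


def pattern_to_regex(pattern):
    """Turn a '{name}'-style pattern into a compiled regex with named groups."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)             # literal path text
        else:
            regex += "(?P<%s>[^/]+)" % part      # wildcard, restricted to one path segment
    return re.compile("^" + regex + "$")


def match_target(target, pattern=PATTERN):
    """Return the wildcard values for a target, or None if it does not match."""
    m = pattern_to_regex(pattern).match(target)
    return m.groupdict() if m else None


print(match_target("figures/pca/2d/europe0_pc1.png"))
# {'subset': 'europe0', 'pc': '1'}
```

Under this sketch the poster variant (`figures/pca/2d/poster/europe0_pc1.png`) does not match the pattern, so it would need its own rule or an extra wildcard.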
- the `Snakefile` is the main file that is used to call all analyses
- `sfiles/` contains rules for specific tools. `sfiles/pca.snake`, for example, controls input formatting, option handling, running and plotting for PCA plots; `sfiles/eems.snake` does the same for EEMS.
- `scripts/` contains scripts that are called from rules specified in `sfiles`.
- `config/` contains the configuration files that specify the analyses.
- `subsetter` contains a python module that handles subsetting data. This is currently done using plink, but support for another tool (e.g. vcftools or ANGSD) could possibly be developed.
Thus, ideally a user of the pipeline would only need to change some config files, whereas a developer of a new method would need to write rules (in sfiles/) and modify the Snakefile to include that file. This modular approach gives developers full freedom in how they implement their approach, as long as they specify the files required and the files generated.
These methods are all implemented to varying degrees of completeness. EEMS and flashpca are well implemented; admixture and pong are as well, with the caveat that the ordering of samples is at times strange.
- `cluster.yaml` contains job-specific info for cluster resources
- `config.yaml` contains data- and server-specific info, in particular paths to the data and executables
- `subset.yaml` contains info for subsets, i.e. which samples should be included in a run
- `eems.yaml` contains specifications for different types of EEMS runs
- `plots.yaml` contains info about options for different EEMS plots
The major limitation of the repo currently is that none of the options are documented, so it will be unusable without digging through the files.
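Since the option schema is not documented, the following Python sketch only illustrates the general idea of a subset definition deciding which samples go into a run. The dictionary layout (a `populations` key per subset), the sample tuples and the function name `keep_lines` are all hypothetical and do not reflect the real `subset.yaml`; the dictionary stands in for whatever a YAML parser would return.

```python
# Hypothetical parsed subset config; the real subset.yaml schema may differ.
SUBSETS = {
    "europe0": {"populations": ["French", "Sardinian"]},
    "world": {"populations": None},  # None means: keep every population
}

# Hypothetical metadata rows: (family ID, individual ID, population label).
SAMPLES = [
    ("F1", "ind1", "French"),
    ("F2", "ind2", "Sardinian"),
    ("F3", "ind3", "Yoruba"),
]


def keep_lines(subset_name, subsets=SUBSETS, samples=SAMPLES):
    """Return 'FID<TAB>IID' lines for the samples belonging to a subset."""
    wanted = subsets[subset_name]["populations"]
    return [
        f"{fid}\t{iid}"
        for fid, iid, pop in samples
        if wanted is None or pop in wanted
    ]


print("\n".join(keep_lines("europe0")))
```

Lines of this form (family ID and individual ID in the first two columns) are what plink expects in a file passed to `--keep`.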
Genotypic data is stored in plink format.
Metadata/location data is stored using John Novembre's PopGenStructures data format, with some minor (recommended) changes.
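Because genotypes are stored in plink format and subsetting is currently done with plink, a subsetting step can be thought of as building a plink call. The sketch below is a hedged illustration, not the code of the `subsetter` module: the function name and all file paths are made up, and only the standard plink flags `--bfile`, `--keep`, `--make-bed` and `--out` are used.

```python
import shlex


def plink_subset_command(bfile, keep_file, out_prefix, plink="plink"):
    """Build the argument list for extracting a sample subset with plink.

    bfile      -- prefix of the input .bed/.bim/.fam fileset
    keep_file  -- text file with one 'FID IID' pair per line
    out_prefix -- prefix for the output .bed/.bim/.fam fileset
    """
    return [
        plink,
        "--bfile", bfile,
        "--keep", keep_file,
        "--make-bed",
        "--out", out_prefix,
    ]


# Hypothetical paths, for illustration only.
cmd = plink_subset_command("data/all", "subsets/europe0.keep", "subsets/europe0")
print(shlex.join(cmd))
# plink --bfile data/all --keep subsets/europe0.keep --make-bed --out subsets/europe0
```

Returning a list rather than a single string means the command can be handed to `subprocess.run(cmd, check=True)` without shell quoting issues.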
The pipeline is implemented using Snakemake, with python for most data wrangling and R for most plotting.
This is a draft intended to showcase the planned structure of the project. This is NOT a working version (the version I use handles sensitive data, so I cannot just push it to GitHub).