Draft
32 commits
200ee4e
modules/genome_db.py: circular genome
domenico-simone Jun 30, 2020
a4923d9
New function and rule sam_to_ids, with aux function to check bitwise …
domenico-simone Jun 30, 2020
54e6610
keep_orphans
domenico-simone Jul 1, 2020
f7b4e74
Fixed keywords
domenico-simone Jul 1, 2020
d2742bb
Fixed keywords
domenico-simone Jul 2, 2020
899a95b
Fixed function import and keywords
domenico-simone Jul 2, 2020
4382614
modules/filter_alignments.py: added functions cat_alignment, cat_alig…
domenico-simone Jul 3, 2020
67641c5
Function sam_to_ids fixed
domenico-simone Jul 3, 2020
3918a6f
Fixed new alignment filtering workflow
domenico-simone Jul 3, 2020
6d776ee
Update install.sh
domenico-simone Aug 24, 2020
2eba604
install.sh: new alias command
domenico-simone Aug 25, 2020
c73fd6f
install.sh: updated alias
domenico-simone Aug 25, 2020
c4cb79d
doc/installation.rst: updates
domenico-simone Aug 25, 2020
fd92e07
envs/mtoolbox.yaml, added seqtk. Fixes #2
domenico-simone Aug 26, 2020
a929f27
variant_calling.snakefile: fixed function import
domenico-simone Aug 26, 2020
1f47c7b
New documentation: a beginning
domenico-simone Sep 10, 2020
71fc57e
New documentation, work in progress
domenico-simone Sep 10, 2020
9ff0adf
New documentation: some names are bold, were inline code
domenico-simone Sep 10, 2020
68d273b
README.md redirects to readthedocs
domenico-simone Sep 10, 2020
3b556df
README.md redirects to readthedocs
domenico-simone Sep 10, 2020
4ea48cf
run-the-pipeline.rst: picture updated
domenico-simone Sep 10, 2020
cc7433b
config.yaml: clean up
domenico-simone Sep 10, 2020
b8aa8f7
New documentation for config.yaml
domenico-simone Sep 10, 2020
e76033b
SE files are cat'd before mapping vs nuclear. Fixes #5
domenico-simone Sep 11, 2020
e058ebe
variant calling: handling merging of one VCF. Fixes #6
domenico-simone Sep 11, 2020
b86280d
Merge branch 'new_alignment_wf_2' into sept_2020_doc
domenico-simone Sep 11, 2020
e157610
Documentation: results
domenico-simone Sep 15, 2020
8246200
Documentation: fix
domenico-simone Sep 15, 2020
6314366
Documentation: more about running the pipeline
domenico-simone Sep 15, 2020
8f5b60b
Documentation: more about running the pipeline
domenico-simone Sep 15, 2020
8d90713
Documentation: more about running the pipelines
domenico-simone Sep 15, 2020
b751ac8
Documentation: more about running the pipelines
domenico-simone Sep 15, 2020
443 changes: 4 additions & 439 deletions README.md

Large diffs are not rendered by default.

15 changes: 10 additions & 5 deletions config.yaml
@@ -3,31 +3,36 @@ map_dir: "map"
log_dir: "logs"
tmp_dir: "/tmp"
species:
use_custom_n_genome: False

read_processing:
trimmomatic:
options: "-phred33"
processing_options: "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
java_cmd: "java"
#jar_file: "/crex/proj/uppstore2018171/conda_envs/mtoolbox-ark/share/trimmomatic/trimmomatic.jar"
java_vm_mem: "4G"
threads: 4

map:
# where to store gmap dbs?
gmap_db_dir: "gmap_db"
gmap_threads: 4
gmap_remap_threads: 4

# wanna keep reads that lost their mate after the first mapping round?
keep_orphans: True
# remove duplicates before variant detection?
mark_duplicates: False

# mask (soft-clip) 10nt of alignments at the reads' ends before variant detection?
# this is suggested to minimize spurious variant calling due to
# misalignments at the ends of reads
trimBam: False

mtvcf_main_analysis:
# minimum distance from read end to keep in a mutation
# minimum distance from read end to keep in an indel
tail: 5
# minimum distance from read end to keep in a non-indel mutation
tail_mismatch: 5
# minimum QS for mutation
Q: 25
# minimum coverage depth for a mutation to be kept
minrd: 5
tail_mismatch: 5
2 changes: 1 addition & 1 deletion doc/_build/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: ef1168243bada5412d7a59954a1116df
config: adea03f6794de309410a30038656fa63
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file not shown.
Binary file modified doc/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified doc/_build/.doctrees/index.doctree
Binary file not shown.
Binary file added doc/_build/.doctrees/installation.doctree
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added doc/_build/.doctrees/run-the-pipeline.doctree
Binary file not shown.
Binary file added doc/_build/_images/MToolBox_conf_files.png
21 changes: 15 additions & 6 deletions doc/_build/_sources/index.rst.txt
@@ -1,17 +1,23 @@
.. MToolBox-Ark documentation master file, created by
.. MToolBox documentation master file, created by
sphinx-quickstart on Thu Jul 25 14:43:49 2019.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

Welcome to MToolBox-Ark's documentation!
========================================
Welcome to the MToolBox-snakemake documentation!
================================================

**MToolBox** is a pipeline for SNP calling and annotation in mitochondrial genomes. `Since its first publication in 2014`_, it has been used in **>30 peer-reviewed clinical studies**. We have developed a **new snakemake implementation** (available `here`_) that includes the `mtoolnote`_ module for functional annotation of variants. The aim is an integrated, user-friendly tool that produces results, tables and plots ready to be discussed and used to drive downstream analyses and wet-lab experiments.

With this new implementation, MToolBox can also analyse mt data from **other species**!

**Eager to use MToolBox? Follow our tutorials to install and run the pipeline!**

.. toctree::
:maxdepth: 2
:caption: Contents:
:caption: Contents

feature-a

installation
run-the-pipeline

Indices and tables
==================
@@ -20,3 +26,6 @@ Indices and tables
* :ref:`modindex`
* :ref:`search`

.. _`here`: https://github.com/mitoNGS/MToolBox_snakemake
.. _`Since its first publication in 2014`: https://pubmed.ncbi.nlm.nih.gov/25028726
.. _`mtoolnote`: https://github.com/mitoNGS/mtoolnote
42 changes: 42 additions & 0 deletions doc/_build/_sources/installation.rst.txt
@@ -0,0 +1,42 @@
Installation
============

Install Anaconda
----------------

`MToolBox_snakemake`_, an update of the `MToolBox pipeline`_, is deployed in a conda environment, *i.e.* a virtual environment with all the needed tools/modules. You therefore need to install Anaconda before installing the pipeline.

To do so, please follow the instructions at http://docs.anaconda.com/anaconda/install/linux/ (hint: download the Anaconda installer to your personal directory with ``wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh``).

Install MToolBox-snakemake
--------------------------

We recommend downloading MToolBox-snakemake by cloning the official repo on `GitHub`_:

.. code-block:: bash

    # Pick a folder for your installation
    # and replace /path/to/MToolBox_snakemake with it
    cd /path/to/MToolBox_snakemake

    # fetch repo
    git clone https://github.com/mitoNGS/MToolBox_snakemake.git

Please note: you could also conveniently download MToolBox-snakemake at `this link`_, but by doing so you will miss the chance to easily integrate future updates!

Once you have cloned (or downloaded and unzipped) the repo, installing MToolBox should be as easy as running:

.. code-block:: bash

    cd MToolBox_snakemake
    bash install.sh

The setup script ``install.sh`` will:

- install the ``mtoolbox`` conda environment with all the required dependencies
- create a command (``mtoolbox-activate``) which will be used to activate the MToolBox conda environment and add the folders of MToolBox executables and utilities to your ``PATH``.

.. _`MToolBox_snakemake`: https://github.com/mitoNGS/MToolBox_snakemake
.. _`MToolBox pipeline`: https://github.com/mitoNGS/MToolBox
.. _`GitHub`: https://github.com/
.. _`this link`: https://github.com/mitoNGS/MToolBox_snakemake/archive/master.zip
12 changes: 12 additions & 0 deletions doc/_build/_sources/mtoolbox-variant-annotation.rst.txt
@@ -0,0 +1,12 @@
.. _mtoolbox_variant_annotation:

MToolBox-variant-annotation
===========================

This wrapper performs functional annotation of variants reported in the VCF file (the output of the :ref:`mtoolbox_variant_calling` workflow) with `mtoolnote`_. If the VCF file is not present, the wrapper will first run the :ref:`mtoolbox_variant_calling` workflow to produce it. The final output is an annotated VCF file.

.. note:: If you already have a VCF of mt variants, you might consider annotating it by running `mtoolnote`_ directly.

The setup of this workflow is detailed in :ref:`the setup of the MToolBox-variant-calling workflow<setup_working_directory>`.

.. _`mtoolnote`: https://github.com/mitoNGS/mtoolnote
53 changes: 53 additions & 0 deletions doc/_build/_sources/mtoolbox-variant-calling.rst.txt
@@ -0,0 +1,53 @@
.. _mtoolbox_variant_calling:

MToolBox-variant-calling
========================

This wrapper performs QC, quality trimming of raw reads, read alignment, alignment filtering and variant calling. The final output is a VCF file.

What should I look at here?
---------------------------

**TL;DR**

- the VCF file in the :code:`results/vcf` folder
- the BED file(s) in the :code:`results/<sample>` folder.

Both file formats can be imported into a genome browser (*e.g.* IGV) to visually inspect your results.

The VCF output
^^^^^^^^^^^^^^

The VCF (variant call format) file is roughly a table where, after a ton of comment lines (starting with :code:`##`), rows are variants and, after the first fixed columns, each column is a sample. You can find a detailed description of the VCF format `here`_; MToolBox-snakemake uses a slightly modified version in order to report the allele heteroplasmy frequency. To spare you some headache, here is a brief summary of the genotype info you'll find for each sample.

The :code:`FORMAT` field lists, in order, the data types reported for each sample. Each sample column (the fields after :code:`FORMAT`) holds colon-separated values matching those types.

- :code:`GT` (genotype) reports all the alleles found for that sample, where :code:`0` is the reference allele (:code:`REF` field) and :code:`1`, :code:`2`, ... are the alleles in the same order as in the :code:`ALT` field.
- :code:`DP` (depth) reports the total coverage depth for that site in the genome, *i.e.* the number of reads mapping on it.
- :code:`HF` (heteroplasmic frequency) reports, for each allele in :code:`GT` (excluding :code:`REF`), the fraction of reads at that site supporting the allele.
- :code:`CILOW` (confidence interval, lower bound) reports, for each allele in :code:`GT` (excluding :code:`REF`), the lower bound of the CI of its HF.
- :code:`CIUP` (confidence interval, upper bound) reports, for each allele in :code:`GT` (excluding :code:`REF`), the upper bound of the CI of its HF.
- :code:`SDP` (strand read depth) reports, for each allele in :code:`GT` (excluding :code:`REF`), the number of times the allele was observed on the plus strand and on the minus strand, semicolon-separated.

If you consider this example:

.. code-block:: bash

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Scer_mt_500K Scer_mt_100K
    NC_001224.1 50000 . A C . PASS AN=4;AC=2 GT:DP:HF:CILOW:CIUP:SDP 0/1:359:0.46:0.409:0.511:90;75 0/1:75:0.387:0.284:0.5:11;18
    NC_001224.1 69045 . C T . PASS AN=2;AC=1 GT:DP:HF:CILOW:CIUP:SDP 0/1:365:0.033:0.018:0.057:0;12 ./.:.:.:.:.:.

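sample Scer_mt_500K, in mt position 50000, has 359 aligned reads and one variant allele (:code:`C`) with HF=0.46. The CI for this HF is 0.409 to 0.511. The variant allele is supported by 90 reads on the plus strand and 75 reads on the minus strand, *i.e.* 165 of the 359 reads, which matches the reported HF. Sample Scer_mt_100K has no call at position 69045, so all of its fields are reported as missing (:code:`.`).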
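
If you want to read these per-sample fields in a script, the following is a minimal Python sketch, not part of MToolBox. The function names ``parse_sample`` and ``summarise`` are hypothetical, and the sketch assumes that values for multiple :code:`ALT` alleles are comma-separated, which the example above does not show.

```python
def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys with one sample's colon-separated values.

    Returns None when the sample has no call (GT is "./.").
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    record = dict(zip(keys, values))
    if record.get("GT") in (None, "./."):
        return None
    return record


def summarise(record):
    """Return one dict per ALT allele with HF, CI and strand support.

    Assumption: values for several ALT alleles are comma-separated.
    """
    depth = int(record["DP"])
    hfs = [float(x) for x in record["HF"].split(",")]
    lows = [float(x) for x in record["CILOW"].split(",")]
    ups = [float(x) for x in record["CIUP"].split(",")]
    # SDP holds "plus;minus" read counts for each allele
    strands = [
        tuple(int(n) for n in pair.split(";"))
        for pair in record["SDP"].split(",")
    ]
    summary = []
    for hf, low, up, (plus, minus) in zip(hfs, lows, ups, strands):
        summary.append({
            "hf": hf,
            "ci": (low, up),
            "plus": plus,
            "minus": minus,
            # fraction of all reads at the site that carry the allele
            "support_fraction": round((plus + minus) / depth, 3),
        })
    return summary


fmt = "GT:DP:HF:CILOW:CIUP:SDP"
called = parse_sample(fmt, "0/1:359:0.46:0.409:0.511:90;75")
missing = parse_sample(fmt, "./.:.:.:.:.:.")
print(summarise(called))
print(missing)
```

For the first sample in the example, the strand counts add up to 165 of 359 reads, so ``support_fraction`` agrees with the reported HF after rounding.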

The BED output
^^^^^^^^^^^^^^

The `BED (browser extensible data) file`_ is a useful and intuitive way to inspect the variant calling results through a genome browser. Once you import this file into a genome browser, variants will be colour-coded (blue for mutations, green for insertions, red for deletions) and shaded according to the HF (the darker the shade, the higher the HF).
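
To give an idea of how such colour-coding can be expressed in a BED file, here is a rough Python sketch. It is an illustration only: the base colours, the blending towards white, the score scaling and the names ``shade`` and ``bed_line`` are all assumptions and are not taken from the MToolBox code. It relies only on the standard nine-column BED layout, where the last column (``itemRgb``) holds the colour and coordinates are 0-based and half-open.

```python
# Hypothetical base colours (R, G, B) for each variant type
BASE_COLOURS = {
    "mutation": (0, 0, 255),   # blue
    "insertion": (0, 255, 0),  # green
    "deletion": (255, 0, 0),   # red
}


def shade(kind, hf):
    """Blend the base colour with white: a higher HF gives a darker shade."""
    if not 0 <= hf <= 1:
        raise ValueError("HF must be between 0 and 1")
    base = BASE_COLOURS[kind]
    # hf=1 returns the base colour, hf=0 returns white
    return tuple(round(255 - (255 - c) * hf) for c in base)


def bed_line(chrom, pos, name, kind, hf):
    """Build a 9-column BED line for a single-base variant.

    pos is the 1-based VCF position; BED start is 0-based and
    the end coordinate is exclusive.
    """
    start = pos - 1
    end = pos
    rgb = ",".join(str(c) for c in shade(kind, hf))
    score = round(hf * 1000)  # BED scores range from 0 to 1000
    fields = [chrom, start, end, name, score, ".", start, end, rgb]
    return "\t".join(str(f) for f in fields)


print(bed_line("NC_001224.1", 50000, "A50000C", "mutation", 0.46))
```

Genome browsers generally need the ``itemRgb`` option enabled on the track for the colours to be displayed.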

What to do next?
^^^^^^^^^^^^^^^^

Once you have run the wrapper, you will notice that the :code:`results` folder includes a ton of files. A guide to these files is coming soon.

.. _`here`: https://www.internationalgenome.org/wiki/Analysis/vcf4.0
.. _`BED (browser extensible data) file`: https://m.ensembl.org/info/website/upload/bed.html