Merged
19 changes: 9 additions & 10 deletions code/molecular_phenotypes/calling/RNA_calling.ipynb
@@ -815,7 +815,7 @@
"Documentation: [fastp](https://github.com/OpenGene/fastp)\n",
"\n",
"We use `fastp` in place of `Trimmomatic` because fastp can detect adapters from the reads themselves. It is a C++ command-line tool published in [Sept 2018](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234). It uses the following algorithm to detect the adapters:\n",
">The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k\u2009=\u200910) of first N reads (N\u2009=\u20091\u2009M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter\n",
">The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k = 10) of first N reads (N = 1 M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter\n",
"\n",
"It was demonstrated that fastp can remove all adapters automatically and completely, and faster than Trimmomatic and cutadapt.\n",
"\n",
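Review note: the quoted passage describes the seed-finding step of adapter detection in words. The following is a minimal Python sketch of that step only, not fastp's actual C++ implementation. The function name `find_adapter_seeds` and the low-complexity rule (fewer than three distinct bases) are assumptions made for illustration, and the tree-based seed extension is left out.

```python
from collections import Counter
from itertools import islice


def find_adapter_seeds(reads, k=10, max_reads=1_000_000, min_freq=0.0001):
    """Return (k-mer, frequency) pairs sorted by descending frequency.

    Hypothetical helper: mirrors the quoted description (k = 10, first
    1 M reads, frequency threshold 0.0001), not fastp's real code.
    """
    counts = Counter()
    # Count every k-mer in the first `max_reads` reads.
    for read in islice(reads, max_reads):
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    total = sum(counts.values())
    if total == 0:
        return []

    seeds = []
    for kmer, n in counts.items():
        freq = n / total
        # Assumed low-complexity filter: drop k-mers built from
        # fewer than three distinct bases (e.g. poly-A runs).
        if freq > min_freq and len(set(kmer)) >= 3:
            seeds.append((kmer, freq))

    # Most frequent seeds first, as in the quoted description.
    seeds.sort(key=lambda item: item[1], reverse=True)
    return seeds
```

For example, `find_adapter_seeds(["AGATCGGAAG"] * 3 + ["AAAAAAAAAA"] * 5)` keeps only the `AGATCGGAAG` seed, because the poly-A k-mer is rejected by the assumed low-complexity rule.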
@@ -912,8 +912,8 @@
"* `software_dir`: directory for the software\n",
"* `fasta_with_adapters_etc`: **filename** for the adapter reference file. According to the `Trimmomatic` documentation,\n",
"\n",
"> As a rule of thumb newer libraries will use `TruSeq3`, but this really depends on your service provider. If you use FASTQC, the \"Overrepresented Sequences\" report can help indicate which adapter file is best suited for your data. \"Illumina Single End\" or \"Illumina Paired End\" sequences indicate single-end or paired-end `TruSeq2` libraries, and the appropriate adapter files are `TruSeq2-SE.fa` and `TruSeq2-PE.fa` respectively. \"TruSeq Universal Adapter\" or \"TruSeq Adapter, Index \u2026\" indicates `TruSeq-3` libraries, and the appropriate adapter files are `TruSeq3-SE.fa` or `TruSeq3-PE.fa`, for single-end and paired-end data respectively. Adapter sequences for `TruSeq2` multiplexed libraries, indicated by \"Illumina Multiplexing \n",
"\u2026\", and the various RNA library preparations are not currently included.\n",
"> As a rule of thumb newer libraries will use `TruSeq3`, but this really depends on your service provider. If you use FASTQC, the \"Overrepresented Sequences\" report can help indicate which adapter file is best suited for your data. \"Illumina Single End\" or \"Illumina Paired End\" sequences indicate single-end or paired-end `TruSeq2` libraries, and the appropriate adapter files are `TruSeq2-SE.fa` and `TruSeq2-PE.fa` respectively. \"TruSeq Universal Adapter\" or \"TruSeq Adapter, Index \" indicates `TruSeq-3` libraries, and the appropriate adapter files are `TruSeq3-SE.fa` or `TruSeq3-PE.fa`, for single-end and paired-end data respectively. Adapter sequences for `TruSeq2` multiplexed libraries, indicated by \"Illumina Multiplexing \n",
"\", and the various RNA library preparations are not currently included.\n",
"\n",
"We have previously defined and executed a `fastqc` workflow. Users should decide which fasta adapter reference to use based on the `fastqc` results (or their own knowledge).\n",
"\n",
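Review note: the rule of thumb quoted from the `Trimmomatic` documentation can be written down as a small lookup. This is a sketch under stated assumptions: `pick_adapter_file` is a hypothetical helper, not part of the notebook, and it assumes the FastQC "Overrepresented Sequences" source labels start with the strings quoted above.

```python
def pick_adapter_file(overrepresented_sources, paired_end):
    """Suggest an adapter file name from FastQC source labels.

    Hypothetical helper. Returns None when no quoted rule matches,
    e.g. for "Illumina Multiplexing ..." labels, whose adapter
    sequences the quoted text says are not included.
    """
    for source in overrepresented_sources:
        # TruSeq3 libraries: the file depends on the library layout.
        if source.startswith(("TruSeq Universal Adapter", "TruSeq Adapter, Index")):
            return "TruSeq3-PE.fa" if paired_end else "TruSeq3-SE.fa"
        # TruSeq2 libraries: the label itself names the layout.
        if source.startswith("Illumina Single End"):
            return "TruSeq2-SE.fa"
        if source.startswith("Illumina Paired End"):
            return "TruSeq2-PE.fa"
    return None
```

The result would still need to be checked by the user before being passed as `fasta_with_adapters_etc`, since the quoted text notes that the right choice depends on the service provider.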
@@ -1814,13 +1814,12 @@
"[rsem_call_3,rnaseqc_call_3]\n",
"output: f'{_input[0]:nnn}.multiqc_report.html'\n",
"task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size\n",
"report: output = f\"{_output:n}.multiqc_config.yml\"\n",
" extra_fn_clean_exts:\n",
" - '_rsem'\n",
" fn_ignore_dirs:\n",
" - '*_STARpass1'\n",
"bash: container=container,expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint\n",
" cat > ${_output:n}.multiqc_config.yml << 'MULTIQC_EOF'\n",
" extra_fn_clean_exts:\n",
" - '_rsem'\n",
" fn_ignore_dirs:\n",
" - '*_STARpass1'\n",
" MULTIQC_EOF\n",
" multiqc ${_input:d} -v -n ${_output:b} -o ${_output:d} -c ${_output:n}.multiqc_config.yml"
]
},
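Review note: the hunk above shows the MultiQC config being written two ways, with a SoS `report:` action and with a bash heredoc, before `multiqc` is called with `-c`. The sketch below restates that step in plain Python to make the file naming explicit. The helper names `write_multiqc_config` and `multiqc_command` are hypothetical; only the config keys and the command-line flags are taken from the diff.

```python
from pathlib import Path

# Config content as it appears in the diff.
MULTIQC_CONFIG = """\
extra_fn_clean_exts:
  - '_rsem'
fn_ignore_dirs:
  - '*_STARpass1'
"""


def config_path(report_html):
    """Mirror `{_output:n}.multiqc_config.yml`: drop the last extension, add the suffix."""
    report = Path(report_html)
    return report.parent / (report.stem + ".multiqc_config.yml")


def write_multiqc_config(report_html):
    """Write the config file next to the report and return its path."""
    path = config_path(report_html)
    path.write_text(MULTIQC_CONFIG)
    return path


def multiqc_command(input_dir, report_html):
    """Build (but do not run) the multiqc call shown in the diff."""
    report = Path(report_html)
    return [
        "multiqc", str(input_dir), "-v",
        "-n", report.name,          # ${_output:b}
        "-o", str(report.parent),   # ${_output:d}
        "-c", str(config_path(report)),
    ]
```

The returned list could be handed to `subprocess.run` on a machine where MultiQC is installed; the sketch stops short of running it.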
@@ -2098,4 +2097,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
242 changes: 0 additions & 242 deletions code/snakemake_pipeline/Snakefile

This file was deleted.
