StatFunGen · hsun3163 · Mar 1, 2026 · Mar 1, 2026
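
Note on the third hunk: the MultiQC step no longer writes its config through a shell heredoc; it uses a `report:` action with `output = f"{_output:n}.multiqc_config.yml"` instead. The sketch below is only a plain-Python illustration of what that step amounts to. It assumes `{_output:n}` is the output path without its last extension, `:b` the base name, and `:d` the directory. The names `MULTIQC_CONFIG`, `config_path_for` and `multiqc_command` are hypothetical and do not appear in the notebook. Only the two config keys, their values, and the `multiqc` flags are taken from the diff.

```python
from pathlib import PurePosixPath

# Config body as written by the `report:` action in the diff (keys and values
# copied from the hunk; the constant name itself is hypothetical).
MULTIQC_CONFIG = """\
extra_fn_clean_exts:
    - '_rsem'
fn_ignore_dirs:
    - '*_STARpass1'
"""


def config_path_for(report_html):
    """Assumed equivalent of f"{_output:n}.multiqc_config.yml":
    drop the last extension of the report path, then append the suffix."""
    return str(PurePosixPath(report_html).with_suffix("")) + ".multiqc_config.yml"


def multiqc_command(input_dir, report_html):
    """Assemble the multiqc call from the hunk as an argument list.

    input_dir stands in for ${_input:d}; report.name and report.parent stand in
    for ${_output:b} and ${_output:d} respectively.
    """
    report = PurePosixPath(report_html)
    return [
        "multiqc", input_dir, "-v",
        "-n", report.name,
        "-o", str(report.parent),
        "-c", config_path_for(report_html),
    ]


if __name__ == "__main__":
    # Example path is made up for illustration.
    print(" ".join(multiqc_command("out", "out/sample.multiqc_report.html")))
```

Because the config path is derived from the report path in one place, the file that is written and the `-c` argument cannot drift apart, which is the same property the `report:` action gives the workflow.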
diff --git a/code/molecular_phenotypes/calling/RNA_calling.ipynb b/code/molecular_phenotypes/calling/RNA_calling.ipynb
@@ -815,7 +815,7 @@
     "Documentation: [fastp](https://github.com/OpenGene/fastp)\n",
     "\n",
     "We use `fastp` in place of the `Trimmomatic` for fastp's ability to detect adapter from the reads. It was a c++ command line tool published in [Sept 2018](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234). It  will use the following algorithm to detect the adaptors:\n",
-    ">The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k\u2009=\u200910) of first N reads (N\u2009=\u20091\u2009M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter\n",
+    ">The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k = 10) of first N reads (N = 1 M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter\n",
     "\n",
     "It was demonstrated that fastp can remove all the adaptor automatically and completely faster than Trimmomatic and cutadapt\n",
     "\n",
@@ -912,8 +912,8 @@
     "* `software_dir`: directory for the software\n",
     "* `fasta_with_adapters_etc`: **filename** for the adapter reference file. According to `Trimmomatic` documentation,\n",
     "\n",
-    "> As a rule of thumb newer libraries will use `TruSeq3`, but this really depends on your service provider. If you use FASTQC, the \"Overrepresented Sequences\" report can help indicate which adapter file is best suited for your data. \"Illumina Single End\" or \"Illumina Paired End\" sequences indicate single-end or paired-end `TruSeq2` libraries, and the appropriate adapter files are `TruSeq2-SE.fa` and `TruSeq2-PE.fa` respectively. \"TruSeq Universal Adapter\" or \"TruSeq Adapter, Index \u2026\" indicates `TruSeq-3` libraries, and the appropriate adapter files are `TruSeq3-SE.fa` or `TruSeq3-PE.fa`, for single-end and paired-end data respectively. Adapter sequences for `TruSeq2` multiplexed libraries, indicated by \"Illumina Multiplexing \n",
-    "\u2026\", and the various RNA library preparations are not currently included.\n",
+    "> As a rule of thumb newer libraries will use `TruSeq3`, but this really depends on your service provider. If you use FASTQC, the \"Overrepresented Sequences\" report can help indicate which adapter file is best suited for your data. \"Illumina Single End\" or \"Illumina Paired End\" sequences indicate single-end or paired-end `TruSeq2` libraries, and the appropriate adapter files are `TruSeq2-SE.fa` and `TruSeq2-PE.fa` respectively. \"TruSeq Universal Adapter\" or \"TruSeq Adapter, Index …\" indicates `TruSeq-3` libraries, and the appropriate adapter files are `TruSeq3-SE.fa` or `TruSeq3-PE.fa`, for single-end and paired-end data respectively. Adapter sequences for `TruSeq2` multiplexed libraries, indicated by \"Illumina Multiplexing \n",
+    "…\", and the various RNA library preparations are not currently included.\n",
     "\n",
     "We have `fastqc` workflow previously defined and executed. Users should decide what fasta adapter reference to use based on `fastqc` results (or their own knowledge).\n",
     "\n",
@@ -1814,13 +1814,12 @@
     "[rsem_call_3,rnaseqc_call_3]\n",
     "output: f'{_input[0]:nnn}.multiqc_report.html'\n",
     "task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size\n",
+    "report: output = f\"{_output:n}.multiqc_config.yml\"\n",
+    "  extra_fn_clean_exts:\n",
+    "      - '_rsem'\n",
+    "  fn_ignore_dirs:\n",
+    "      - '*_STARpass1'\n",
     "bash:  container=container,expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint\n",
-    "    cat > ${_output:n}.multiqc_config.yml << 'MULTIQC_EOF'\n",
-    "    extra_fn_clean_exts:\n",
-    "        - '_rsem'\n",
-    "    fn_ignore_dirs:\n",
-    "        - '*_STARpass1'\n",
-    "    MULTIQC_EOF\n",
     "    multiqc ${_input:d} -v -n ${_output:b} -o ${_output:d} -c ${_output:n}.multiqc_config.yml"
    ]
   },
@@ -2098,4 +2097,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
diff --git a/code/snakemake_pipeline/Snakefile b/code/snakemake_pipeline/Snakefile