This pipeline processes 16S rRNA amplicon data from Illumina next-generation sequencing reads. The first part is written in R and is a modified version of the DADA2 tutorial pipeline (https://benjjneb.github.io/dada2/tutorial.html). It starts with quality filtering and trimming, which remove low-quality reads and trim primers or adapters based on the quality profiles. Reads are then dereplicated, grouping identical reads together to reduce redundancy, and error rates are learned directly from the data so that sequencing errors are modeled accurately. The core step is the inference of biological sequences at single-nucleotide resolution, which yields amplicon sequence variants (ASVs). Forward and reverse reads are merged to reconstruct full-length amplicons, and chimeric sequences are detected and removed. Finally, taxonomy is assigned using a reference database (SILVA or UNITE). The R part outputs an ASV table and representative sequences ready for downstream analysis.
The second part takes the results of the R pipeline and, through custom Python scripts, generates a table of the differentially abundant species across conditions, together with graphs and supporting material for the analysis.
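To make the dereplication step concrete, here is a toy Python sketch of the idea: identical reads are collapsed into unique sequences with an abundance count. This is only an illustration of the concept and not how DADA2 implements it (the R package also tracks quality information, which is omitted here); the function name and the example reads are made up.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, abundance) pairs.

    Illustrative only: real dereplication also keeps per-sequence
    quality information, which this sketch ignores.
    """
    counts = Counter(reads)
    # Most abundant first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical reads, far shorter than real amplicons.
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]
for sequence, abundance in dereplicate(reads):
    print(sequence, abundance)
```

Working on unique sequences plus their counts, rather than on every read, is what makes the later inference steps cheaper.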
Clone the repository into your environment.
Install the following R packages: Rcpp, dada2, fastqcr, ShortRead, Biostrings, phyloseq, MicrobiotaProcess, ggplot2, ranacapa, string, devtools.
Install the following Python packages: pandas, numpy, matplotlib, openpyxl (os is part of the Python standard library and needs no installation).
Install cutadapt.
Follow the R script, specifying the working folder (where the data will be saved), the number of threads to be used by certain functions, the path to your reads (in .fastq.gz format), and the database to use (SILVA or UNITE).
Create an Excel file named "Campioni.xlsx" with three columns: "Nome", "Replica", and "Nome file fastq (forward reads)". The first column gives the condition of interest for each sample under study, the second a unique numerical ID for the sample, and the third the name of the fastq file associated with that sample.
Open the "main.py" file and set the input and output folders accordingly, then run the script.
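If you prefer to build "Campioni.xlsx" programmatically rather than by hand, the sketch below shows one possible way with pandas and openpyxl. The three column names come from the description above; the conditions, the file names, the helper function names, and the choice of sequential integers as IDs are illustrative assumptions, and whether the script expects the file name with or without its extension should be checked against "main.py".

```python
# Column names required by the pipeline, as listed above.
COLUMNS = ["Nome", "Replica", "Nome file fastq (forward reads)"]


def build_sample_rows(samples):
    """Turn (condition, fastq_file) pairs into sample-sheet rows.

    Each row gets a unique numerical ID (1, 2, 3, ...) in "Replica".
    Sequential numbering is an assumption of this sketch.
    """
    rows = []
    for replica_id, (condition, fastq_name) in enumerate(samples, start=1):
        rows.append({
            COLUMNS[0]: condition,
            COLUMNS[1]: replica_id,
            COLUMNS[2]: fastq_name,
        })
    return rows


def write_sample_sheet(rows, path="Campioni.xlsx"):
    """Write the rows to an Excel file (requires pandas and openpyxl)."""
    import pandas as pd  # imported here so the helper above works without pandas

    pd.DataFrame(rows, columns=COLUMNS).to_excel(path, index=False)


# Hypothetical conditions and file names, for illustration only.
example_samples = [
    ("control", "ctrl_A_R1.fastq.gz"),
    ("control", "ctrl_B_R1.fastq.gz"),
    ("treated", "trt_A_R1.fastq.gz"),
    ("treated", "trt_B_R1.fastq.gz"),
]
rows = build_sample_rows(example_samples)
```

Calling `write_sample_sheet(rows)` then creates the file in the current directory.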
The R part of the script returns a series of folders:
- fastqc: contains the FastQC results before and after trimming the reads (to check whether the trimming was successful and the reads are of sufficient quality)
- cutadapt: contains the reads filtered by cutadapt (so you can resume the pipeline from there when modifying later steps of the script, without refiltering and retrimming)
- dada2_results: contains the files generated by dada2, including the input for phyloseq
- phyloseq: contains the results of the phyloseq analysis, including alpha and beta diversity results
The Python script returns a table and a series of images that summarize the individual tables generated by the R script in a more straightforward, easily interpretable format.
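Before running the Python part, it can be useful to confirm that the R part produced the folders listed above. The following is a minimal sketch; it assumes the four folders sit directly inside the working folder, which may not match your setup, and the function name is made up for illustration.

```python
from pathlib import Path

# Folder names as described above; their location directly under the
# working folder is an assumption of this sketch.
EXPECTED_FOLDERS = ["fastqc", "cutadapt", "dada2_results", "phyloseq"]


def missing_output_folders(working_folder):
    """Return the expected output folders not found in working_folder."""
    root = Path(working_folder)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]
```

For example, `missing_output_folders("path/to/working_folder")` returns an empty list when all four folders are present.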
Victor Borin Centurion for setting up and modifying the R pipeline. Lucas Cappelletti for creating the Python scripts and graphs. Edoardo Bizzotto for polishing the pipeline and automating the process.