Nextflow pipeline for PacBio sequencing quality control, de novo genome assembly, and de-contamination of non-Nematoda DNA.
_ _ __
| | | | / _|
__ _ ___ ___ ___ _ __ ___ | |__ | |_ _ _ __ | |_
/ _` / __/ __|/ _ \ '_ ` _ \| '_ \| | | | | ______ | '_ \| _|
| (_| \__ \__ \ __/ | | | | | |_) | | |_| | |______| | | | | |
\__,_|___/___/\___|_| |_| |_|_.__/|_|\__, | |_| |_|_|
__/ |
|___/
By default, the pipeline will assemble genomes in "inbred" mode (i.e., will not purge duplicate haplotigs) for hifiasm.
If running with --source umd, then provide the folder in /vast/eande106/data/transfer/raw where sequencing data has been deposited (e.g. --raw_dir 20250314_PacBio). When running in --source umd, the output directory will be set to "${raw_dir}-assembly" even if you specify an --outdir.
If running with --source default, then provide a --sample_sheet that contains columns "strain" and "bam_path" (include these headers and absolute paths to bam files). The user can specify an --outdir where all genomes and statistics will be deposited.
If running with --ext_master, and using the master sheet for merging and re-assembling, then provide either a --sample_sheet or --raw_dir in addition to the --ext_master sheet for running in --source default or umd, respectively. If --ext_master is run with NO matching strains in raw.dir or the provided sample_sheet, the following error will occur:"ERROR ~ Unknown method invocation contains on PoisonPill type -- Did you mean? toString"
This pipeline can also be run in --source default with a --sample_sheet that contains bams from multiple runs of the same strain, and merging of sequencing data for the same strain will still occur as the pipeline checks for multiple entries of the same strain.
To filter out contigs that might be constructed from non-Nematoda DNA, use parameter --blobtools set to "yes" when running with --type reads. If the user desires to only run blobtools on already assembled contigous genomes in the FASTA format, the user can specify --type assembly and exclude the --blobtools parameter and provide the absolute path to a TSV file with header: strain, asm_fa, species, bam, r_stats. The absolute paths must be provided for asm_fa, bam, and r_stats, and the TSV should be supplied to parameters --sample_sheet. If working with an assembly that was created from merging bams, please provide multiple entries for the same assembly - one entry for each bam file.
When running in Rockfish, use -profile rockfish and it will use the conf/rockfish.config configuration file. Use flag "-resume" to resume an analysis and retrieve any cached data.
nextflow run main_wBlobToolKit.nf --source default --type reads --sample_sheet --raw_dir 2025_newData --blobtools yes -profile rockfish
nextflow run main_wBlobToolKit.nf --source default --type reads --sample_sheet /<absolutePath>/strain_bamPaths.tsv --outdir 2025_assemblies -profile rockfish
Running in --source default with an --ext_master sheet provided to merge bam files of the same strain, and then running blobtools on the assemblies
nextflow run main_wBlobToolKit.nf --source default --type reads --sample_sheet /<absolutePath>/strain_bamPaths.tsv --outdir assembleAndBlob --blobtools yes -profile rockfish
nextflow run main_wBlobToolKit.nf --source default --type assembly --sample_sheet /<absolutePath>/strain_asmPaths.tsv --outdir blob_assembly -profile rockfish