Lightweight targeted long-read transcriptomic profiling pipeline for Oxford Nanopore cDNA sequencing data (FFPE-optimized).
This repository contains a reproducible bioinformatic workflow for targeted long-read transcriptomic analysis using Oxford Nanopore Technologies (ONT) sequencing data.
It is designed for gene-panel–restricted expression profiling from cDNA libraries obtained from formalin-fixed paraffin-embedded (FFPE) tumor tissue.
The workflow supports:
- Targeted transcript quantification
- Mismatch repair (MMR) gene profiling
- Immune marker profiling
- Detection of fusion-like multi-gene alignments
- Lightweight normalization (RPM)
- Cohort-wide aggregation
- Optional variant calling for exploratory analysis
The pipeline is computationally lightweight and runs on standard laboratory workstations without HPC resources.
Fast alignment against a curated transcript panel using minimap2.
Detection of:
- T cell activation genes
- Cytotoxicity genes (NKG7, PRF1, GZMB, IFNG)
- Immune checkpoints (PDCD1, CD274, CTLA4, HAVCR2, LAG3, TIGIT)
Quantification of:
- MLH1
- MSH2
- MSH6
- PMS2
- EPCAM
Identification of reads aligning to multiple target genes, supporting:
- Complex rearrangements
- Potential fusion transcripts
- Artifacts in highly rearranged MSI tumors
Optional Python scripts generate:
- Heatmaps
- Barplots
- Immune activation profiles
- Quality control metrics
Validated on multiple runs using FFPE colorectal carcinoma tissue.
This pipeline is ideal for:
- Targeted long-read transcriptomics
- Clinical translational research
- Immune microenvironment characterization
- FFPE sequencing projects
- MSI-high colorectal cancer studies
- Low-input or degraded RNA workflows
- Merge FASTQ files for each barcode
- (Optional) Process with PyChopper to orient cDNA reads
- Align reads to custom transcript panel
- Extract gene names from minimap2 PAF
- Generate raw counts per gene
- Normalize to RPM
- Aggregate across multiple sequencing runs
- Perform one or more optional analyses:
- Immune profiling
- Fusion read detection
- Variant calling
- Heatmap visualization
Required:
- Python ≥ 3.10
- minimap2 ≥ 2.30
- samtools
- seqtk
- awk
- pandas
- numpy
- matplotlib
Optional:
- pychopper (for read orientation)
- bcftools (for variant calling)
Run the pipeline using the provided script:
bash scripts/run_pipeline.sh input.fastq output_folder reference.fastaA small example FASTQ file is provided in the example_data/ directory to demonstrate pipeline execution.
Example run:
bash scripts/run_pipeline.sh example_data/sample.fastq test_output expanded_panel.labeled.faconda create -n ont-cdna python=3.12 -y
conda activate ont-cdnaconda install -c bioconda minimap2=2.30 samtools seqtk bcftools pychopper -ypip install pandas numpy matplotlib seaborn- Basecalled ONT FASTQ files
- Barcoded directory structure:
fastq_pass/run/barcode01/ - Custom transcript reference (FASTA)
- Example included:
expanded_panel.labeled.fa
- Example included:
Edit these two paths first:
BASE="/mnt/d/MSI final"
PANEL="/home/username/path/to/expanded_panel.labeled.fa"
conda activate ont-cdna
set -euo pipefailGenes collected:
GENES='MLH1|MSH2|MSH6|PMS2|EPCAM|ACTB|GAPDH|RPLP0|RPS18|TP53|BRCA1|BRCA2|CD274|PDCD1|CTLA4'OUT="$BASE/combined_counts_rpm.tsv"
echo -e "RunFolder\tSample\tGene\tCount\tRPM" > "$OUT"
for PASS in "$BASE"/fastq_pass*; do
[ -d "$PASS" ] || continue
runname=$(basename "$PASS")
cd "$PASS"
for B in barcode*; do
[ -d "$B" ] || continue
cd "$B"
cat *.fastq *.fastq.gz 2>/dev/null > "${B}.merged.fastq" || true
pychopper "${B}.merged.fastq" "${B}.pychop.fastq" || true
minimap2 -K 64k -t 1 -x map-ont --secondary=no "$PANEL" \
"${B}.pychop.fastq" > "${B}.expanded.paf"
awk -F'\t' '{split($6,a,"|"); g=a[1]; c[g]++}
END{for(g in c) print g"\t"c[g]}' "${B}.expanded.paf" \
| sort > "${B}_counts.tsv"
python3 - <<EOF
import pandas as pd
df=pd.read_csv("${B}_counts.tsv",sep="\t",header=None,names=["Gene","Count"])
tot=df["Count"].sum()
df["RPM"]=(df["Count"]*1e6/tot) if tot>0 else 0
df.to_csv("${B}_counts_rpm.tsv",sep="\t",index=False)
EOF
awk -v run="$runname" -v samp="$B" -v pat="$GENES" \
'BEGIN{FS=OFS="\t"} NR>1 && $1 ~ "^(" pat ")$" {print run,samp,$1,$2,$3}' \
"${B}_counts_rpm.tsv" >> "$OUT"
cd ..
done
doneExtract:
- T cell markers (CD3D, CD3E, CD8A, TRAC)
- Cytotoxic markers (NKG7, PRF1, GZMB, IFNG)
- Immune checkpoints (PDCD1, CD274, CTLA4, HAVCR2)
awk -F'\t' '$3~/^(CD3D|CD3E|CD8A|TRAC|NKG7|PRF1|GZMB|IFNG|PDCD1|CD274|CTLA4|HAVCR2)$/' \
combined_counts_rpm.tsv | column -tawk -F'\t' '{rid=$1; split($6,a,"|"); gene=a[1]; print rid"\t"gene}' \
*.expanded.paf | sort -u > read_to_gene.tsv
cut -f1,2 read_to_gene.tsv | sort -u \
| awk '{k[$1][$2]=1} END{for(r in k){c=0; g=""; for (x in k[r]){c++; g=g","x} if (c>1) print r"\t"c"\t"substr(g,2)}}' \
> fusion_candidates.tsvminimap2 -ax splice -t 2 GRCh38.fa sample.fastq | samtools sort -o aln.bam
samtools index aln.bam
bcftools mpileup -f GRCh38.fa aln.bam | bcftools call -mv -Oz -o variants.vcf.gzExample:
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt
df = pd.read_csv("combined_counts_rpm.tsv", sep="\t")
pivot = df.pivot_table(index="Gene", columns="Sample", values="RPM")
sns.clustermap(pivot, cmap="viridis", figsize=(8,12))
plt.savefig("heatmap.png", dpi=300)Per barcode:
*.expanded.paf*_counts.tsv*_counts_rpm.tsv
Global:
combined_counts_rpm.tsv- Optional: heatmaps, fusion tables, variant VCF
- Python 3.12
- minimap2 2.30
- Single-threaded execution
- RPM normalization
Stable release: v1.0.0
Please cite this repository when publishing results.
Manuscript details will be added after acceptance.
Archived at Zenodo: https://doi.org/10.5281/zenodo.18824372
MIT License