conduitR

Lifecycle: experimental

What is conduitR?

conduitR is an R package for metaproteomics: the large-scale identification and quantification of proteins from microbial communities (e.g. gut microbiome, soil, bioreactors). It provides a single, consistent toolkit for building search databases, processing DIA-NN output, linking proteins to taxonomy and function, and running differential analysis and visualizations.

The package powers Conduit (a Snakemake workflow for metaproteomics) and Conduit-GUI (a graphical interface to explore Conduit results), but you can use conduitR on its own for custom pipelines and analyses.

Typical workflow

  1. Database building — Get proteome FASTA files from UniProt by organism or proteome ID, concatenate them, and optionally create custom FASTA from a list of UniProt accessions.
  2. Import & structure — Convert DIA-NN parquet reports into a QFeatures object (precursors → peptides → protein groups) with assay links.
  3. Annotations — Attach taxonomy, Gene Ontology, KEGG, EggNOG, or CAZy annotations from UniProt and optional conduit annotation tables.
  4. Analysis — Run limma-style differential expression, over-representation (ORA), or GSEA; train classification/regression models (e.g. random forest, XGBoost).
  5. Visualization — Volcano plots, heatmaps, PCA biplots, taxonomic heat trees, sunbursts, and KEGG pathway figures, with consistent Conduit themes and palettes.

Features

Data and databases

  • Download proteome FASTA files from UniProt (UniProtKB and UniParc) by proteome or organism ID.
  • Concatenate FASTA files and extract metadata (protein ID, organism, taxonomy) from UniProt-style headers.
  • Create custom FASTA from a list of UniProt accessions.
  • Fetch NCBI taxonomy and UniProt proteome metadata (organism ID, proteome type).
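As a rough illustration of the header-metadata extraction described above, the sketch below pulls the accession, organism name (OS=), and NCBI taxonomy ID (OX=) out of UniProt-style FASTA headers using only base R. The function name parse_uniprot_header is hypothetical and is not part of conduitR; the second header is a made-up string in the UniProt layout. Headers that do not follow the layout are not handled here.

```r
# Hypothetical helper (not conduitR's API): parse UniProt-style FASTA headers.
# Assumed layout: ">db|ACCESSION|ENTRY_NAME Description OS=Organism OX=TaxID ..."
parse_uniprot_header <- function(headers) {
  # The accession sits between the first and second "|"
  accession <- sub("^>?[a-z]{2}\\|([^|]+)\\|.*$", "\\1", headers)
  # The organism name follows "OS=" and ends just before " OX="
  organism <- sub("^.* OS=(.*?) OX=.*$", "\\1", headers, perl = TRUE)
  # The NCBI taxonomy ID is the run of digits after "OX="
  taxon_id <- sub("^.* OX=([0-9]+).*$", "\\1", headers)
  data.frame(
    accession = accession,
    organism = organism,
    taxon_id = as.integer(taxon_id),
    stringsAsFactors = FALSE
  )
}

# Illustrative headers; the second one is invented for this example
example_headers <- c(
  ">sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus OX=9986 GN=GOT2 PE=1 SV=2",
  ">tr|A0A000XYZ1|A0A000XYZ1_ECOLX Example protein OS=Escherichia coli OX=562 GN=exA PE=4 SV=1"
)
parsed <- parse_uniprot_header(example_headers)
print(parsed)
```

For real FASTA files, a parser such as Biostrings::readAAStringSet (Biostrings is listed among the dependencies) would supply the header strings; the regular expressions above only cover the fields shown.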

Data processing and structure

  • DIA-NN → QFeatures: turn a DIA-NN parquet report into a QFeatures object with precursors, peptides, and protein groups.
  • Build QFeatures from sample annotations and multiple count matrices.
  • Replace zeros with NA, add log2-imputed assays, and normalize protein abundance to species level.
  • Taxonomy matrices: join DIA-NN output with FASTA and taxonomy to produce per-taxon count matrices.
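The zero-to-NA and log2 steps listed above can be sketched in base R on a plain matrix. This is a minimal illustration on made-up intensities, not conduitR's implementation; the package operates on QFeatures assays, and its function names and imputation method are not shown here.

```r
# Toy intensity matrix (made-up values): rows are protein groups, columns are samples
intensities <- matrix(
  c(1024,   0,  256,
       0, 512, 2048),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("pg1", "pg2"), c("s1", "s2", "s3"))
)

# Treat zeros as missing values rather than as measured absence
intensities[intensities == 0] <- NA

# log2-transform; NA entries stay NA
log2_intensities <- log2(intensities)

# Fraction of missing values per sample, a quick check before any imputation
missing_per_sample <- colMeans(is.na(log2_intensities))
print(log2_intensities)
print(missing_per_sample)
```

Converting zeros to NA before the log transform avoids -Inf values and leaves the choice of imputation strategy to a later, explicit step.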

Statistical analysis

  • Limma: design matrix, contrast testing, and empirical Bayes moderation for differential expression.
  • ORA & GSEA: over-representation and gene set enrichment with custom term–gene mappings (e.g. GO, species).
  • Classification/regression: LASSO, random forest, XGBoost with optional tuning; confusion matrix, ROC, precision–recall, feature importance.
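To make the over-representation idea concrete, the sketch below computes a one-sided hypergeometric p-value for a single term in base R. The helper ora_p_value and the toy protein IDs are assumptions for illustration only; conduitR's own ORA functions, their arguments, and any multiple-testing correction they apply are not reproduced here.

```r
# Hypothetical helper (not conduitR's API): ORA p-value for one term.
# hits:         significant proteins (e.g. from a differential analysis)
# term_members: proteins annotated to the term (e.g. a GO term or a species)
# universe:     all proteins that were quantified and could have been hits
ora_p_value <- function(hits, term_members, universe) {
  hits <- intersect(hits, universe)
  term_members <- intersect(term_members, universe)
  k <- length(intersect(hits, term_members))  # hits annotated to the term
  K <- length(term_members)                   # term size within the universe
  n <- length(hits)                           # number of hits
  N <- length(universe)                       # universe size
  # P(X >= k) when drawing n proteins without replacement from N,
  # of which K belong to the term
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)
}

# Toy data: 20 proteins, a 5-protein term, 5 hits of which 4 fall in the term
universe <- paste0("prot", 1:20)
term_members <- paste0("prot", 1:5)
hits <- paste0("prot", c(1, 2, 3, 4, 11))

p <- ora_p_value(hits, term_members, universe)
print(p)
```

When many terms are tested, the resulting p-values would normally be adjusted, for example with p.adjust(p_values, method = "BH").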

Visualization

  • Volcano plots, heatmaps (static and interactive), PCA biplots.
  • Taxonomic heat trees and sunbursts, relative abundance barplots.
  • KEGG pathway figures, feature-by-sample plots, missing-value heatmaps.
  • Conduit color palettes and themes (scale_color_conduit_d, scale_fill_conduit_c, set_plot_theme, etc.).

Utilities

  • Validate UniProt accession IDs; check API reachability (UniProt, NCBI).
  • Logging with timestamps; %!in% operator; integration with existing R workflows.
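For readers curious how these two utilities can work, here is a base R sketch. It assumes the accession pattern published in the UniProt help pages and defines a "not in" operator as the negation of %in%; conduitR's validate_uniprot_accession_ids and %!in% may be implemented differently, and is_valid_accession is a hypothetical name.

```r
# UniProt's published accession pattern, anchored so the whole string must match
uniprot_pattern <- paste0(
  "^([OPQ][0-9][A-Z0-9]{3}[0-9]|",
  "[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Hypothetical helper: TRUE where an ID has the shape of a UniProt accession.
# This checks format only, not whether the entry exists in UniProt.
is_valid_accession <- function(ids) grepl(uniprot_pattern, ids)

# A "not in" operator built by negating %in%
`%!in%` <- Negate(`%in%`)

ids <- c("P12345", "invalid_id", "A0A023GPI8")
valid <- is_valid_accession(ids)
print(valid)

# Example use of the operator: which IDs are absent from a reference set?
not_in_reference <- ids %!in% c("P12345")
print(not_in_reference)
```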

Installation

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("baynec2/conduitR")

Quick start

After installation, load the package and try a few entry points:

library(conduitR)

# Check that the UniProt API is reachable (required for downloads)
check_api_service()

# Validate UniProt IDs (no network needed)
validate_uniprot_accession_ids(c("P12345", "invalid_id", "A0A023GPI8"))

# Convert a DIA-NN parquet report to QFeatures (requires a local file)
# qf <- diann_to_qfeatures("path/to/report.parquet")
# plot_features_per_sample(qf, assay = "protein_groups")

# Run differential analysis (after building design/contrast)
# terms <- find_possible_contrast_terms(qf, "protein_groups", ~ group)
# res <- perform_limma_analysis(qf, "protein_groups", ~ group, "treatmentB - treatmentA")
# plot_volcano(res$top_table)

Function help and examples are in the built-in documentation: e.g. ?get_fasta_file, ?diann_to_qfeatures, ?perform_limma_analysis.

Dependencies

Core dependencies include QFeatures (proteomics data structures), limma (differential expression), SummarizedExperiment, Biostrings, httr2, KEGGREST, rentrez, tidyr, dplyr, ggplot2, plotly, metacoder, arrow, and others for specific features. See the DESCRIPTION file for the full list.

Documentation

  • In R: ?function_name for any exported function; many have runnable or \dontrun examples.
  • Conduit workflow: conduit.
  • Conduit-GUI: conduit-GUI.

License

MIT License; see LICENSE for details.

About

R package powering both conduit-ascent and conduit-summit by providing core functionality and data analysis tools.
