AURORA-MITO ETL pipeline

A reproducible ETL (Extract–Transform–Load) pipeline for building a local, queryable mirror of PubMed abstracts and PubTator chemical annotations.

This project uses an LLM to extract the names of small-molecule compounds that PubMed abstracts report as inhibitors of mitochondrial complex I, then enriches them with cheminformatics metadata for downstream analysis.

This repository automates:

Data acquisition – downloads the latest PubMed XML baselines + updates and PubTator3 chemical annotations.
Transformation – parses, normalizes, and converts the XML / TSV data into tables.
Loading / Curation – prepares analysis-ready datasets for downstream machine-learning or knowledge-graph projects.
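As an illustration of the transformation step, here is a minimal sketch of turning PubTator chemical annotations into typed records. It assumes the five-column, tab-separated layout (PMID, type, concept ID, pipe-separated mentions, resource) used by the PubTator bulk files; the names ChemicalAnnotation and parse_pubtator_chemicals are hypothetical and not taken from this repository.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class ChemicalAnnotation(NamedTuple):
    pmid: int
    concept_id: str
    mentions: list[str]
    resource: str


def parse_pubtator_chemicals(lines: Iterable[str]) -> Iterator[ChemicalAnnotation]:
    """Yield one record per well-formed row; skip rows with an unexpected shape."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumed layout: PMID, type, concept ID, mentions, resource.
        if len(row) != 5 or not row[0].isdigit():
            continue
        pmid, _type, concept_id, mentions, resource = row
        yield ChemicalAnnotation(int(pmid), concept_id, mentions.split("|"), resource)


# Illustrative row, not real pipeline output.
sample = "12345\tChemical\tMESH:D008687\tmetformin|Metformin\tPubTator3\n"
records = list(parse_pubtator_chemicals(io.StringIO(sample)))
print(records[0])
```

For the real file, the same function could be fed a handle from gzip.open(path, "rt", encoding="utf-8") so the archive is streamed rather than loaded into memory.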

Quick start

1. Clone the repository

git clone https://github.com/ndaniel/aurora-mito-etl.git
cd aurora-mito-etl

2. Download data

bash scripts/run_pipeline.sh

This downloads and verifies:

PubTator: chemical2pubtator3.gz
PubMed: baseline and updatefiles (*.xml.gz)

The results are stored under data/raw/.
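The verification step can be pictured with the sketch below. It assumes each PubMed archive is checked against the companion .md5 file that NCBI publishes next to it, and that the file contains a line of the form MD5(name)= digest; whether run_pipeline.sh verifies downloads this way is not stated here, and the function names are hypothetical.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Assumed .md5 line format: "MD5(<file name>)= <32 hex digits>".
MD5_LINE = re.compile(r"^MD5\((?P<name>[^)]+)\)=\s*(?P<digest>[0-9a-fA-F]{32})$")


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are never fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, md5_line: str) -> bool:
    """Return True only if the line names this file and the digests match."""
    match = MD5_LINE.match(md5_line.strip())
    if match is None or match["name"] != path.name:
        return False
    return md5_of(path) == match["digest"].lower()


# Self-contained demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.xml.gz"
    archive.write_bytes(b"not a real archive")
    ok = verify(archive, f"MD5({archive.name})= {md5_of(archive)}")
    tampered = verify(archive, f"MD5({archive.name})= {'0' * 32}")
print(ok, tampered)
```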

System requirements

This pipeline requires a GNU/Linux environment (tested on Ubuntu 24.04 LTS).

Essential command-line tools:

bash ≥ 4.0
wget
curl
awk (gawk)
parallel
xmlstarlet
pigz
iconv
uconv
gzip
ripgrep
grep (GNU)
zcat
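A quick way to confirm that these tools are available before starting a run is sketched below. The helper name missing_tools is hypothetical, the executable names are assumptions about a typical install (ripgrep is listed as rg, its usual binary name), and version constraints such as bash ≥ 4.0 are not checked.

```python
import shutil

# Executable names assumed for the tools listed above.
REQUIRED_TOOLS = [
    "bash", "wget", "curl", "awk", "parallel", "xmlstarlet", "pigz",
    "iconv", "uconv", "gzip", "rg", "grep", "zcat",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```

The which parameter exists only so the lookup can be replaced in tests; normal use relies on shutil.which.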

Python libraries are managed via requirements.txt. Recent updates to the release finalization step require the following additional packages:

requests – REST calls to PubChem and ChEMBL.
rdkit – molecular fingerprints + Tanimoto similarity scoring.
openpyxl – Excel writer backend for the per-run summary workbook.

Provenance

Each data source stores its own metadata file:

data/raw/pubtator/release_info.txt
data/raw/pubmed/release_info.txt

These include:

source URLs
fetch timestamps
file counts
last-modified headers
checksums

Together, these records make every ETL run traceable and reproducible.
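To make the idea concrete, the sketch below renders a metadata record with the fields listed above. The actual layout of release_info.txt is not documented here, so the key: value format, the field names, and the function render_release_info are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def render_release_info(source_url, checksums, last_modified, fetched_at=None):
    """Build a plain-text provenance record (hypothetical layout)."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    lines = [
        f"source_url: {source_url}",
        f"fetched_at: {fetched_at.isoformat(timespec='seconds')}",
        f"last_modified: {last_modified}",
        f"file_count: {len(checksums)}",
    ]
    # One checksum line per file, sorted so repeated runs produce identical output.
    lines += [f"checksum: {digest}  {name}" for name, digest in sorted(checksums.items())]
    return "\n".join(lines) + "\n"


# Placeholder digests and header value, for illustration only.
info = render_release_info(
    "https://ftp.ncbi.nlm.nih.gov/pubmed/",
    {"b.xml.gz": "22", "a.xml.gz": "11"},
    last_modified="unknown",
    fetched_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(info)
```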

Data Sources and Licensing

This project does not host, redistribute, or modify PubMed or PubTator content. All data are downloaded directly from the providers' official servers (NCBI for PubMed, PubTator3, and PubChem; EMBL-EBI for ChEMBL):

PubMed: https://ftp.ncbi.nlm.nih.gov/pubmed/
PubTator3: https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/
PubChem: https://pubchem.ncbi.nlm.nih.gov/
ChEMBL: https://www.ebi.ac.uk/chembl/

The use, redistribution, and citation of the data are governed by their respective providers, primarily the U.S. National Library of Medicine (NLM) and the National Center for Biotechnology Information (NCBI).

For details on licensing and reuse, refer to the terms published by each provider.

All users of this pipeline are responsible for complying with the terms and conditions of those data providers.

Processed outputs

The finalization script (scripts/finalize_realease.py) assembles the release artifacts under data/processed/<date>/ and augments the compound summary with cheminformatics diagnostics:

Pulls SMILES strings via PubChem first, then falls back to ChEMBL.
Generates RDKit Morgan fingerprints (radius 2, 2048 bits; ECFP4-equivalent) for similarity scoring.
Assigns PubMed-driven confidence bins and complementary RDKit similarity labels.
Flags biguanide-like chemistry to spotlight core pharmacophore analogs.
Exports TSV artifacts for pipelines and Excel mirrors for analyst review.
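The "PubChem first, then ChEMBL" lookup order can be sketched as an ordered fallback chain. This is not the finalization script's code: resolve_smiles is a hypothetical helper, and the two lookups are in-memory stubs standing in for the real REST calls so that the example runs offline.

```python
from typing import Callable, Optional, Sequence

Lookup = Callable[[str], Optional[str]]


def resolve_smiles(name: str, sources: Sequence[tuple[str, Lookup]]):
    """Try each source in order; return (smiles, source_label) or (None, None)."""
    for label, lookup in sources:
        try:
            smiles = lookup(name)
        except Exception:
            # A failed lookup is treated as a miss so the next source is tried.
            continue
        if smiles:
            return smiles, label
    return None, None


# Stub lookups (assumptions): real ones would query PubChem and ChEMBL.
pubchem_stub = {"metformin": "CN(C)C(=N)N=C(N)N"}.get
chembl_stub = {"phenformin": "NC(=N)NC(=N)NCCc1ccccc1"}.get

sources = [("pubchem", pubchem_stub), ("chembl", chembl_stub)]
for compound in ("metformin", "phenformin", "not-a-compound"):
    print(compound, resolve_smiles(compound, sources))
```

Returning the source label with the SMILES string keeps a record of which provider supplied each structure.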

The primary table all_mito_complex_I_inhibitors.txt now includes:

Core attributes: compound, pubmed_references, known_status, confidence_pubmed, pubmed_ids.
Similarity scores: MaxSim_all, TopKMean_all, BestRef_name, confidence_similarity.
Biguanide diagnostics: has_biguanide_core, has_biguanide_motif, sim_biguanide_tversky, sim_biguanide_dice, best_biguanide_like_tversky, best_ref_name_tversky, best_biguanide_like_dice, best_ref_name_dice.
Structural context: trailing SMILES column for downstream QSAR work.
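For readers unfamiliar with the similarity columns, the sketch below shows the standard Tanimoto, Dice, and Tversky coefficients computed on fingerprints represented as sets of on-bit indices, plus one plausible reading of MaxSim_all, TopKMean_all, and BestRef_name (best score, mean of the top k scores, and best-matching reference). The exact definitions, the Tversky weights, and the value of k used by the pipeline are not stated here, so treat these as assumptions; the authoritative definitions are in the data dictionary.

```python
def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Asymmetric similarity; alpha = beta = 1 reduces to Tanimoto."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def max_and_topk_mean(query: set, references: dict, k: int = 3):
    """Return (max similarity, mean of top-k similarities, best reference name)."""
    if not references:
        raise ValueError("at least one reference fingerprint is required")
    scores = sorted(
        ((tanimoto(query, fp), name) for name, fp in references.items()),
        reverse=True,
    )
    top = [score for score, _ in scores[:k]]
    best_score, best_name = scores[0]
    return best_score, sum(top) / len(top), best_name


# Toy fingerprints: each set holds the indices of the bits that are on.
query = {1, 2, 3, 4}
refs = {"ref_a": {1, 2, 3, 4}, "ref_b": {1, 2}, "ref_c": {5, 6}}
print(max_and_topk_mean(query, refs, k=2))
print(dice(query, refs["ref_b"]), tversky(query, refs["ref_b"], 0.0, 1.0))
```

With alpha = 0 and beta = 1, Tversky ignores bits present only in the query, which is why a reference that is a pure substructure of the query scores 1.0 in the last line.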

An Excel mirror (all_mito_complex_I_inhibitors.xlsx) is emitted alongside the TSV to simplify exploratory review.

Refer to etl/schema/DATA_DICTIONARY.md for the full column reference.
