Automated validation and submission pipeline for Metagenome-Assembled Genomes (MAGs) to the European Nucleotide Archive (ENA).
This project simplifies ENA submissions by automating:
- metadata validation
- ENA XML generation
- metadata submission via API
- assembly submission using webin-cli
- batch processing of large datasets
Clone the repository:
git clone https://github.com/NFDI4Microbiota/ena_wizard_tool.git
cd ena_wizard_tool
This project can be installed using either uv (recommended) or pip.
Installing with uv is the fastest and most reproducible setup:
pip install uv
or
curl -Ls https://astral.sh/uv/install.sh | sh
uv sync
Run commands with:
uv run python nfdi-ena-cli.py --help
Alternatively, with pip, create a virtual environment:
python -m venv .venv
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate # Windows
Install dependencies:
pip install -e .
Required tools:
- Java (JRE/JDK)
- ENA webin-cli JAR
Expected location:
App/webin-cli-9.0.1.jar
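
Before a submission run it can help to confirm that both prerequisites are present. The following is a minimal sketch, not part of the tool itself: `check_prerequisites` is a hypothetical helper name, and the default JAR path simply reuses the expected location given above.

```python
import shutil
from pathlib import Path


def check_prerequisites(jar_path="App/webin-cli-9.0.1.jar"):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper: the tool may perform these checks differently.
    """
    problems = []
    # Java must be discoverable on PATH so that `java -jar ...` can be run.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    # The webin-cli JAR is expected at a fixed location relative to the repo root.
    if not Path(jar_path).is_file():
        problems.append(f"webin-cli JAR not found at {jar_path}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"WARNING: {problem}")
```

Returning a list of problems rather than raising on the first one lets a caller report everything that is missing in a single pass.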
This repository is organized into two main components:
| Component | Description | Status |
|---|---|---|
| CLI Submission Tool | Automated validation + ENA submission | ✅ Available |
| Web Application (App/) | User-friendly interface for submission workflows | 🚧 Documentation coming soon |

⚠️ This README currently documents only the CLI submission tool. A dedicated section for the web application will be added later.
The CLI performs the full MAG submission workflow to ENA.
Validation is automatically performed using the ENA checklist XML:
- mandatory field checking
- regex validation
- enum validation
- row-level error reporting
Checklist currently supported:
ERC000047 (MAG checklist)
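
To illustrate the kinds of checks listed above, here is a minimal sketch of mandatory, regex, and enum validation with row-level error reporting. It is not the tool's `validate_dataframe` implementation: `FieldSpec` and `validate_rows` are hypothetical names, and rows are plain dictionaries rather than a DataFrame.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class FieldSpec:
    """Hypothetical per-field rule, as might be derived from a checklist XML."""
    name: str
    mandatory: bool = False
    regex: Optional[str] = None
    choices: FrozenSet[str] = frozenset()


def validate_rows(rows, specs):
    """Return (row_number, field_name, message) tuples for every violation."""
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for spec in specs:
            value = (row.get(spec.name) or "").strip()
            if not value:
                # Empty values are only an error for mandatory fields.
                if spec.mandatory:
                    errors.append((row_number, spec.name, "missing mandatory value"))
                continue
            # The whole value must match the pattern, not just a prefix.
            if spec.regex and not re.fullmatch(spec.regex, value):
                errors.append((row_number, spec.name, f"{value!r} does not match pattern"))
            if spec.choices and value not in spec.choices:
                errors.append((row_number, spec.name, f"{value!r} is not an allowed option"))
    return errors
```

Collecting all errors instead of stopping at the first makes it possible to fix a whole metadata file in one round.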
The tool automatically maps metadata entries to FASTA files:
sample_name → *.fasta.gz
Example:
sample_001.fasta.gz
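
The mapping can be pictured roughly as follows. This is a sketch assuming the file name is exactly the sample name plus `.fasta.gz`; `match_fastas` is a hypothetical name and the tool's own `collect_fastas` may behave differently.

```python
from pathlib import Path


def match_fastas(sample_names, fasta_dir, suffix=".fasta.gz"):
    """Map each sample name to its FASTA file and list the names with no file.

    Assumption: files are named <sample_name><suffix> directly inside fasta_dir.
    """
    fasta_dir = Path(fasta_dir)
    found, missing = {}, []
    for name in sample_names:
        candidate = fasta_dir / f"{name}{suffix}"
        if candidate.is_file():
            found[name] = candidate
        else:
            missing.append(name)
    return found, missing
```

Samples returned in `missing` correspond to the situation where the FASTA file name does not match `sample_name`.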
The CLI automatically:
- generates ENA-compliant XML
- creates PROJECT (if needed)
- creates SAMPLE objects
- attaches checklist attributes
- submits through WEBIN v2 API
Supported portals:
- ENA TEST
- ENA PRODUCTION
After metadata submission:
- manifest files are generated automatically
- assemblies are submitted using:
webin-cli (genome context)
Special handling:
- single-contig assemblies automatically generate chromosome lists.
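
The manifest step, including the single-contig special case, could look roughly like the sketch below. The function names are hypothetical, and the manifest field names and the three-column chromosome list layout reflect my understanding of the webin-cli genome context; check them against the ENA documentation before relying on them.

```python
import gzip
from pathlib import Path


def contig_names(fasta_gz):
    """Return the identifier of every sequence in a gzipped FASTA file."""
    with gzip.open(fasta_gz, "rt") as handle:
        return [line[1:].split()[0] for line in handle if line.startswith(">")]


def write_manifest(out_dir, study, sample, assembly_name, fasta_gz,
                   coverage, program, platform):
    """Write a genome manifest and, for single-contig assemblies, a chromosome list."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Field names are assumptions based on the webin-cli genome manifest format.
    fields = [
        ("STUDY", study),
        ("SAMPLE", sample),
        ("ASSEMBLYNAME", assembly_name),
        ("ASSEMBLY_TYPE", "Metagenome-Assembled Genome (MAG)"),
        ("COVERAGE", coverage),
        ("PROGRAM", program),
        ("PLATFORM", platform),
        ("FASTA", str(fasta_gz)),
    ]
    names = contig_names(fasta_gz)
    if len(names) == 1:
        # One line per sequence: object name, chromosome name, chromosome type.
        chromosome_list = out_dir / f"{assembly_name}_chromosome_list.txt.gz"
        with gzip.open(chromosome_list, "wt") as handle:
            handle.write(f"{names[0]}\t1\tChromosome\n")
        fields.append(("CHROMOSOME_LIST", str(chromosome_list)))
    manifest = out_dir / f"{assembly_name}.manifest"
    manifest.write_text("".join(f"{key}\t{value}\n" for key, value in fields))
    return manifest
```

Counting the `>` header lines is enough to tell a single-contig assembly from a multi-contig one without loading sequences into memory.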
All submission logs are saved under:
logs/
Generated files:
log_<batch>.xml
success.txt
error.txt
Input metadata format:
TSV (tab-separated)
Example:
example.tsv
| Field | Description |
|---|---|
| sample_name | Unique sample identifier |
| organism | Scientific organism name |
| tax_id | NCBI taxonomy ID |
| genome coverage | Sequencing depth |
| platform | Sequencing platform |
| assembly software | Assembly software used |
Additional columns are automatically added as ENA sample attributes.
Example structure:
fasta/
├── sample1.fasta.gz
└── sample2.fasta.gz
To create a new study and submit to it:
python nfdi-ena-cli.py \
--metadata example.tsv \
--fasta-dir fasta \
--ena-user "your_username" \
--ena-password "your_password" \
--study-name "study example" \
--study-title "title for the study" \
--study-description "description for the study"
To submit to an existing study, pass its accession instead:
python nfdi-ena-cli.py \
--metadata example.tsv \
--fasta-dir fasta \
--ena-user USER \
--ena-password PASS \
--study-accession PRJEBXXXX
Default portal:
test
To submit to production:
--portal prod
Metadata TSV
↓
Checklist validation
↓
FASTA matching
↓
ENA XML generation
↓
Metadata submission (WEBIN API)
↓
Manifest generation
↓
webin-cli assembly submission
The CLI automatically:
- injects required ENA fields
- appends user-defined columns as SAMPLE_ATTRIBUTES
- ignores reserved internal columns
This allows metadata extension without code modification.
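
As an illustration of this behaviour, the sketch below builds one SAMPLE element from a metadata row using the standard library. It is not the tool's actual generator: `build_sample_xml` and the `RESERVED` set are hypothetical, and which columns the tool really treats as reserved is an assumption here.

```python
import xml.etree.ElementTree as ET

# Assumption: these core columns map to dedicated XML elements, not attributes.
RESERVED = {"sample_name", "organism", "tax_id"}


def build_sample_xml(row, checklist="ERC000047"):
    """Build an ENA-style SAMPLE element from one metadata row (a dict)."""
    sample = ET.Element("SAMPLE", alias=row["sample_name"])
    ET.SubElement(sample, "TITLE").text = row["sample_name"]
    name = ET.SubElement(sample, "SAMPLE_NAME")
    ET.SubElement(name, "TAXON_ID").text = str(row["tax_id"])
    ET.SubElement(name, "SCIENTIFIC_NAME").text = row["organism"]
    attributes = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")

    def add_attribute(tag, value):
        attribute = ET.SubElement(attributes, "SAMPLE_ATTRIBUTE")
        ET.SubElement(attribute, "TAG").text = tag
        ET.SubElement(attribute, "VALUE").text = value

    # Injected field: ties the sample to the checklist it was validated against.
    add_attribute("ENA-CHECKLIST", checklist)
    # Every other non-empty column becomes a free-form sample attribute.
    for column, value in row.items():
        if column in RESERVED or value in (None, ""):
            continue
        add_attribute(column, str(value))
    return sample
```

Calling `ET.tostring(build_sample_xml(row), encoding="unicode")` gives the serialized element, so adding a new column to the TSV shows up as a new TAG/VALUE pair with no code change.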
Large submissions are automatically split:
batch_size = 1000 samples
Each batch generates independent logs.
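
Splitting into fixed-size batches can be sketched as follows. `iter_batches` is a hypothetical helper; only the default of 1000 is taken from the text above.

```python
from itertools import islice


def iter_batches(items, batch_size=1000):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    samples = [f"sample_{i:04d}" for i in range(2500)]
    for number, batch in enumerate(iter_batches(samples), start=1):
        # Each batch could be submitted and logged independently.
        print(f"batch {number}: {len(batch)} samples")
```

Because the helper works on any iterable, it never needs to hold more than one batch in memory at a time.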
Credentials are passed via CLI arguments:
--ena-user
--ena-password
Recommended usage:
export ENA_USER=xxx
export ENA_PASS=xxx
Missing FASTA files for: sample_X
Cause:
- FASTA filename does not match sample_name.
Common causes of metadata validation errors:
- invalid date format
- wrong numeric format
- ontology formatting errors
Check:
logs/error.txt
Typical causes of submission failures:
- invalid manifest fields
- ENA temporary API issues
- missing metadata values
Main internal functions:
| Function | Purpose |
|---|---|
| load_fields_from_xml | Parse ENA checklist |
| validate_dataframe | Metadata validation |
| collect_fastas | FASTA mapping |
| build_and_submit | Submission engine |
The App/ directory contains the web application.
Documentation to be added:
- architecture overview
- local run instructions
- deployment guide
- user workflow

Planned improvements:
- Full web interface documentation
- Interactive metadata validation
- Submission progress tracking
- Improved error visualization
- Support for multiple ENA checklists
- MIXS package auto-detection
- Ontology live validation (ENVO/CHEBI)
- Parallelized submission engine

Useful links:
- ENA Submission Portal: https://www.ebi.ac.uk/ena/browser/submit
- ENA Checklist ERC000047: https://www.ebi.ac.uk/ena/browser/view/ERC000047
- MIXS Standard: https://www.nature.com/articles/nbt1366
- MIXS Term Browser: https://w3id.org/mixs/

Intended users:
- Metagenomics researchers
- Bioinformatics pipelines
- Large-scale MAG submission projects
- Institutional data submission workflows