FANTASIA Lite V1 is a streamlined, standalone version of the full FANTASIA pipeline, designed for fast and efficient Gene Ontology (GO) annotation of protein sequences from local FASTA files using embedding comparisons.
FANTASIA Lite generates deep learning embeddings and performs nearest-neighbor annotation transfer, while intentionally removing the service stack used by the full project. The bundled lookup table covers multiple reference embedding spaces, while the built-in Lite embedder focuses on the models currently exposed directly through the local CLI.
The simplest way to run it is:

```bash
./scripts/minimal_pipeline.sh your_proteins.fa
```

Before any annotation run can work, the required lookup bundle must be downloaded from Zenodo and placed in `data/lookup/`.
Unlike the full FANTASIA pipeline, FANTASIA Lite does not require PostgreSQL, RabbitMQ, or a database-backed orchestration layer. It runs locally from flat files:
- `lookup_table.npz` for reference embeddings
- `annotations.json` for GO annotations
- `accessions.json` for accession mapping
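To make the roles of the three flat files concrete, here is a minimal sketch with toy stand-in data. The JSON layout, field names, and dimensions are illustrative assumptions, not the bundle's exact schema; only the `Prot-T5_embeddings` key name follows the naming used elsewhere in this document.

```python
import io
import json

import numpy as np

# Toy stand-ins for the three flat files. The real bundle is downloaded
# from Zenodo; structures here are assumptions for illustration only.
buf = io.BytesIO()
np.savez(buf, **{"Prot-T5_embeddings": np.eye(3, 4, dtype=np.float32)})
buf.seek(0)
lookup = np.load(buf)  # lookup_table.npz: one array per model space

accessions = json.loads('{"0": "P11111", "1": "P22222", "2": "P33333"}')
annotations = json.loads('{"P22222": [{"go_id": "GO:0003824", "category": "F"}]}')

# A query embedding is matched against the reference space...
query = np.array([0.1, 1.0, 0.0, 0.0], dtype=np.float32)
refs = lookup["Prot-T5_embeddings"]
hit = int(np.linalg.norm(refs - query, axis=1).argmin())

# ...then the hit index resolves to an accession, which expands to GO terms.
acc = accessions[str(hit)]
print(acc, annotations.get(acc, []))
```

This mirrors the pipeline's division of labor: the `.npz` answers "which reference is closest", and the two JSON files turn that answer into identifiers and annotations.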
The tradeoff is that Lite is simpler to deploy but can be slower if embedding and lookup are not tuned well. This repository now includes GPU-aware embedding and lookup controls intended to recover as much performance as possible without reintroducing external services.
The main purpose of Lite V1 is to provide a fast local annotator that can be dropped into other pipelines with minimal setup. The default fast path is intentionally simple: ProtT5 embeddings, cosine lookup, and k=1 transfer. More advanced configuration is still available when you need multi-model runs, layered embedding export, larger k, or other custom settings.
This repository is ideal for users who want:
- Lightweight, local annotation of protein FASTA files
- No external database dependencies
- No PostgreSQL or RabbitMQ setup
- Simple setup and automated environment management
- High-quality functional annotation using experimental evidence from UniProt
For advanced features, large-scale annotation, or integration with external databases, see the full FANTASIA repository.
Compared with the earlier Lite V0 branch, Lite V1 includes several practical improvements:
- Faster end-to-end execution through batched embedding inference and a merged one-pass lookup flow
- In-process lookup execution instead of launching a separate lookup subprocess
- Reuse mode for rerunning lookup from an existing embeddings archive with `--reuse-embeddings`
- Optional TopGO export, now disabled by default unless `--topgo` is requested
- Clearer GPU tuning controls for embedding and lookup batch sizes
- Full-length embeddings by default, with truncation disabled unless explicitly requested
- Better reporting of skipped sequences and end-of-run coverage summaries
- Embedding-only workflows via `src/generate_embeddings.py`, separate from lookup
- Optional export of selected layers or all available layers from the Lite embedder
- Helper scripts for a default last-layer end-to-end run and an embedding-only layered export run
- Updated documentation that explains the separation between embedding support and lookup-bundle coverage
- Python 3.10 or newer (the pipeline automatically creates and manages virtual environments)
- Required lookup bundle (`lookup_table.npz`, `annotations.json`, `accessions.json`) from Zenodo record 19742926, placed in `data/lookup/` before running the pipeline
- Internet connection for automatic dependency installation
- Sufficient disk space for outputs and embeddings (approximately 1-2 GB per run)
- Git (for cloning the repository)
- `wget` or `curl` (for downloading the lookup bundle)
The FANTASIA Lite V1 lookup table is built from the UniProt November 2025 release and includes only proteins with experimental evidence, ensuring high-quality functional annotations. All data was generated using PIS v3.1.0, the internal system used to extract and preprocess UniProt, PDB, and GOA data.
Lookup bundle Zenodo DOI: 10.5281/zenodo.19742926
Use this DOI to cite the lookup table or to access the official download page.
- Reference entries in `accessions.json`: 124,397
- Annotated reference entries in `annotations.json`: 124,397
- Total GO annotation rows in `annotations.json`: 621,024
- ESM-2: 124,363 embeddings
- ProstT5: 124,248 embeddings
- ProtT5-XL-UniRef50: 124,239 embeddings
- Ankh3-Large: 124,338 embeddings
- ESM3c: 124,397 embeddings
These are the measured per-model embedding counts in the current lookup bundle. `annotations.json` and `accessions.json` both contain 124,397 entries, while the individual model spaces in `lookup_table.npz` can differ slightly in coverage.
The lookup bundle (`fantasia_lite_data_folder.tgz`) contains three essential files:

- `lookup_table.npz`
  - Precomputed protein embeddings for ESM-2, ProstT5, ProtT5, Ankh3-Large, and ESM3c
  - Last-layer compressed embeddings for all reference sequences
  - Enables fast nearest-neighbor search during annotation
  - Format: NumPy `.npz` archive
- `annotations.json`
  - GO annotations of the reference proteins
  - Experimentally supported GO terms by category:
    - F: Molecular Function
    - P: Biological Process
    - C: Cellular Component
  - Format: JSON mapping from proteins to their GO terms
- `accessions.json`
  - Mapping of internal indices to UniProt accessions
  - Contains UniProt ID, metadata, and sequence length
  - Allows the pipeline to retrieve source identifiers
  - Format: JSON list/dict
- `lookup_table.npz` is the reference embedding space. It does not contain GO terms by itself; it stores the precomputed vectors that query embeddings are compared against.
- `annotations.json` is the annotation layer. After the nearest reference proteins are found, this file is used to expand those hits into GO terms, GO categories, descriptions, and evidence codes.
- `accessions.json` is the identifier layer. It maps the internal lookup-table positions back to UniProt accessions and metadata.
- Lookup is always performed against the full reference table for the selected model space. `--lookup-batch-size` controls how many query embeddings are processed together, not how many reference proteins are searched.
- In Lite, `k` is controlled by `--limit-per-entry`. With `k=1`, the pipeline keeps only the top reference hit per query. With larger `k`, it keeps more nearest neighbors before GO-term consolidation.
- `results.csv` is the consolidated final annotation table. It keeps the best supporting row for each `(query, GO term, category)` combination.
- `raw_results.csv` is optional and preserves the neighbor-level rows before GO-term consolidation. It is especially useful when `k > 1`.
- The default fast Lite V1 path is last-layer lookup with cosine distance, ProtT5 embeddings, and `k=1`. Layered embedding export is separate from lookup.
- Layer selection is intended for embedding-only workflows. The bundled lookup workflow uses the standard last-layer embeddings and does not switch lookup behavior based on `--layer-indices`.
The lookup table includes only high-confidence experimental annotations:
- EXP — Inferred from Experiment
- IDA — Inferred from Direct Assay
- IPI — Inferred from Physical Interaction
- IMP — Inferred from Mutant Phenotype
- IGI — Inferred from Genetic Interaction
- IEP — Inferred from Expression Pattern
- TAS — Traceable Author Statement
- IC — Inferred by Curator
No database server or external dependencies are required.
FANTASIA Lite V1 is designed first as a fast local annotator that is easy to integrate into other pipelines, while still supporting more configurable research-style runs when needed.
The easiest default path is:
```bash
./scripts/minimal_pipeline.sh your_proteins.fa
```

This script keeps the intended Lite defaults explicit:

- `prot_t5`
- cosine lookup
- `k=1`
- full precision (`float32`)
- automatic CPU/GPU selection

In Lite, `k` is controlled by `--limit-per-entry`.
Beyond that, Lite V1 provides two main Python entrypoints:
For processing protein sequences and obtaining GO annotations:
Lite V1 keeps embedding and lookup in full precision (float32) for comparability with the full FANTASIA workflow. Mixed precision is not enabled by default.
```bash
# Basic usage - single model annotation
fantasia_pipeline your_proteins.fa

# Recommended general usage with a single model
fantasia_pipeline --serial-models --embed-models prot_t5 your_proteins.fa

# Recommended fast GPU usage without sequence truncation
fantasia_pipeline \
    --device cuda \
    --use-gpu-lookup \
    --serial-models \
    --embed-models prot_t5 \
    --sequence-queue-package 100 \
    --embed-batch-size 4 \
    --model-batch-sizes prot_t5=4 \
    --lookup-batch-size 1024 \
    your_proteins.fa

# Multiple models (slower but more comprehensive; keep --serial-models enabled)
fantasia_pipeline --serial-models --embed-models "prot_t5 ankh3" your_proteins.fa

# Advanced configuration
fantasia_pipeline \
    --embed-models prot_t5 \
    --limit-per-entry 5 \
    --results-csv my_results.csv \
    your_proteins.fa
```

Three ready-to-run helper scripts are included under `scripts/`:
These helper scripts use paths relative to the repository root. Run them from inside the cloned FANTASIA-Lite directory. If needed, you can override the Python interpreter with PYTHON_BIN=/path/to/python.
- `scripts/minimal_pipeline.sh`

  This is the main Lite V1 entrypoint for fast annotation and pipeline integration. It runs the standard end-to-end workflow with the intended Lite defaults:

  - ProtT5 only
  - cosine distance
  - `k=1`
  - automatic device detection
  - full precision
  - no truncation unless you explicitly request it in another command

  Example:

  ```bash
  ./scripts/minimal_pipeline.sh fasta_test/PRUB1_longiso.pep
  ```

- `scripts/default_last_layer.sh`

  Runs the standard end-to-end Lite workflow with the current recommended tuned settings for a 24 GB-class GPU:

  - ProtT5 only
  - last layer only
  - GPU lookup enabled
  - full-precision embeddings
  - TopGO disabled by default

  Example:

  ```bash
  ./scripts/default_last_layer.sh fasta_test/PRUB1_longiso.pep
  ```

- `scripts/embedding_only_with_layers.sh`

  Runs embedding generation only, with no lookup, and exports the default last-layer embeddings plus either:

  - all transformer layers when no extra layer arguments are given
  - selected layers when you pass them explicitly

  For ProtT5, layer numbering follows the model outputs directly: `0` is the earliest exported hidden-state representation, while `24` is the final encoder layer. The standard last-layer embedding is also always written separately as `Prot-T5_embeddings`. For Ankh3, the same ordering rule applies: `0` is the earliest exported hidden-state representation and `48` is the final exported layer. The standard last-layer embedding is also written separately as `Ankh3-Large_embeddings`. These layer options should be used in embedding-only mode, not as a replacement for the default lookup workflow.

  Examples:

  ```bash
  # Export all available layers
  ./scripts/embedding_only_with_layers.sh fasta_test/PRUB1_longiso.pep outputs_layers.npz

  # Export only selected layers
  ./scripts/embedding_only_with_layers.sh fasta_test/PRUB1_longiso.pep outputs_layers.npz 0 8 16 24
  ```

Key Options:
- `--embed-models`: Choose models (`prot_t5`, `ankh3`) - default: `prot_t5`
- `--serial-models`: Process requested models one after another instead of as one combined model-group. For a single model, this usually makes no practical difference. For more than one model, it is the recommended and safer setting because it reduces GPU/CPU memory pressure and avoids loading multiple embedders at the same time.
- `--limit-per-entry N`: Return top N annotations per sequence (default: 1)
- `--raw-results-csv PATH`: Optional raw neighbor-level output before GO-term consolidation. If omitted, Lite writes only `results.csv` unless `--limit-per-entry > 1`, in which case a raw file is created automatically as `k.<N>.results.csv`
- `--topgo`: Optional. Generate TopGO files after lookup. TopGO export is disabled by default
- `--distance-metric {cosine,euclidean}`: Lookup metric. Lite V1 defaults to `cosine`, but `euclidean` is also supported
- `--use-gpu-lookup`: Force GPU nearest-neighbor lookup when CUDA is available
- `--lookup-batch-size N`: Number of query embeddings compared per lookup batch
- `--sequence-queue-package N`: Outer packaging size before embedding forward passes
- `--embed-batch-size N`: Default forward-pass batch size for embeddings
- `--model-batch-sizes MODEL=N ...`: Per-model embedding batch size overrides
- `--length-filter N`: Optional truncation before embedding; `0` disables truncation and is the default
FANTASIA Lite V1 is installed by cloning the repository, placing the lookup bundle in data/lookup/, and running a simple setup check from inside the cloned folder.
```bash
git clone https://github.com/CBBIO/FANTASIA-Lite.git
cd FANTASIA-Lite
```

FANTASIA Lite V1 is the current default branch of this repository. The previous Lite V0 state remains available in the `fantasia-lite-V0` branch.
Download the Lite lookup bundle from Zenodo and place these files in `data/lookup/`:
- `lookup_table.npz`
- `annotations.json`
- `accessions.json`
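Before the first run, it can be useful to confirm the three files actually landed in `data/lookup/`. The following is a small illustrative helper, not part of the Lite codebase:

```python
from pathlib import Path

REQUIRED = ("lookup_table.npz", "annotations.json", "accessions.json")

def missing_bundle_files(repo_root: str = ".") -> list:
    """Return the required lookup-bundle files not yet present in data/lookup/."""
    lookup_dir = Path(repo_root) / "data" / "lookup"
    return [name for name in REQUIRED if not (lookup_dir / name).is_file()]

if __name__ == "__main__":
    missing = missing_bundle_files()
    print("Lookup bundle complete" if not missing
          else "Missing: " + ", ".join(missing))
```

Run it from the repository root; any name it prints still needs to be extracted from the Zenodo archive.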
The recommended validation step for Lite V1 is:

```bash
./scripts/minimal_pipeline.sh fasta_test/test.fa
```

This setup check:

- creates or reuses the local virtual environment
- installs the required dependencies automatically
- runs the current Lite V1 default path on the small bundled test FASTA
- writes results to a timestamped `outputs_YYYYMMDD_HHMMSS/` directory
- confirms that embedding, lookup, and result writing are working correctly
Expected output: if everything works correctly, you'll see:
- Virtual environment creation (first run)
- Dependency installation progress
- Embedding generation progress bar
- Annotation results written to the output directory
- Success message
Time estimate:
- First run: a few minutes to about 10-20 minutes, depending on dependency installation and model download state
- Subsequent runs: 1-2 minutes (only processes the test file)
The easiest way to run Lite V1 in its intended default mode is:

```bash
./scripts/minimal_pipeline.sh your_proteins.fa
```

This minimal runner is the recommended entrypoint for fast local annotation and for integration into larger workflows. It uses:

- automatic device detection
- `prot_t5`
- cosine lookup
- `k=1`
- full precision
- no sequence truncation unless you explicitly request it elsewhere

You can still call the full pipeline directly when you want more control:

```bash
fantasia_pipeline --serial-models --embed-models prot_t5 your_proteins.fa
```

FANTASIA Lite automatically detects whether CUDA is available:
- If an NVIDIA GPU is detected, the pipeline defaults to `cuda`
- Otherwise it falls back to `cpu`
This detection is performed automatically by `fantasia_pipeline.py`, so most users do not need to set `--device` manually.
You can still override it explicitly:
```bash
# Force GPU
fantasia_pipeline --device cuda your_proteins.fa

# Force CPU
fantasia_pipeline --device cpu your_proteins.fa
```

FANTASIA Lite is intentionally simpler than the full FANTASIA stack, but that simplicity means performance depends heavily on the embedding and lookup settings.
- `--sequence-queue-package`

  Outer orchestration size. This controls how many input sequences are grouped into one work package before embedding. It does not change search completeness.

- `--embed-batch-size`

  Default PLM forward-pass batch size. This controls how many sequences are embedded together on the GPU or CPU.

- `--model-batch-sizes prot_t5=4 ankh3=4`

  Per-model overrides for embedding forward-pass batch size. These matter only for models you actually run. The best value depends on GPU memory and sequence lengths: if you see CUDA OOM skips in `failed_sequences.csv`, lower the relevant model batch size; if you see no skips and have spare memory, you can try increasing it.

- `--lookup-batch-size`

  Number of query embeddings processed together during nearest-neighbor search. Each lookup batch is still compared against the full reference lookup table. It does not search only the first `N` references and then stop.
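The batching semantics can be illustrated with a small NumPy sketch. This is a toy model of the behavior, not Lite's actual lookup code: each query batch is scored against every reference vector, so the batch size only trades memory for throughput, never completeness.

```python
import numpy as np

def cosine_nn(queries: np.ndarray, refs: np.ndarray, batch_size: int = 1024):
    """Toy batched nearest-neighbor search (illustrative sketch only).

    Each batch of queries is compared against the FULL reference table,
    mirroring how --lookup-batch-size works: it sizes the query batches;
    it never limits how many references are searched.
    """
    refs_n = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    best_idx = np.empty(len(queries), dtype=np.int64)
    best_dist = np.empty(len(queries), dtype=np.float32)
    for start in range(0, len(queries), batch_size):
        q = queries[start:start + batch_size]
        q_n = q / np.linalg.norm(q, axis=1, keepdims=True)
        dist = 1.0 - q_n @ refs_n.T  # cosine distance to every reference
        best_idx[start:start + len(q)] = dist.argmin(axis=1)
        best_dist[start:start + len(q)] = dist.min(axis=1)
    return best_idx, best_dist

rng = np.random.default_rng(0)
refs = rng.normal(size=(100, 16)).astype(np.float32)
# Queries are slightly perturbed copies of references 3, 42, and 7,
# so each query should recover its source reference as the top hit.
queries = refs[[3, 42, 7]] + 0.01 * rng.normal(size=(3, 16)).astype(np.float32)
idx, dist = cosine_nn(queries, refs, batch_size=2)
print(idx)
```

Because the search is exhaustive per batch, a batch size of 2 and a batch size of 1024 return identical hits; only speed and memory differ.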
By default, sequence truncation is disabled:

```bash
--length-filter 0
```

That means FANTASIA Lite will try to embed the full sequence. If the GPU can handle it, the full sequence is used. If you want to cap sequence length for safety or speed, set a positive value such as `--length-filter 2000`.
For the fastest general GPU run without truncation:
```bash
fantasia_pipeline \
    --device cuda \
    --use-gpu-lookup \
    --serial-models \
    --embed-models prot_t5 \
    --sequence-queue-package 100 \
    --embed-batch-size 4 \
    --model-batch-sizes prot_t5=4 \
    --lookup-batch-size 1024 \
    your_proteins.fa
```

This is the recommended starting profile for FANTASIA Lite on 24 GB-class GPUs when you want full-length embeddings and the fastest practical proteome-scale runtime. On a 20,223-sequence PRUB1 proteome, this profile completed end-to-end in about 24 minutes with GPU lookup enabled and processed 99.72% of proteins successfully.
At the end of each run, the pipeline prints a processed/skipped summary such as:
```
Sequence summary: 20167/20223 processed (99.72%), 56/20223 skipped (0.28%).
```
Skipped proteins are reported in `failed_sequences.csv` and are typically extreme long-sequence CUDA OOM cases rather than ordinary proteins.
For smaller GPUs:
```bash
fantasia_pipeline \
    --device cuda \
    --use-gpu-lookup \
    --serial-models \
    --embed-models prot_t5 \
    --sequence-queue-package 50 \
    --embed-batch-size 4 \
    --model-batch-sizes prot_t5=4 \
    --lookup-batch-size 256 \
    your_proteins.fa
```

For CPU-only systems:
```bash
fantasia_pipeline \
    --device cpu \
    --serial-models \
    --embed-models prot_t5 \
    your_proteins.fa
```

For benchmarking, performance testing, and systematic analysis:
```bash
# Basic timing analysis - processes all test files with the default model
python3 src/pipeline_timing_analyzer.py

# Quick test with specific file and single model
python3 src/pipeline_timing_analyzer.py \
    --files fasta_test/test.fa \
    --model prot_t5 \
    --report-csv quick_benchmark.csv

# Compare models on specific files
python3 src/pipeline_timing_analyzer.py \
    --files fasta_test/test.fa fasta_test/UP000001940_6239.fasta \
    --model prot_t5 ankh3 \
    --report-csv model_comparison.csv

# Custom analysis with all options
python3 src/pipeline_timing_analyzer.py \
    --fasta-dir fasta_test \
    --model ankh3 \
    --files fasta_test/test.fa \
    --report-csv gpu_benchmark.csv
```

Key Options:

- `--files`: Specify individual FASTA files to process
- `--model`: Choose specific model(s) to test (default: `prot_t5`)
- `--report-csv`: Output file for timing results (default: `pipeline_timing_analysis.csv`)
- `--fasta-dir`: Directory containing FASTA files (default: `fasta_test`)
```
FANTASIA-Lite/
├── README.md                     # This documentation
├── LICENSE                       # License information
├── .gitignore                    # Git ignore rules
├── data/
│   └── lookup/                   # Lookup database (download from Zenodo)
│       ├── accessions.json       # Protein accession mappings
│       ├── annotations.json      # GO annotation data
│       └── lookup_table.npz      # Pre-computed embeddings database
├── fasta_test/                   # Test FASTA files for validation and benchmarking
│   ├── test.fa                   # Small test file (33 sequences)
│   ├── test_failure.fa           # Test file with problematic sequences
│   ├── PRUB1_longiso.pep         # Paratomella rubra proteome (non-model worm, not represented in standard databases)
│   ├── UP000001940_6239.fasta    # C. elegans proteome sample
│   └── MUSM_10090.fasta          # Mouse proteome sample used for Lite V1 benchmarking
├── fantasia_pipeline.py          # Main annotation pipeline
├── fantasia_no_db.py             # Core lookup engine
├── generate_embeddings.py        # Embedding generation module
└── pipeline_timing_analyzer.py   # Performance analysis and benchmarking tool
```
The repository includes comprehensive test files for validation and benchmarking:
- `test.fa`: Small test file with 33 valid sequences for quick validation
- `test_failure.fa`: Contains problematic sequences to test error handling
- `PRUB1_longiso.pep`: Proteome of the non-model worm Paratomella rubra, useful as a realistic dark-proteome style test case outside standard database-centric examples
- `UP000001940_6239.fasta`: Complete C. elegans proteome for realistic testing
- `MUSM_10090.fasta`: Mouse proteome sample used in the revalidated Lite V1 benchmark set
Each pipeline run creates a timestamped directory containing:
- `results.csv`: Main GO annotation results after GO-term consolidation and best-row selection
- `raw_results.csv` or `k.<N>.results.csv`: Optional raw neighbor-level results before GO-term consolidation
- `query_embeddings.npz`: Generated embeddings for input sequences
- `failed_sequences.csv`: Sequences that failed processing with error details
- `fantasia_config.yaml`: Configuration used for the run
- `run_metadata.yaml`: Timestamped run metadata for traceability, including resolved parameters and output paths
- `run.log`: Timestamped pipeline log capturing console output from the run
- `topgo/`: Optional TopGO-compatible files for downstream analysis when `--topgo` is enabled
  - `<model>.topgo.<F|P|C>.txt`: GO terms by functional category (Function/Process/Component)
- `results.csv`: CSV table. This is the main file most users want. Each row is a final GO annotation kept after consolidation, so it is the best place to inspect the final biological interpretation of the run.
- `raw_results.csv` or `k.<N>.results.csv`: CSV table. This is the pre-consolidation lookup output. Use it when you want to inspect neighbor structure, `hit_rank`, or the effect of `k > 1`.
- `query_embeddings.npz`: NumPy archive. This stores the generated query embeddings and is mainly useful for reuse, benchmarking, or embedding-only downstream work rather than manual inspection.
- `failed_sequences.csv`: CSV table. This lists the sequences that were skipped or failed, together with the error reason. In full-length GPU runs, these are often extreme long-sequence CUDA OOM cases.
- `fantasia_config.yaml`: YAML text file. This captures the effective run configuration and is the main provenance file for reproducing a run.
- `run_metadata.yaml`: YAML text file. This records traceability metadata such as timestamps, resolved parameters, output paths, summary counts, and stage timings.
- `run.log`: Plain-text log. This is the chronological console transcript of the run and is the best file to inspect when debugging behavior or checking progress details after the fact.
- `topgo/`: Directory of plain-text TopGO helper files. These are only created when `--topgo` is enabled and are meant for downstream enrichment workflows, not for primary lookup interpretation.
`results.csv` vs `raw_results.csv`
- `raw_results.csv` preserves the direct lookup output. It keeps the selected nearest-neighbor hit structure, including `hit_rank`, before GO-term consolidation. If one reference hit carries multiple GO terms, or if `--limit-per-entry` is greater than `1`, this file can contain multiple rows for the same query and hit.
- `results.csv` is the cleaned final annotation table. It consolidates GO terms and keeps the best supporting row for each `(query, model, GO term, category)` combination.
- If you only need final annotations, `results.csv` is usually enough. If you want to inspect the underlying hit structure or retain multiple neighbors before consolidation, keep `raw_results.csv`.
Column meanings
- `query_accession`: input sequence identifier from the FASTA file
- `hit_rank`: nearest-neighbor rank in the raw output only (`1` = best hit, `2` = second hit, etc.)
- `reference_id`: internal lookup-bundle identifier for the matched reference entry
- `model_key`: embedding space used for the match, such as `Prot-T5` or `Ankh3-Large`
- `distance`: embedding-space distance between query and matched reference; lower is better
- `reliability_index`: normalized score derived from distance and clipped to `[0, 1]`; higher is better
- `distance_metric`: lookup metric used for the match, usually `cosine` in Lite V1
- `uniprot_accession`: UniProt accession of the matched reference protein
- `go_id`: transferred GO identifier
- `category`: GO namespace (`F` = Molecular Function, `P` = Biological Process, `C` = Cellular Component)
- `go_description`: human-readable GO term description
- `evidence_codes`: evidence code or merged evidence codes supporting that GO term in the reference protein
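As a small worked example of consuming these columns downstream, the snippet below filters a `results.csv`-style table by `reliability_index`. The rows and threshold are fabricated for illustration; only the column names come from the description above.

```python
import csv
import io

# Hypothetical rows with a subset of the columns described above.
toy_csv = """query_accession,model_key,go_id,category,distance,reliability_index
q1,Prot-T5,GO:0003824,F,0.12,0.91
q1,Prot-T5,GO:0008150,P,0.40,0.55
q2,Prot-T5,GO:0005737,C,0.20,0.83
"""

# Keep only confident transfers, e.g. reliability_index >= 0.8
# (an arbitrary example cutoff, not a recommended default).
rows = list(csv.DictReader(io.StringIO(toy_csv)))
confident = [r for r in rows if float(r["reliability_index"]) >= 0.8]
for r in confident:
    print(r["query_accession"], r["go_id"], r["category"])
```

For a real run, replace the in-memory string with `open("results.csv")` inside the output directory.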
- `pipeline_timing_analysis.csv`: Comprehensive performance metrics including:
  - GPU model and memory specifications
  - Runtime and processing rates
  - GPU/CPU usage information
  - Sequence processing statistics
  - Successfully processed vs failed sequences
  - GPU memory usage monitoring
  - Model comparison data
  - Timestamped pipeline output directory references
Note: Requires `nvidia-smi` for GPU monitoring (optional for CPU-only systems).
The `pipeline_timing_analyzer.py` tool provides comprehensive benchmarking capabilities:
- Hardware Comparison: Compare GPU vs CPU performance across different systems
- Model Evaluation: Systematic comparison between prot_t5 and ankh3 models
- Scalability Testing: Analyze performance across different file sizes and sequence counts
- Regression Testing: Track performance changes across pipeline versions
- Resource Monitoring: GPU memory usage and processing rate analysis
The Lite V1 pipeline changed substantially, so older benchmark matrices from earlier Lite revisions are no longer directly representative. The benchmark section below only reports runs that have been revalidated after the current V1 optimization work.
The following end-to-end benchmarks reflect the current recommended fast profile:
```bash
./scripts/default_last_layer.sh fasta_test/PRUB1_longiso.pep
./scripts/minimal_pipeline.sh fasta_test/UP000001940_6239.fasta
./scripts/minimal_pipeline.sh fasta_test/MUSM_10090.fasta
```

This corresponds to:

- `prot_t5`
- cosine lookup
- `k=1`
- full precision (`float32`)
- full-length embeddings by default
- GPU lookup enabled
- tuned batch sizes for a 24 GB-class GPU (`prot_t5=4`)
| Dataset | Sequences | Model | Runtime | Rate (seq/s) | Coverage | Failed |
|---|---|---|---|---|---|---|
| PRUB1_longiso.pep | 20,223 | ProtT5 | 22m 38.60s | 14.89 | 99.72% | 56 |
| UP000001940_6239.fasta | 19,831 | ProtT5 | 23m 33.86s | 14.03 | 99.75% | 49 |
| MUSM_10090.fasta | 21,852 | ProtT5 | 35m 43.23s | 10.20 | 99.31% | 150 |
The following comparison keeps the device, model family, and full-length setting aligned, and changes only the lookup depth:
| Dataset | k | Runtime | Rate (seq/s) | Coverage | Failed |
|---|---|---|---|---|---|
| PRUB1_longiso.pep | 1 | 22m 38.60s | 14.89 | 99.72% | 56 |
| PRUB1_longiso.pep | 5 | 23m 19.46s | 14.45 | 99.83% | 34 |
For the revalidated k=5 PRUB1 run, the recorded stage split was approximately 22m 41.35s embedding, 37.58s lookup, and 0.53s post-processing. In practice, increasing k from 1 to 5 added only a modest overhead because embedding remains the dominant cost.
Small validation runs on the first 5 sequences of fasta_test/test.fa were used to confirm that Lite V1 behaves correctly in CPU-only mode, GPU k=5 mode, and layered embedding export mode:
| Test | Device | Settings | Runtime | Notes |
|---|---|---|---|---|
| End-to-end lookup | CPU | ProtT5, cosine, `k=1` | 118.90s | Confirms Lite works without GPU |
| End-to-end lookup | GPU | ProtT5, cosine, `k=5` | 64.43s | Not directly comparable with the CPU row because the device changed |
| Embedding-only layered export | GPU | ProtT5 layers 0, 12, 24 | 12.97s | Wrote `Prot-T5_layer_0_embeddings`, `Prot-T5_layer_12_embeddings`, and `Prot-T5_layer_24_embeddings`, each with shape `(5, 1024)`; here 24 is the final encoder layer, not 0 |
| Embedding-only layered export | GPU | Ankh3 all layers | 70.58s | Wrote `Ankh3-Large_layer_0_embeddings` through `Ankh3-Large_layer_48_embeddings`; representative shapes confirmed for layer_0, layer_24, and layer_48, each `(5, 1536)` |
Notes:
- The skipped sequences were extreme long-protein CUDA OOM cases, not ordinary proteins.
- This benchmark uses the current Lite V1 merged-lookup flow rather than the older per-chunk lookup behavior.
- For the revalidated C. elegans run, the recorded stage split was approximately 22m 42s embedding, 46.46s lookup, and 1.09s post-processing.
- For the revalidated mouse run, the recorded stage split was approximately 35m 05s embedding, 30.79s lookup, and 0.86s post-processing.
- Additional benchmark tables for other datasets and hardware should be regenerated before being treated as representative of Lite V1.
The pipeline automatically manages Python virtual environments:
```bash
# Virtual environment is created automatically in venv/
# To clean up and force rebuild:
rm -rf venv/
./scripts/minimal_pipeline.sh your_file.fa  # Will recreate venv automatically
```

For processing multiple files systematically:
```bash
# Process multiple specific files
python3 src/pipeline_timing_analyzer.py \
    --files file1.fa file2.fa file3.fa \
    --model prot_t5

# Process all files in a directory
python3 src/pipeline_timing_analyzer.py \
    --fasta-dir my_proteomes/ \
    --report-csv batch_results.csv
```

The timing analyzer is useful for benchmarking and regression testing. In Lite V1 it now also reads per-run stage timings from `run_metadata.yaml`, so benchmark reports can separate embedding time, lookup time, and post-processing time.
For large files or limited memory systems:
```bash
# Use a single model, smaller chunks, and explicit serial processing
fantasia_pipeline \
    --serial-models \
    --embed-models prot_t5 \
    --chunk-size 200 \
    large_proteome.fa
```

For the current Lite V1 fast path, `prot_t5` with `--model-batch-sizes prot_t5=4` is the best starting point on a 24 GB-class GPU. If you see CUDA OOM skips, reduce the model batch size further before changing lookup settings.
In Lite V1, the embedding step and lookup step are intentionally separated:
- The embedder is the local model-inference component in `src/generate_embeddings.py`
- The lookup is the nearest-neighbor transfer step against the flat-file reference bundle in `data/lookup/`
These two layers are related, but they do not currently expose exactly the same model set through the Lite CLI.
The built-in Lite embedder currently supports:
- `prot_t5`: Protein T5 model (recommended, good balance of speed and accuracy)
- `ankh3`: ANKH large protein language model (slower but potentially more accurate)
The Lite embedder can now also export:
- default last-layer embeddings
- selected intermediate layers with `--layer-indices`
- all available layers with `--all-layers`
This embedding-only mode is independent of lookup. In other words, Lite can generate embeddings locally for the models above, with optional layer export, even when you do not want to run annotation lookup.
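A layered export can be inspected directly with NumPy. The sketch below builds a toy archive using the layer-key naming described in the helper-script section (`Prot-T5_layer_<i>_embeddings` plus the standard last-layer key) and lists the per-layer arrays; the shapes are illustrative, matching the 5-sequence validation runs reported later.

```python
import io

import numpy as np

# Toy archive mimicking a Lite layered export (keys follow the naming
# documented above; data and shapes are stand-ins for illustration).
arrays = {f"Prot-T5_layer_{i}_embeddings": np.zeros((5, 1024), dtype=np.float32)
          for i in (0, 12, 24)}
arrays["Prot-T5_embeddings"] = np.zeros((5, 1024), dtype=np.float32)

buf = io.BytesIO()
np.savez(buf, **arrays)
buf.seek(0)

with np.load(buf) as npz:
    # Separate the per-layer exports from the standard last-layer key.
    layer_keys = sorted(k for k in npz.files if "_layer_" in k)
    print(layer_keys)
```

The same pattern applies to a real `outputs_layers.npz`: replace the in-memory buffer with the file path and iterate over `npz.files`.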
The bundled Lite lookup table currently contains last-layer reference embeddings for:
- ESM-2
- ProstT5
- ProtT5
- Ankh3-Large
- ESM3c
So the lookup bundle is broader than the built-in Lite embedder. The lookup side is not limited to only two model spaces; the current local embedding CLI is the narrower component.
The current built-in end-to-end Lite pipeline is intended for the last-layer models that the Lite embedder can generate directly today:
- ProtT5
- Ankh3
If you provide externally generated embeddings that match the lookup bundle keys and format, the lookup layer itself is separate and can in principle operate on the additional lookup-table models as well. So the practical distinction is:
- built-in Lite embedding CLI: currently `prot_t5` and `ankh3`, with optional layer export
- bundled lookup table: `ESM-2`, `ProstT5`, `ProtT5`, `Ankh3-Large`, and `ESM3c`, using last-layer reference embeddings
- Input: FASTA files (`.fa`, `.faa`, `.fasta`) and gzip-compressed versions (`.fa.gz`, `.fasta.gz`)
- Output: CSV files for results, NPZ files for embeddings, TXT files for TopGO compatibility
- CUDA compatibility: Set the `TORCH_INDEX` environment variable for specific CUDA versions
- Memory errors: Use `--serial-models` and process one model at a time
- Missing dependencies: The pipeline automatically installs required packages
- Lookup bundle missing: Download from Zenodo and extract to `data/lookup/`
- Out-of-memory errors: Reduce `--embed-models` to a single model and keep `--serial-models` enabled
- GPU memory: Use `--serial-models` to prevent multiple models loading simultaneously
- Processing speed: Start with the `prot_t5` model for fastest results
- Large files: Increase `--chunk-size` for better memory management
- Gzipped FASTA files are decompressed on the fly; no manual prep is required
- Sequences longer than the model limit are skipped and logged; other sequences continue
- Each pipeline run creates a timestamped directory (`outputs_YYYYMMDD_HHMMSS`)
- Parallel model execution is technically possible but rarely worth the memory cost
- Can I use `.gz` FASTA files? Yes. Compression is handled automatically.
- What if a sequence is too long? It is recorded in `outputs/failed_sequences.csv`; the rest of the batch continues.
- Does the lookup bundle include ESM3c? Yes. The current Lite lookup bundle includes `ESM3c`, but the built-in Lite embedder CLI is still focused on `prot_t5` and `ankh3`.
FANTASIA Lite V1 is derived from the full FANTASIA pipeline and incorporates methods from GOPredSim. Transformer models are provided via Hugging Face.
Key Publications:
- Performance of protein language models in model organisms
- Application of FANTASIA to functional annotation of dark proteomes
- Protocol explaining FANTASIA
Citing FANTASIA
If you use FANTASIA in your research, please cite the following publications:
- Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024). Illuminating the functional landscape of the dark proteome across the Animal Tree of Life. DOI: 10.1101/2024.02.28.582465
- Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024). Decoding proteome functional information in model organisms using protein language models. DOI: 10.1101/2024.02.14.580341
Main Developers:
- Ana M. Rojas: a.rojas.m@csic.es
- Àlex Domínguez Rodríguez: adomrod4@upo.es
Project Team:
- Ana M. Rojas: a.rojas.m@csic.es
- Rosa Fernández: rosa.fernandez@ibe.upf-csic.es
- Aureliano Bombarely: abombarely@ibmcp.upv.es
- Ildefonso Cases: icasdia@upo.es
- Àlex Domínguez Rodríguez: adomrod4@upo.es
- Gemma I. Martínez-Redondo: gemma.martinez@ibe.upf-csic.es
- Belén Carbonetto: belen.carbonetto.metazomics@gmail.com
- Iñigo de Martín: imartinagirre@gmail.com
- Sofía García Juan
Version: FANTASIA Lite V1
Last Updated: November 2025
Funded by EOSC-OSCARS Fun4Biodiversity