a frontier protein–language generative model — because proteins deserve better small talk.
Decoding the molecular language of proteins — generate, predict, and (politely) interrogate sequences.
Try it live: Evolla-10B chat server — no pip install required for curiosity.
Table of contents (pick your adventure)
Hiring: Two PhD spots for international students at Westlake University — details on X. Come help proteins find their words.
- 2026/02/11 Updated the paper Decoding the Molecular Language of Proteins with Evolla with two new sections: Inference of Eukaryotic Complexity in Asgard Archaea by Chatting with Evolla and Discovery of a Novel Deep-sea PET Hydrolase via Evolla.
- 2025/07/26 Evolla was added to Hugging Face Transformers (model documentation).
- 2025/04/23 Evolla-80B released on the Hugging Face Hub.
- 2025/03/12 Evolla-10B-DPO and Evolla-10B-DPO-hf released on the Hugging Face Hub.
- 2025/02/19 Evolla-10B-hf released on the Hugging Face Hub (Transformers-compatible weights).
- 2025/01/06 Paper drop: Decoding the Molecular Language of Proteins with Evolla.
- 2024/12/06 Evolla-10B landed on the Hugging Face Hub (no assembly required beyond git lfs).
API reference & examples: Evolla in Hugging Face Transformers (EvollaProcessor, EvollaForProteinText2Text, configs, and tips such as matching aa_seq / foldseek length).
For checkpoints with Hugging Face support (Evolla-10B-hf, Evolla-10B-DPO-hf), you can load Evolla like any other Hub model: install PyTorch and a Transformers release that includes EvollaProcessor and EvollaForProteinText2Text, then from_pretrained the model id.
You do not need to clone this repository, run environment.sh, or download the SaProt / Llama checkpoints in the sections below—the Hub weights match the Transformers API.
from transformers import EvollaProcessor, EvollaForProteinText2Text
model_id = "westlake-repl/Evolla-10B-hf"
processor = EvollaProcessor.from_pretrained(model_id)
model = EvollaForProteinText2Text.from_pretrained(
model_id,
device_map="auto",
).eval()
# Build protein_informations (aa_seq, foldseek) and chat messages, then:
# inputs = processor(protein_informations, messages_list, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs)

Adjust device_map, dtype, and generation settings for your hardware. For TSV-style batch inference inside a checkout of this repo, see scripts/inference_hf.py (multi-GPU uses torch.distributed).
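Before invoking the processor, it can help to sanity-check the two input structures the comments above refer to. A minimal sketch, assuming the dict keys shown in the snippet (aa_seq, foldseek) and the Transformers tip that the two strings must have matching length; the sequence, 3Di string, and question below are placeholders, not real data:

```python
# Each entry pairs an amino-acid sequence with its Foldseek 3Di string.
# Per the Transformers tips, aa_seq and foldseek must be the same length.
protein_informations = [
    {"aa_seq": "MEEPQSDPSV", "foldseek": "dddddddddd"},  # placeholder data
]

# One chat (a list of messages) per protein, in the same order.
messages_list = [
    [{"role": "user", "content": "What is the function of this protein?"}],
]

# Cheap offline checks before spending GPU time on the 10B model.
assert len(protein_informations) == len(messages_list)
for protein in protein_informations:
    assert len(protein["aa_seq"]) == len(protein["foldseek"])
```

Catching a length mismatch here is much cheaper than debugging a shape error inside generate.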
Full setup (this repository): from Environment installation through Run Evolla describes the stack maintained here—conda env + environment.sh, local clones of Evolla-10B non-hf weights plus SaProt and Llama, and scripts/inference.py.
conda create -n Evolla python=3.10
conda activate Evolla
bash environment.sh

Pre-trained Evolla-10B lives on the Hugging Face Hub. Clone the checkpoints (grab coffee if the network is shy):
cd ckpt/huggingface
git lfs install
git clone https://huggingface.co/westlake-repl/Evolla-10B
git clone https://huggingface.co/westlake-repl/SaProt_650M_AF2
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

All rows are Hugging Face Hub repos. Names ending in ‑hf ship the Transformers-compatible layout (EvollaProcessor / EvollaForProteinText2Text; see the model documentation). The rest are the original checkpoints used with this repo’s stack—clone them alongside SaProt and Llama as above, then run scripts/inference.py.
Evolla batch inference expects a TSV like examples/inputs.tsv. To build it from structures (PDB or mmCIF) instead of hand-writing sequences:
- Foldseek — Install a working foldseek binary and pass its path to the helper script (the default in the script is only an example; use your own install).
- Structures — Put the chains you care about in one directory per batch (.pdb / .cif files).
- Questions — Supply prompts with repeated --question "..." and/or a --questions-file with one question per line (blank lines ignored). You can combine both.
Multiple questions → Cartesian product. get_input_files.py writes one row per (structure file, question) pair. If you have m questions and n proteins (structure files) in a directory, the generated TSV has m×n lines—every question is paired with every structure.
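The pairing rule can be sketched in a few lines. This is a hypothetical stand-in for what get_input_files.py does internally, not its actual code; the file and question names are made up:

```python
from itertools import product

def pair_rows(structure_files, questions):
    """One row per (structure file, question) pair: n files x m questions."""
    return [(f, q) for f, q in product(structure_files, questions)]

files = ["proteinA.pdb", "proteinB.cif", "proteinC.pdb"]
questions = ["What is the catalytic activity?", "Where is it localized?"]

rows = pair_rows(files, questions)
assert len(rows) == len(files) * len(questions)  # 3 x 2 = 6 rows
```

Keep this multiplication in mind when batching: a directory of 1,000 structures and 10 questions already yields a 10,000-row TSV.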
From the repo root, generate a TSV (see python scripts/get_input_files.py --help for all flags):
PYTHONPATH=. python scripts/get_input_files.py \
--foldseek /path/to/foldseek \
--structure-result path/to/structures_dir path/to/output.tsv \
--question "What is the catalytic activity of this protein?"

Use --rewrite to overwrite an existing output file. Repeat --structure-result DIR OUT.tsv for multiple directory → output pairs.
Confidence-aware Foldseek tokens. Predicted structures (for example from AlphaFold) are not uniformly reliable at every residue. To avoid over-weighting uncertain regions, get_input_files.py follows the convention used in this project: where per-residue pLDDT is below 70, the corresponding Foldseek 3Di character is replaced with # (masking), so low-confidence geometry contributes less to the structural string. If you build the foldseek_sequence column yourself from raw Foldseek output without a similar mask, the distribution may diverge somewhat from what the model saw during training, and answers can be a bit less reliable than with this preprocessing. For best alignment with the released checkpoints, we recommend keeping the same masking rule when you can.
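The masking rule described above can be illustrated as follows. This is a sketch of the convention, not the script's actual implementation; the 3Di string and pLDDT scores are placeholders, and only the threshold of 70 comes from the text:

```python
PLDDT_THRESHOLD = 70.0

def mask_low_confidence(foldseek_seq, plddt_scores, threshold=PLDDT_THRESHOLD):
    """Replace 3Di characters with '#' wherever per-residue pLDDT < threshold."""
    assert len(foldseek_seq) == len(plddt_scores)
    return "".join(
        "#" if score < threshold else char
        for char, score in zip(foldseek_seq, plddt_scores)
    )

# Residues 2 and 3 fall below 70, so their 3Di letters are masked.
masked = mask_low_confidence("dvqa", [95.1, 42.0, 69.9, 88.3])
assert masked == "d##a"
```

Masking rather than deleting keeps the foldseek_sequence the same length as aa_sequence, which the model expects.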
Each row is tab-separated: (protein_id, aa_sequence, foldseek_sequence, question_in_json_string).
| Column | Meaning |
|---|---|
| protein_id | Row id |
| aa_sequence | Amino acid sequence |
| foldseek_sequence | Same chain in Foldseek format |
| question_in_json_string | Question, serialized with json.dumps |
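Building a row in this format can be sketched like so. The helper below is hypothetical (get_input_files.py writes the real file); it just illustrates the tab separation and the json.dumps convention for the question column:

```python
import json

def make_tsv_row(protein_id, aa_sequence, foldseek_sequence, question):
    """Tab-join the four columns; the question is JSON-serialized."""
    return "\t".join(
        [protein_id, aa_sequence, foldseek_sequence, json.dumps(question)]
    )

row = make_tsv_row(
    "prot_0", "MEEPQSDPSV", "dddddddddd",
    "What is the catalytic activity of this protein?",
)
assert row.count("\t") == 3  # four columns, three tabs
```

JSON-encoding the question keeps embedded tabs, quotes, or newlines from corrupting the TSV; the reader recovers the original text with json.loads.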
Runs the project config and non-hf checkpoints prepared above. From the repo root — swap /your/path/to/Evolla for your clone:
cd /your/path/to/Evolla
python scripts/inference.py --config_path config/Evolla_10B.yaml --input_path examples/inputs.tsv

If you cloned the repo but want the Hub Evolla-*-hf models via EvollaForProteinText2Text, use scripts/inference_hf.py for TSV batching (see --help; the script uses torch.distributed—launch accordingly for your GPU layout).
If this repo saved you a weekend, please cite:
@article{zhou2025decoding,
title={Decoding the Molecular Language of Proteins with Evolla},
author={Zhou, Xibin and Han, Chenchen and Zhang, Yingqi and Su, Jin and Zhuang, Kai and Jiang, Shiyu and Yuan, Zichen and Zheng, Wei and Dai, Fengyuan and Zhou, Yuyang and others},
journal={bioRxiv},
pages={2025--01},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}