Evolla

a frontier protein–language generative model — because proteins deserve better small talk.

Paper on bioRxiv · Hugging Face model repositories · Evolla in Hugging Face Transformers · Post on X

Decoding the molecular language of proteins — generate, predict, and (politely) interrogate sequences.

Try it live: Evolla-10B chat server — no pip install required for curiosity.

Table of contents (pick your adventure)

  - News
  - Overview
  - Use Evolla with Hugging Face Transformers
  - Environment installation
  - Prepare the Evolla model
  - Model checkpoints
  - Prepare input data
  - Run Evolla
  - Citation
  - Other resources

Hiring: Two PhD spots for international students at Westlake University — details on X. Come help proteins find their words.

News

Overview

(Figure) Overview of Evolla.

Use Evolla with Hugging Face Transformers

API reference & examples: Evolla in Hugging Face Transformers (EvollaProcessor, EvollaForProteinText2Text, configs, and tips such as matching aa_seq / foldseek length).

For checkpoints with Hugging Face support (Evolla-10B-hf, Evolla-10B-DPO-hf), you can load Evolla like any other Hub model: install PyTorch and a Transformers release that includes EvollaProcessor and EvollaForProteinText2Text, then from_pretrained the model id.

You do not need to clone this repository, run environment.sh, or download the SaProt / Llama checkpoints in the sections below—the Hub weights match the Transformers API.

from transformers import EvollaProcessor, EvollaForProteinText2Text

model_id = "westlake-repl/Evolla-10B-hf"
processor = EvollaProcessor.from_pretrained(model_id)
model = EvollaForProteinText2Text.from_pretrained(
    model_id,
    device_map="auto",
).eval()

# Build protein_informations (aa_seq, foldseek) and chat messages, then:
# inputs = processor(protein_informations, messages_list, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs)
# texts = processor.batch_decode(outputs, skip_special_tokens=True)

Adjust device_map, dtype, and generation settings for your hardware. For TSV-style batch inference inside a checkout of this repo, see scripts/inference_hf.py (multi-GPU uses torch.distributed).
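For context, here is a minimal sketch of the two inputs the comments above reference. The `aa_seq` / `foldseek` key names and chat-style message dicts follow the Hub model documentation, but treat the exact schema as an assumption and check the Transformers docs for your release:

```python
# Toy inputs for EvollaProcessor: one protein plus one chat turn.
# Key names ("aa_seq", "foldseek") are taken from the model docs;
# the sequences below are made-up placeholders.
aa_seq = "MEEEIAALVIDNGSGMCKAG"    # toy amino-acid sequence
foldseek = "dddddddddddddddddddd"  # matching Foldseek 3Di string, same length

protein_informations = [{"aa_seq": aa_seq, "foldseek": foldseek}]
messages_list = [
    [{"role": "user", "content": "What is the catalytic activity of this protein?"}]
]

# The processor expects aa_seq and foldseek to have equal length.
assert len(aa_seq) == len(foldseek)
```

One protein entry pairs with one message list; batch by appending more entries to both lists in parallel.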


Full setup (this repository): from Environment installation through Run Evolla describes the stack maintained here—conda env + environment.sh, local clones of Evolla-10B non-hf weights plus SaProt and Llama, and scripts/inference.py.

Environment installation

Create a virtual environment

conda create -n Evolla python=3.10
conda activate Evolla

Install packages

bash environment.sh

Prepare the Evolla model

Pre-trained Evolla-10B lives on the Hugging Face Hub. Clone the checkpoints (grab coffee if the network is shy):

cd ckpt/huggingface

git lfs install

git clone https://huggingface.co/westlake-repl/Evolla-10B

git clone https://huggingface.co/westlake-repl/SaProt_650M_AF2

git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Model checkpoints

All rows are Hugging Face Hub repos. Names ending in ‑hf ship the Transformers-compatible layout (EvollaProcessor / EvollaForProteinText2Text; see the model documentation). The rest are the original checkpoints used with this repo’s stack—clone them alongside SaProt and Llama as above, then run scripts/inference.py.

| Checkpoint | Params | Training objective | transformers-compatible |
| --- | --- | --- | --- |
| Evolla-10B on Hugging Face | 10B | Causal protein–language modeling (CPLM) | No |
| Evolla-10B-hf on Hugging Face | 10B | Causal protein–language modeling (CPLM) | Yes |
| Evolla-10B-DPO on Hugging Face | 10B | Direct preference optimization (DPO) | No |
| Evolla-10B-DPO-hf on Hugging Face | 10B | Direct preference optimization (DPO) | Yes |
| Evolla-80B on Hugging Face | 80B | Causal protein–language modeling (CPLM) | No |

Prepare input data

Evolla batch inference expects a TSV like examples/inputs.tsv. To build it from structures (PDB or mmCIF) instead of hand-writing sequences:

  1. Foldseek — Install a working foldseek binary and pass its path to the helper script (default in the script is only an example; use your own install).
  2. Structures — Put the chains you care about in one directory per batch (.pdb / .cif files).
  3. Questions — Supply prompts with repeated --question "..." and/or a --questions-file with one question per line (blank lines ignored). You can combine both.

Multiple questions → Cartesian product. get_input_files.py writes one row per (structure file, question) pair. If you have m questions and n proteins (structure files) in a directory, the generated TSV has m×n lines—every question is paired with every structure.
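The pairing logic can be sketched with `itertools.product` (file and question names below are hypothetical):

```python
# Sketch of the pairing get_input_files.py performs: every question is
# combined with every structure file, one TSV row per pair.
from itertools import product

questions = ["What is the catalytic activity?", "Where is it localized?"]
structures = ["prot_a.pdb", "prot_b.cif", "prot_c.pdb"]  # hypothetical files

rows = [(s, q) for s, q in product(structures, questions)]
assert len(rows) == len(structures) * len(questions)  # m x n = 6 rows
```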

From the repo root, generate a TSV (see python scripts/get_input_files.py --help for all flags):

PYTHONPATH=. python scripts/get_input_files.py \
  --foldseek /path/to/foldseek \
  --structure-result path/to/structures_dir path/to/output.tsv \
  --question "What is the catalytic activity of this protein?"

Use --rewrite to overwrite an existing output file. Repeat --structure-result DIR OUT.tsv for multiple directory → output pairs.

Confidence-aware Foldseek tokens. Predicted structures (for example from AlphaFold) are not uniformly reliable at every residue. To avoid over-weighting uncertain regions, get_input_files.py follows the convention used in this project: where per-residue pLDDT is below 70, the corresponding Foldseek 3Di character is replaced with # (masking), so low-confidence geometry contributes less to the structural string. If you build the foldseek_sequence column yourself from raw Foldseek output without a similar mask, the distribution may diverge somewhat from what the model saw during training, and answers can be a bit less reliable than with this preprocessing. For best alignment with the released checkpoints, we recommend keeping the same masking rule when you can.
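The masking convention above can be sketched in a few lines; the helper name is hypothetical (`get_input_files.py` implements its own version), but the rule is the one described: 3Di characters at residues with pLDDT below 70 become `#`.

```python
# Hedged sketch of confidence-aware masking: replace the Foldseek 3Di
# character with "#" wherever the residue's pLDDT falls below the cutoff.
def mask_low_confidence(foldseek_seq: str, plddt: list, cutoff: float = 70.0) -> str:
    assert len(foldseek_seq) == len(plddt)
    return "".join(
        "#" if score < cutoff else char
        for char, score in zip(foldseek_seq, plddt)
    )

masked = mask_low_confidence("dvqa", [95.0, 62.5, 88.1, 40.0])
# → "d#q#" (residues 2 and 4 fall below the cutoff)
```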

TSV format

Each row is tab-separated: (protein_id, aa_sequence, foldseek_sequence, question_in_json_string).

| Column | Meaning |
| --- | --- |
| protein_id | Row id |
| aa_sequence | Amino acid sequence |
| foldseek_sequence | Same chain as a Foldseek 3Di string |
| question_in_json_string | Question, serialized with json.dumps |
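Writing one such row looks like this (the id and sequences are made-up placeholders; only the tab separation and the JSON-serialized question column come from the format above):

```python
# Sketch of one TSV row in the format described above. The question
# column is a JSON string, so it round-trips through json.loads.
import json

row = [
    "P68871",       # protein_id (hypothetical)
    "MVHLTPEEK",    # aa_sequence (toy fragment)
    "dvqadvqad",    # foldseek_sequence, same length as aa_sequence
    json.dumps("What is the function of this protein?"),
]

with open("inputs.tsv", "w") as fh:
    fh.write("\t".join(row) + "\n")
```

Plain `"\t".join` is used rather than the csv module so the JSON quotes in the last column are written verbatim.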

Run Evolla

Use inference.py (default stack in this repo)

Runs the project config and non-hf checkpoints prepared above. From the repo root — swap /your/path/to/Evolla for your clone:

cd /your/path/to/Evolla
python scripts/inference.py --config_path config/Evolla_10B.yaml --input_path examples/inputs.tsv

Use inference_hf.py (Transformers / -hf weights)

If you cloned the repo but want Hub Evolla-*-hf models via EvollaForProteinText2Text, use scripts/inference_hf.py for TSV batching (see --help; the script uses torch.distributed—launch accordingly for your GPU layout).

Citation

If this repo saved you a weekend, please cite:

@article{zhou2025decoding,
  title={Decoding the Molecular Language of Proteins with Evolla},
  author={Zhou, Xibin and Han, Chenchen and Zhang, Yingqi and Su, Jin and Zhuang, Kai and Jiang, Shiyu and Yuan, Zichen and Zheng, Wei and Dai, Fengyuan and Zhou, Yuyang and others},
  journal={bioRxiv},
  pages={2025--01},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

Other resources
