Trias is an encoder-decoder language model trained to reverse-translate protein sequences into codon sequences. It learns codon usage patterns from 10 million mRNA coding sequences across 640 vertebrate species, enabling context-aware sequence generation without requiring handcrafted rules.
Trias uses Python 3.10 and logs training to Weights & Biases.
conda create -n trias python=3.10 && conda activate trias
git clone https://github.com/lareaulab/Trias.git && cd Trias
pip install -e .Benchmarking notebook (optional). notebooks/benchmarking.ipynb also needs:
pip install CodonTransformer
git clone https://github.com/goodarzilab/cdsFM.git
pip install xformersTraining (optional). BartConfig defaults to FlashAttention-2:
pip install flash-attn --no-build-isolationFor CPU or non-flash inference, pass attn_implementation="sdpa" to from_pretrained.
Generate a codon sequence from a protein with the lareaulab/Trias checkpoint. Three decoding modes:
greedy— fast, deterministic.beam— deterministic, explores--beam_widthpaths.nucleus— stochastic, samples from the top---top_p; output differs every run unless you pass--seed.
python scripts/reverse_translation.py \
--model_path lareaulab/Trias \
--protein_sequence "MTEITAAMVKELRESTGAGMMDCKNALSETQ*" \
--species "Homo sapiens" \
--decoding greedyFor beam: add --decoding beam --beam_width 5. For nucleus: add --decoding nucleus --top_p 0.9 (and --seed 42 for reproducibility).
Required columns:
protein— amino acid sequence, must end with*species_name— e.g.,"Homo sapiens"mrna— full mRNA sequencecodon_start,codon_end— 0-based indices of the CDS inmrna
Supported formats: .parquet, .csv, .json.
bash scripts/train_trias.shEdit the script to change model architecture (hidden size, layers, heads) or training hyperparameters (steps, batch size, learning rate).
All figure code lives in notebooks/trias_figures.ipynb. All required data is bundled in data.zip:
unzip data.zipThis extracts a data/ directory with:
| File | Source / use |
|---|---|
GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz |
tissue expression — GTEx Portal V8 |
codon_table.csv |
sRSCU table per species |
human_test_dataset.csv |
held-out test set used for figure metrics |
interpro_output.tsv |
InterPro domain annotations |
train_data_seq_len.csv |
sequence-length distribution of training data |
wandb_training_run.csv |
W&B-exported training curves |
benchmarks/moderna/{gfp,luciferase}.csv |
Bicknell et al. 2024, Cell Reports |
benchmarks/gemorna/{fluc,nanoluc_leppek}.csv |
benchmarks from the GEMORNA paper |
@article{faizi2025,
title={A generative language model decodes contextual constraints on codon choice for mRNA design},
author={Marjan Faizi and Helen Sakharova and Liana F. Lareau},
journal={bioRxiv},
year={2025},
url={https://doi.org/10.1101/2025.05.13.653614}
}