Trias: an encoder-decoder model for generating synthetic eukaryotic mRNA sequences

Trias is an encoder-decoder language model trained to reverse-translate protein sequences into codon sequences. It learns codon usage patterns from 10 million mRNA coding sequences across 640 vertebrate species, enabling context-aware sequence generation without requiring handcrafted rules.

Setup

Trias uses Python 3.10 and logs training to Weights & Biases.

conda create -n trias python=3.10 && conda activate trias
git clone https://github.com/lareaulab/Trias.git && cd Trias
pip install -e .

Benchmarking notebook (optional). notebooks/benchmarking.ipynb also needs:

pip install CodonTransformer
git clone https://github.com/goodarzilab/cdsFM.git
pip install xformers

Training (optional). BartConfig defaults to FlashAttention-2:

pip install flash-attn --no-build-isolation

For CPU or non-flash inference, pass attn_implementation="sdpa" to from_pretrained.
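
The choice between the two attention backends can be made at runtime. The following is a minimal sketch, not code from the Trias repository: `pick_attn_implementation` and `detect_environment` are hypothetical helpers, and the loader class in the closing comment is a placeholder, since this README does not name the class used by the scripts.

```python
import importlib.util


def pick_attn_implementation(has_flash_attn: bool, has_cuda: bool) -> str:
    """Return the value to pass as attn_implementation (hypothetical helper)."""
    # FlashAttention-2 needs both the flash-attn package and a CUDA GPU;
    # in every other case fall back to PyTorch scaled-dot-product attention.
    if has_flash_attn and has_cuda:
        return "flash_attention_2"
    return "sdpa"


def detect_environment() -> tuple[bool, bool]:
    """Report whether flash_attn is installed and whether a CUDA device is visible."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    has_cuda = False
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch
            has_cuda = torch.cuda.is_available()
        except ImportError:
            # A broken torch install is treated the same as no GPU.
            has_cuda = False
    return has_flash_attn, has_cuda


if __name__ == "__main__":
    attn = pick_attn_implementation(*detect_environment())
    print(attn)
    # The string would then be forwarded to the loader, for example
    #   SomeModelClass.from_pretrained("lareaulab/Trias", attn_implementation=attn)
    # where SomeModelClass is a placeholder, not a name taken from this README.
```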

Reverse translation

Generate a codon sequence from a protein with the lareaulab/Trias checkpoint. Three decoding modes:

greedy: fast, deterministic.
beam: deterministic, explores --beam_width paths.
nucleus: stochastic, samples from the smallest set of tokens whose cumulative probability reaches --top_p; output differs between runs unless you pass --seed.
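
To make the difference between greedy and nucleus decoding concrete, here is a toy sketch on a single next-codon distribution. It is an illustration only and not how Trias decodes internally: the probabilities are invented, and `nucleus_sample` and `sample_many` are hypothetical helpers.

```python
import random


def nucleus_sample(probs: dict[str, float], top_p: float, rng: random.Random) -> str:
    """Sample one token from the smallest set whose cumulative probability reaches top_p."""
    # Rank candidates from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens = [t for t, _ in nucleus]
    weights = [p for _, p in nucleus]
    # random.choices treats the weights as relative, so the nucleus is renormalized.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented next-codon probabilities for a leucine position (illustration only).
leu = {"CTG": 0.40, "CTC": 0.20, "CTT": 0.13, "TTG": 0.13, "CTA": 0.07, "TTA": 0.07}

# Greedy decoding always takes the single most likely codon.
greedy = max(leu, key=leu.get)


def sample_many(seed: int, n: int = 20) -> list[str]:
    """Draw n nucleus samples with a fixed seed so the draw is repeatable."""
    rng = random.Random(seed)
    return [nucleus_sample(leu, 0.7, rng) for _ in range(n)]


print(greedy)
print(sample_many(42))
```

With top_p set to 0.7, only the three most likely codons can be drawn, and fixing the seed makes repeated draws identical, which is the role --seed plays for the nucleus mode.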

python scripts/reverse_translation.py \
  --model_path lareaulab/Trias \
  --protein_sequence "MTEITAAMVKELRESTGAGMMDCKNALSETQ*" \
  --species "Homo sapiens" \
  --decoding greedy

For beam search, use --decoding beam --beam_width 5 in place of --decoding greedy. For nucleus sampling, use --decoding nucleus --top_p 0.9 (add --seed 42 for reproducibility).
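
Whichever mode is used, a generated codon sequence can be sanity-checked by translating it back and comparing it with the input protein. The sketch below is not part of the Trias scripts. It assumes the standard genetic code and that the generated sequence includes the stop codon matching the trailing *; `translate` and `encodes` are hypothetical helper names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence (DNA or RNA alphabet) into amino acids, '*' for stop."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def encodes(cds: str, protein: str) -> bool:
    """Return True if the coding sequence translates exactly to the given protein."""
    return translate(cds) == protein


# Tiny hand-written example, not model output.
print(translate("ATGACCGAGTGA"))
print(encodes("ATGACCGAGTGA", "MTE*"))
```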

Dataset format

Required columns:

protein — amino acid sequence, must end with *
species_name — e.g., "Homo sapiens"
mrna — full mRNA sequence
codon_start, codon_end — 0-based indices of the CDS in mrna

Supported formats: .parquet, .csv, .json.
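
Before training on a custom dataset, each row can be checked against the column requirements above. This is a minimal sketch rather than a validator shipped with Trias. It assumes codon_end is exclusive, as in a Python slice, which this README does not state; `check_row` is a hypothetical helper and the example rows are made up.

```python
REQUIRED_COLUMNS = ("protein", "species_name", "mrna", "codon_start", "codon_end")


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one dataset row (empty if none)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    if not row["protein"].endswith("*"):
        problems.append("protein does not end with '*'")

    # Values read from CSV arrive as strings, so convert before comparing.
    start, end = int(row["codon_start"]), int(row["codon_end"])
    # Assumption: codon_end is exclusive, so the CDS is mrna[start:end].
    if not 0 <= start < end <= len(row["mrna"]):
        problems.append("CDS indices fall outside mrna")
    elif (end - start) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    return problems


# Made-up rows: two flanking bases on each side of a 12-nt CDS.
good = {"protein": "MTE*", "species_name": "Homo sapiens",
        "mrna": "GGATGACCGAGTGACC", "codon_start": 2, "codon_end": 14}
bad = {"protein": "MTE", "species_name": "Homo sapiens",
       "mrna": "GGATGACCGAGTGACC", "codon_start": 2, "codon_end": 13}

print(check_row(good))
print(check_row(bad))
```

If the indices turn out to be inclusive, the length check becomes (end - start + 1) % 3 instead.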

Training

bash scripts/train_trias.sh

Edit the script to change model architecture (hidden size, layers, heads) or training hyperparameters (steps, batch size, learning rate).

Reproducing figures

All figure code lives in notebooks/trias_figures.ipynb. All required data is bundled in data.zip:

unzip data.zip

This extracts a data/ directory with:

File	Source / use
`GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz`	tissue expression — GTEx Portal V8
`codon_table.csv`	sRSCU table per species
`human_test_dataset.csv`	held-out test set used for figure metrics
`interpro_output.tsv`	InterPro domain annotations
`train_data_seq_len.csv`	sequence-length distribution of training data
`wandb_training_run.csv`	W&B-exported training curves
`benchmarks/moderna/{gfp,luciferase}.csv`	Bicknell et al. 2024, Cell Reports
`benchmarks/gemorna/{fluc,nanoluc_leppek}.csv`	benchmarks from the GEMORNA paper

Citation

@article{faizi2025,
  title={A generative language model decodes contextual constraints on codon choice for mRNA design},
  author={Marjan Faizi and Helen Sakharova and Liana F. Lareau},
  journal={bioRxiv},
  year={2025},
  url={https://doi.org/10.1101/2025.05.13.653614}
}
