Statistical modeling of the relationship between mRNA and protein abundances at single-cell resolution using negative binomial distributions and log-normal noise models.
This project models protein abundance from mRNA counts in single cells by:
- Fitting negative binomial (NB) distributions to mRNA count data
- Simulating protein levels using a log-normal multiplicative noise model
- Jointly optimizing mRNA and protein parameters to match observed distributions
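As a rough illustration of the first step, the sketch below fits an NB distribution to counts by maximum likelihood with SciPy. The helper name `fit_nb_sketch`, the (mean, dispersion) parameterization, and the synthetic data are assumptions made for this example; the notebook's own `fit_nb_mle` may be implemented differently.

```python
import numpy as np
from scipy import optimize, stats


def fit_nb_sketch(x):
    """Maximum-likelihood NB fit; returns (mu_hat, r_hat).

    Hypothetical helper for illustration, not the notebook's fit_nb_mle.
    """
    x = np.asarray(x)
    # In the (mean, dispersion) parameterization the MLE of the mean
    # is the sample mean, so only the dispersion r needs a search.
    mu_hat = float(x.mean())

    def neg_loglik(log_r):
        r = np.exp(log_r)           # optimize on the log scale so r stays positive
        p = r / (r + mu_hat)        # convert (mu, r) to SciPy's (n, p)
        return -stats.nbinom.logpmf(x, r, p).sum()

    res = optimize.minimize_scalar(neg_loglik, bounds=(-5.0, 10.0), method="bounded")
    return mu_hat, float(np.exp(res.x))


# Synthetic counts with known parameters (mu = 5, r = 2), for a quick sanity check
rng = np.random.default_rng(0)
x = rng.negative_binomial(2.0, 2.0 / (2.0 + 5.0), size=5000)
mu_hat, r_hat = fit_nb_sketch(x)
print(mu_hat, r_hat)
```

The fitted values should land close to the parameters used to generate the data.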
The core model assumes:
P = (c * m + ε) * LogNormal(0, σ)
where m is mRNA count (NB distributed), c is a scaling factor, ε prevents log(0), and σ captures biological/technical noise.
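A minimal forward simulation of this model with NumPy might look like the following. The function name `simulate_protein`, the default ε = 1, and the parameter values are assumptions chosen for illustration; σ is assumed to be the standard deviation of the noise on the natural-log scale.

```python
import numpy as np


def simulate_protein(mu, r, c, sigma, eps=1.0, n_cells=10_000, seed=0):
    """Sample m ~ NB(mean=mu, dispersion=r), then P = (c*m + eps) * LogNormal(0, sigma)."""
    rng = np.random.default_rng(seed)
    p = r / (r + mu)                                   # NumPy's (n, p) parameterization
    m = rng.negative_binomial(r, p, size=n_cells)      # mRNA counts
    noise = rng.lognormal(mean=0.0, sigma=sigma, size=n_cells)
    protein = (c * m + eps) * noise                    # multiplicative noise
    return m, protein


m, protein = simulate_protein(mu=5.0, r=2.0, c=100.0, sigma=0.5)
print(m.mean(), np.median(protein))
```

Because ε is strictly positive, every simulated protein value is positive even for cells with zero mRNA counts, so taking logs downstream is safe.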
├── models/
│   ├── APPROACHES.md           # Modeling approaches and equations
│   └── Model_V1.ipynb          # Main analysis notebook (Google Colab)
├── preprocessing/
│   └── preproc_and_filter.R    # R preprocessing script for protein and mRNA data
├── dat/                        # Local data directory (optional)
└── requirements.txt            # Python dependencies
Raw sequencing data is available on GEO: GSE244215
Processed data for modeling is stored on Google Drive: Data Folder
All data are from a single cell type, basal cells (see preproc_and_filter.R).
Drive/
└── Protein_RNA_Modeling/
    ├── raw/          # mRNA counts, Raw protein data, metadata for both
    ├── processed/    # Cleaned/filtered protein data for our modeling purposes
    ├── results/      # Model outputs and figures
    └── notebooks/    # Working copies of Colab notebooks
- GitHub source links (used by the Colab badge) will break if files are moved/renamed in the repo — update the badge URL if you reorganize.
- Google Drive links are based on file ID, not path — notebooks can be moved freely within Drive without breaking share links.
The notebook includes a data loading cell that mounts Google Drive and loads the data. Just update the DATA_PATH variable to point to your data folder.
- Open Model_V1.ipynb in Google Colab
- Run the first cell to mount Google Drive
- Update DATA_PATH to point to your data folder
- Run cells sequentially
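For orientation, a data loading cell along these lines could be sketched as below. The Drive location under `MyDrive` and the `RAW_DIR` / `PROCESSED_DIR` names are assumptions based on the folder layout above, not the notebook's actual cell; outside Colab the sketch falls back to the local `dat/` directory.

```python
import os

try:
    # Only available inside Google Colab
    from google.colab import drive

    drive.mount("/content/drive")
    # Assumed location of the shared folder in your Drive; adjust as needed
    DATA_PATH = "/content/drive/MyDrive/Protein_RNA_Modeling"
except ImportError:
    # Not running in Colab: use the repo's optional local data directory
    DATA_PATH = "dat"

RAW_DIR = os.path.join(DATA_PATH, "raw")              # hypothetical variable name
PROCESSED_DIR = os.path.join(DATA_PATH, "processed")  # hypothetical variable name
print(DATA_PATH)
```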
| Function | Description |
|---|---|
| fit_nb_mle(x) | Fit negative binomial to mRNA counts via MLE |
| simulate_protein_log2fc_from_mrna_nb() | Simulate protein log2FC from mRNA model |
| fit_sigma_log_to_protein() | Grid search for optimal noise parameter σ |
| joint_fit_nb_and_sigma() | Joint optimization of NB and noise parameters |
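To make the grid search idea concrete, here is a hedged sketch of how σ could be chosen by comparing simulated and observed protein log2FC distributions. The helper names (`simulate_log2fc`, `grid_search_sigma`), the use of the median as the fold-change reference, and the Wasserstein distance as the matching criterion are all assumptions for this example; the notebook's functions may use a different reference and objective.

```python
import numpy as np
from scipy import stats


def simulate_log2fc(mu, r, sigma, c=1.0, eps=1.0, n_cells=5000, seed=0):
    """Simulate protein log2FC relative to the median (an assumed reference)."""
    rng = np.random.default_rng(seed)
    m = rng.negative_binomial(r, r / (r + mu), size=n_cells)
    protein = (c * m + eps) * rng.lognormal(0.0, sigma, size=n_cells)
    log2_p = np.log2(protein)
    return log2_p - np.median(log2_p)


def grid_search_sigma(y_log2fc, mu, r, sigmas=None):
    """Return the grid value of sigma whose simulation is closest to the data."""
    if sigmas is None:
        sigmas = np.linspace(0.05, 2.0, 40)
    # Assumed objective: 1-D Wasserstein distance between the two samples
    distances = [
        stats.wasserstein_distance(simulate_log2fc(mu, r, s), y_log2fc)
        for s in sigmas
    ]
    return float(sigmas[int(np.argmin(distances))])


# Synthetic "observed" data generated with sigma = 0.6 and a different seed
y_obs = simulate_log2fc(mu=5.0, r=2.0, sigma=0.6, seed=1)
best_sigma = grid_search_sigma(y_obs, mu=5.0, r=2.0)
print(best_sigma)
```

On synthetic data like this, the selected σ should fall near the value used to generate the observations, which is a quick way to check that the search behaves sensibly before applying it to real measurements.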
Python (Colab):
- numpy
- scipy
- matplotlib
R (preprocessing):
- Seurat
- stringr
- seqinr
- dplyr
# Fit NB to mRNA counts
mu_hat, r_hat, _ = fit_nb_mle(x_mrna)
# Find optimal sigma to match observed protein distribution
best_sigma = fit_sigma_log_to_protein(y_obs, mu_hat, r_hat)
# Or jointly optimize all parameters
result = joint_fit_nb_and_sigma(
x_mrna,
y_obs,
t_half_m_hours=2.0, # mRNA half-life
t_half_p_hours=24.0 # protein half-life
)
MIT