Extracting Spatial Protein Representations via Interpretability Tensors
ESPRIT probes the internal representations of ESM3, a protein language model, by computing concept vectors — directions in hidden-state space that correspond to known biological properties of amino acid residues. Given a new protein sequence, ESPRIT runs ESM3 inference, projects each residue's hidden state onto these concept directions, and renders the results as an interactive 3D heatmap on the protein structure.
- Overview
- Biological Concepts
- Architecture
- Tech Stack
- Project Structure
- Getting Started
- API Reference
- How It Works
ESPRIT follows a three-stage workflow:
- Data collection — Download ~2,750 protein structures from the RCSB Protein Data Bank (PDB) and extract per-residue biological features (secondary structure, solvent accessibility, binding sites, etc.) into Parquet datasets.
- Concept vector extraction — Run ESM3 forward passes with hooks on every transformer layer, collect hidden states at labeled residue positions, and compute a concept direction per layer as the mean-difference between positive and negative classes.
- Interactive inference — Serve a FastAPI backend that accepts arbitrary protein sequences, runs ESM3, projects residue hidden states onto concept vectors via cosine similarity, and streams the results to a React frontend that renders them on a 3Dmol.js structure viewer.
ESPRIT ships with nine concept probes:
| Concept | Positive class | Negative class |
|---|---|---|
| Disulfide bonds | Disulfide-bonded cysteines | Free cysteines |
| Helix | Helical residues | Coil residues |
| Sheet | Sheet residues | Coil residues |
| Helix vs Sheet | Helical residues | Sheet residues |
| Solvent accessibility | Buried residues (low SASA) | Exposed residues (high SASA) |
| Protein-protein interaction | Interface residues | Non-interface residues |
| Ligand/metal binding | Binding-site residues | Non-binding residues |
| Post-translational modifications | Modified residues | Unmodified residues |
| Disorder | Disordered residues | Ordered residues |
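The table above can be mirrored in code as a small registry. The sketch below is illustrative only: the `Concept` dataclass and `CONCEPTS` mapping are hypothetical names, not the actual contents of `concepts.py`, and the pairing of each table row with a CLI identifier (`disulfide`, `ss_helix`, and so on, as listed with the pipeline commands) is an assumption based on their matching order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """One binary concept probe (hypothetical structure, for illustration)."""
    key: str        # CLI identifier, assumed to match the table row
    label: str      # human-readable name from the table
    positive: str   # description of the positive class
    negative: str   # description of the negative class


_ROWS = [
    ("disulfide", "Disulfide bonds", "Disulfide-bonded cysteines", "Free cysteines"),
    ("ss_helix", "Helix", "Helical residues", "Coil residues"),
    ("ss_sheet", "Sheet", "Sheet residues", "Coil residues"),
    ("ss_helix_sheet", "Helix vs Sheet", "Helical residues", "Sheet residues"),
    ("sasa", "Solvent accessibility", "Buried residues (low SASA)",
     "Exposed residues (high SASA)"),
    ("ppi", "Protein-protein interaction", "Interface residues",
     "Non-interface residues"),
    ("binding", "Ligand/metal binding", "Binding-site residues",
     "Non-binding residues"),
    ("ptm", "Post-translational modifications", "Modified residues",
     "Unmodified residues"),
    ("disorder", "Disorder", "Disordered residues", "Ordered residues"),
]

# Lookup by CLI identifier, e.g. CONCEPTS["sasa"].positive
CONCEPTS = {row[0]: Concept(*row) for row in _ROWS}
```

Keeping the positive and negative class descriptions next to the identifier makes the sign convention explicit: a high score always points toward the positive class.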
┌────────────────────────────────────────────────────────┐
│ React + 3Dmol.js Frontend │
│ Interactive 3D protein viewer with concept heatmaps │
└───────────────────┬────────────────────────────────────┘
│ REST
┌───────────────────▼────────────────────────────────────┐
│ FastAPI Backend (api.py) │
│ Serves concept vectors & runs live ESM3 inference │
└───────────────────┬────────────────────────────────────┘
│
┌───────────────────▼────────────────────────────────────┐
│ ESM3 Inference Engine (inference.py) │
│ Forward hooks on all transformer layers │
│ Cosine similarity projection onto concept directions │
└───────────────────┬────────────────────────────────────┘
│
┌───────────────────▼────────────────────────────────────┐
│ Data Pipeline (Modal / local) │
│ download_structures → extract_features → │
│ extract_concept_vector → validate_concept_vector │
└────────────────────────────────────────────────────────┘
Backend — Python 3.12, FastAPI, PyTorch, ESM3, BioPython, PyArrow, Modal
Frontend — React 19, TypeScript, Vite, Tailwind CSS 4, 3Dmol.js
Infrastructure — Modal (GPU cloud compute), Hugging Face (ESM3 model weights), RCSB PDB (protein structures)
├── api.py # FastAPI server
├── inference.py # ESM3 engine with forward hooks
├── extract_features.py # BioPython feature extraction from mmCIF
├── extract_concept_vector.py # Per-layer concept vector computation
├── validate_concept_vector.py # ROC AUC & linear probe evaluation
├── download_structures.py # RCSB PDB search & download
├── modal_app.py # Modal cloud orchestration
├── concepts.py # Concept definitions & utilities
├── main.py # CLI entry point
├── pyproject.toml # Python dependencies
│
├── web/ # Frontend
│ ├── src/
│ │ ├── App.tsx # Main application component
│ │ ├── ProteinViewer.tsx # 3Dmol.js 3D viewer
│ │ └── pdbParser.ts # PDB file parser
│ ├── package.json
│ └── vite.config.ts
│
└── data/ # Generated artifacts (not committed)
├── cif_files/ # Downloaded mmCIF structures
├── *_features.parquet # Per-concept feature tables
├── *_concept_vectors.pt # Per-layer concept direction tensors
└── download_manifest.json # PDB search results log
- Python 3.12
- uv (recommended) or pip
- Node.js 18+ (for the frontend)
- A Hugging Face account with access to the ESM3 model
- A Modal account (for cloud pipeline; optional for local-only use)
# Clone the repo
git clone <repo-url> && cd bioxai-hackathon
# Install Python dependencies
uv sync # or: pip install -e .
# Install frontend dependencies
cd web
npm install # or: bun install
cd ..

Modal setup (for cloud pipeline):
modal setup
modal secret create huggingface HF_TOKEN=hf_<your_token>

Full pipeline on Modal (recommended):
modal run modal_app.py::run_full_pipeline

Individual stages can also be run separately:
modal run modal_app.py::download_only # Download PDB structures
modal run modal_app.py::extract_only # Extract features to Parquet
modal run modal_app.py::concept_vector_only # Compute concept vectors

Local feature extraction (no GPU required):
python extract_features.py # Extract all features
python extract_concept_vector.py <concept> # Compute vectors for a concept
python validate_concept_vector.py <concept> # Evaluate with ROC AUC

Where <concept> is one of: disulfide, ss_helix, ss_sheet, ss_helix_sheet, sasa, ppi, binding, ptm, disorder.
Sync data from Modal to local:
modal volume get bioxai-data /data ./data

Start the backend and frontend in separate terminals:
# Terminal 1 — API server (port 8000)
python api.py
# Terminal 2 — Frontend dev server
cd web
npm run dev

Open the URL printed by Vite (typically http://localhost:5173). Paste a protein sequence, select a concept and layer, and view the cosine-similarity heatmap on the 3D structure.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Model readiness status |
| GET | /concepts | List available concepts with metadata |
| GET | /concepts/{name}/vectors | Per-layer concept vector norms |
| GET | /concepts/{name}/features | Preview of the feature Parquet |
| POST | /inference/project | Run ESM3 inference and project onto concept vectors |
POST /inference/project accepts a JSON body:
{
"sequence": "MKTLLILAVL...",
"concept": "ss_helix",
"layer": 12
}

Returns per-residue cosine similarity scores between hidden states and the selected concept direction.
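A minimal client sketch for this endpoint, using only the Python standard library. The helper name `build_projection_request` is hypothetical, the base URL assumes the API server is running locally on port 8000 as described above, and the ten-residue sequence is a toy input. The response schema is not documented here, so the sending step is left commented out.

```python
import json
import urllib.request


def build_projection_request(
    sequence: str,
    concept: str,
    layer: int,
    base_url: str = "http://localhost:8000",  # assumed local dev server
) -> urllib.request.Request:
    """Build a POST request for /inference/project with a JSON body."""
    payload = {"sequence": sequence, "concept": concept, "layer": layer}
    return urllib.request.Request(
        url=f"{base_url}/inference/project",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_projection_request("MKTLLILAVL", "ss_helix", 12)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```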
For each concept (e.g., "helix"), the pipeline:
- Loads labeled residue indices from the feature Parquet (positive = helix, negative = coil).
- Runs ESM3 on each chain, capturing hidden states at every transformer layer via forward hooks.
- Collects hidden-state vectors at positive and negative residue positions.
- Computes the concept direction per layer: v = mean(h_pos) - mean(h_neg).
- L2-normalizes the direction for stable cosine projections.
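The last two steps can be sketched for a single layer as follows. This is a dependency-free illustration with plain Python lists and made-up 2-D hidden states; the actual pipeline presumably operates on PyTorch tensors, and the function names `mean_vector` and `concept_direction` are hypothetical.

```python
import math
from typing import Sequence

Vector = list[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def concept_direction(
    h_pos: Sequence[Sequence[float]],
    h_neg: Sequence[Sequence[float]],
) -> Vector:
    """Compute v = mean(h_pos) - mean(h_neg), then L2-normalize it."""
    mu_pos = mean_vector(h_pos)
    mu_neg = mean_vector(h_neg)
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; direction is undefined")
    return [x / norm for x in diff]


# Toy hidden states for one layer (illustrative numbers only).
h_pos = [[2.0, 1.0], [4.0, 1.0]]  # class mean is (3, 1)
h_neg = [[0.0, 1.0], [0.0, 1.0]]  # class mean is (0, 1)
v = concept_direction(h_pos, h_neg)  # difference (3, 0) normalizes to (1, 0)
```

Repeating this per layer yields one unit-length direction for each transformer layer.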
Given a new sequence:
- Tokenize and run ESM3 with hooks active.
- For each residue at the selected layer, compute cos(h_residue, v_concept).
- Return the similarity vector to the frontend, which maps it to a color scale on the 3D structure.
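The projection step reduces to one cosine similarity per residue. The sketch below uses plain Python lists and toy values. `project_residues` and `score_to_rgb` are hypothetical names, and the blue-white-red scale is only an assumed example of the color mapping, which the frontend performs in the real application.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b); returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def project_residues(
    hidden_states: Sequence[Sequence[float]],
    concept_vector: Sequence[float],
) -> list[float]:
    """One score per residue: cos(h_residue, v_concept)."""
    return [cosine_similarity(h, concept_vector) for h in hidden_states]


def score_to_rgb(score: float) -> tuple[int, int, int]:
    """Map a score in [-1, 1] to blue (-1), white (0), red (+1)."""
    s = max(-1.0, min(1.0, score))
    if s >= 0:
        fade = round(255 * (1 - s))
        return (255, fade, fade)
    fade = round(255 * (1 + s))
    return (fade, fade, 255)


# Three residues with toy 2-D hidden states at the selected layer.
hidden = [[3.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
v_concept = [1.0, 0.0]
scores = project_residues(hidden, v_concept)
colors = [score_to_rgb(s) for s in scores]
```

Because cosine similarity ignores vector length, residues with large hidden-state norms do not dominate the heatmap.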
Each concept vector is evaluated on held-out chains using:
- ROC AUC — treating cosine similarity as a binary classifier score.
- Linear probe accuracy — fitting a logistic regression on the projected scalar.
These metrics measure how well the concept direction separates positive from negative residues in unseen proteins.
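To make the ROC AUC metric concrete: it equals the probability that a randomly chosen positive residue scores higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison on toy data. This is an illustration of the definition, not the code in `validate_concept_vector.py`; a library routine such as scikit-learn's `roc_auc_score` is the more practical choice for large residue sets.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 for the positive class and 0 for the negative class.
    Runs in O(n_pos * n_neg) time, which is fine for a small example.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy cosine similarities for four held-out residues.
similarities = [0.9, 0.4, 0.35, 0.1]
is_positive = [1, 0, 1, 0]
auc = roc_auc(similarities, is_positive)  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 means the direction carries no ranking signal for the concept, while 1.0 means every positive residue outscores every negative one.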