feat: Inline ColabFold and AlphaFold2 into BioEmu #206

Merged
josejimenezluna merged 2 commits into main from sarahlewis/inline-colabfold on Mar 30, 2026

@sarahnlewis sarahnlewis commented Mar 22, 2026

Summary

Replace the subprocess-based ColabFold integration (separate venv, patched site-packages, setup.sh) with code inlined directly into the BioEmu package. Everything now runs in a single Python environment.

What changed

Removed

  • src/bioemu/colabfold_setup/ (setup.sh, batch.patch, modules.patch)
  • Subprocess calls to colabfold_batch
  • Separate ColabFold venv (~/.bioemu_colabfold) is no longer needed

Added: src/bioemu/colabfold_inline/

  • msa_client.py: MMseqs2 API client (from colabfold.colabfold, MIT)
  • input_parsing.py: FASTA/A3M parser (from colabfold.batch, MIT)
  • features.py: Monomer feature pipeline wrapping vendored alphafold
  • model_runner.py: AF2 forward pass orchestration, weight downloading
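
As a rough illustration of what a FASTA/A3M parser has to handle, here is a minimal sketch. It is not the code in input_parsing.py: the function names `parse_fasta_like` and `strip_insertions` are hypothetical, and it assumes the usual A3M convention that lowercase letters mark insertions relative to the query.

```python
from typing import List, Optional, Tuple


def parse_fasta_like(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA or A3M text into (header, sequence) pairs.

    Hypothetical helper for illustration only.
    """
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and '#' metadata lines that some A3M files carry.
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_insertions(a3m_row: str) -> str:
    """Drop lowercase insertion columns so the row aligns with the query."""
    return "".join(ch for ch in a3m_row if not ch.islower())


example = """>query
GYDPETGTWG
>hit_1
GYDPaaETG-WG
"""
records = parse_fasta_like(example)
aligned = [strip_insertions(seq) for _, seq in records]
```

After stripping insertions, every row in `aligned` has the same length as the query, which is the property a downstream MSA feature builder would typically rely on.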

Added: src/_vendor/

  • alphafold/: Vendored, patched subset of AlphaFold2 v2.3.2 (Apache 2.0). Patched modules.py to expose representations_evo. Removed ~13,000 lines of unused code (structure module, multimer, relax, templates, data tools).
  • openfold/: Moved from src/bioemu/openfold/, registered with the same sys.modules aliasing.
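
The sys.modules aliasing mentioned above can be sketched as follows. This is a generic illustration of the technique, not the registration code in this PR: the module names `_vendor_demo.alphafold_subset` and `alphafold_demo` and the helper `alias_module` are made up for the example.

```python
import importlib
import sys
import types


def alias_module(alias: str, module: types.ModuleType) -> None:
    """Make `import <alias>` resolve to an already-loaded module object.

    Hypothetical helper; refuses to shadow a different module that is
    already imported under the same name.
    """
    existing = sys.modules.get(alias)
    if existing is not None and existing is not module:
        raise ImportError(f"{alias!r} is already imported from elsewhere")
    sys.modules[alias] = module


# Stand-in for a vendored package that would normally be imported
# from its real location inside the source tree.
vendored = types.ModuleType("_vendor_demo.alphafold_subset")
vendored.SOURCE = "vendored copy"

alias_module("alphafold_demo", vendored)

# The import system consults sys.modules first, so this returns the
# vendored object without touching sys.path.
resolved = importlib.import_module("alphafold_demo")
```

The appeal of this approach is that code written against the upstream import name keeps working unmodified, while the package search path stays untouched.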

CI

  • Replaced conda with uv (astral-sh/setup-uv@v4)
  • Test matrix: Python 3.10, 3.11, 3.12, 3.13
  • JAX_PLATFORMS=cpu for CPU-only CI runners
  • GPU regression tests skipped in CI (require weights + GPU)
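
For local runs that mirror the CPU-only CI setting, the same environment variable can be set from Python. A minimal sketch, assuming it runs before `jax` is first imported; `force_cpu_backend` is a hypothetical helper name, not something defined in this PR.

```python
import os
from typing import MutableMapping, Optional


def force_cpu_backend(env: Optional[MutableMapping[str, str]] = None) -> str:
    """Default JAX_PLATFORMS to 'cpu' without overriding an explicit choice.

    Must be called before `import jax`, since JAX reads the variable
    when it initialises its backends.
    """
    target = os.environ if env is None else env
    target.setdefault("JAX_PLATFORMS", "cpu")
    return target["JAX_PLATFORMS"]


# Demonstrated on plain dicts so the example has no side effects.
selected_default = force_cpu_backend({})
selected_explicit = force_cpu_backend({"JAX_PLATFORMS": "cuda"})
```

Using `setdefault` keeps the helper safe on GPU hosts where a platform has already been chosen explicitly.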

Dependencies

  • JAX, Haiku, ml-collections, and TensorFlow are now required dependencies (no longer optional)
  • [cuda] extra for GPU-specific packages (jax[cuda12], nvcc)
  • Python 3.12 upper bound removed

Tests

  • 94 tests passing in CI (+ 2 GPU regression tests on GPU hosts)
  • GPU regression tests verify embeddings match main branch (correlation >0.9999)
  • Mocks target _run_model only; feature building runs for real
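
The mocking strategy described in the last bullet can be sketched like this. The `Pipeline` class below is a hypothetical stand-in, not the structure of model_runner.py; only the name `_run_model` comes from the PR description.

```python
from unittest import mock


class Pipeline:
    """Toy stand-in: real feature building plus an expensive forward pass."""

    def build_features(self, sequence: str) -> dict:
        # Cheap, deterministic work that should run for real in tests.
        return {"aatype": [ord(c) for c in sequence], "length": len(sequence)}

    def _run_model(self, features: dict) -> dict:
        # Placeholder for the forward pass, which needs weights and an accelerator.
        raise RuntimeError("forward pass is unavailable in unit tests")

    def embed(self, sequence: str) -> dict:
        return self._run_model(self.build_features(sequence))


# Patch only the forward pass; build_features still executes unmodified.
with mock.patch.object(Pipeline, "_run_model", return_value={"single": "fake"}) as fake:
    result = Pipeline().embed("GYDPETGTWG")

# The features that reached the mocked forward pass were built for real.
features_seen = fake.call_args.args[0]
```

Keeping the mock boundary this narrow means a regression in feature construction still fails the unit tests, even on runners without weights or a GPU.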

License compliance

  • Vendored AF2 files retain original DeepMind copyright + Apache 2.0 Section 4(b) modification notices
  • ColabFold-derived files carry MIT attribution headers
  • NOTICE.md and cgmanifest.json updated

E2E verified on GPU

python -m bioemu.sample --sequence GYDPETGTWG --num_samples 2

Produces correct PDB + XTC output with fresh cache (no precomputed embeds or MSAs).

@sarahnlewis sarahnlewis marked this pull request as draft March 22, 2026 21:14
@sarahnlewis sarahnlewis force-pushed the sarahlewis/inline-colabfold branch from f715a1c to 5f8a7ca Compare March 22, 2026 21:19
Replace the subprocess-based ColabFold integration (separate venv, patched
site-packages, setup.sh) with code inlined directly into the BioEmu package.
Everything now runs in a single Python environment.

## What changed

### Removed
- src/bioemu/colabfold_setup/ (setup.sh, batch.patch, modules.patch)
- Subprocess calls to colabfold_batch
- Separate ColabFold venv (~/.bioemu_colabfold) is no longer needed

### Added: src/bioemu/colabfold_inline/
- msa_client.py: MMseqs2 API client (from colabfold.colabfold, MIT)
- input_parsing.py: FASTA/A3M parser (from colabfold.batch, MIT)
- features.py: Monomer feature pipeline wrapping vendored alphafold
- model_runner.py: AF2 forward pass orchestration, weight downloading
- LICENSES/: ColabFold MIT + AlphaFold2 Apache 2.0 license texts

### Added: src/_vendor/alphafold/
Vendored, patched subset of AlphaFold2 v2.3.2 (Apache 2.0):
- Evoformer and model runner (the forward pass)
- Patched modules.py to expose representations_evo
- Removed: structure module, multimer, relax, templates, data tools
  (~13,000 lines of unused code deleted)
- Registered via sys.modules aliasing (no sys.path manipulation)

### Modified
- get_embeds.py: Calls inlined code directly instead of subprocess
- pyproject.toml: JAX, Haiku, ml-collections, TF now required deps;
  [cuda] extra for GPU-specific packages
- README.md: Updated install instructions, removed Python 3.12 cap,
  clarified conda only needed for optional hpacker
- NOTICE.md, cgmanifest.json: Added ColabFold + AlphaFold2 attribution

### Tests
- 96 tests passing (54 new tests for inlined code)
- GPU regression tests verify embeddings match main branch
  (correlation >0.9999, per-residue cosine similarity >0.999)
- Mocks target _run_model (JAX forward pass) only; feature building
  runs for real in unit tests
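
A check of this kind could look roughly like the sketch below. It assumes "correlation" means the Pearson correlation over all flattened embedding values and that cosine similarity is computed per residue row; the PR text does not spell out either detail, and the function names are hypothetical. Plain Python lists stand in for the embedding arrays.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length value sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def embeddings_match(ref: Matrix, new: Matrix,
                     min_corr: float = 0.9999, min_cos: float = 0.999) -> bool:
    """True if the global correlation and the worst per-row cosine pass."""
    flat_ref = [v for row in ref for v in row]
    flat_new = [v for row in new for v in row]
    worst_cos = min(cosine(r, n) for r, n in zip(ref, new))
    return pearson(flat_ref, flat_new) > min_corr and worst_cos > min_cos


# Rows play the role of residues, columns the embedding channels.
reference = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.5]]
close = [[v + 1e-6 for v in row] for row in reference]
scrambled = [list(reversed(row)) for row in reference]

close_ok = embeddings_match(reference, close)
scrambled_ok = embeddings_match(reference, scrambled)
```

Pairing a global statistic with a per-residue one guards against a case where the overall correlation is high but a handful of residues diverge.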

## License compliance
- Vendored AF2 files retain original DeepMind copyright headers
- Modified files carry Apache 2.0 Section 4(b) change notices
- ColabFold-derived files carry MIT attribution headers
- Full license texts in LICENSES/ and src/_vendor/alphafold/LICENSE

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sarahnlewis sarahnlewis force-pushed the sarahlewis/inline-colabfold branch from 5f8a7ca to aefe217 Compare March 22, 2026 21:31
@github-actions

Summary

Generated on: 03/22/2026 - 21:47:12
Parser: Cobertura
Assemblies: 4
Classes: 27
Files: 27
Line coverage: 85.5% (1836 of 2146)
Covered lines: 1836
Uncovered lines: 310
Coverable lines: 2146
Total lines: 6800
Covered branches: 0
Total branches: 0

Coverage

src.bioemu - 88.8%

| Name | Line | Branch |
|---|---|---|
| src.bioemu | 88.8% | |
| `__init__.py` | 100% | |
| chemgraph.py | 100% | |
| convert_chemgraph.py | 97% | |
| denoiser.py | 98.1% | |
| get_embeds.py | 80.3% | |
| md_utils.py | 85.8% | |
| model_utils.py | 78% | |
| models.py | 94.1% | |
| run_hpacker.py | 0% | |
| sample.py | 88.3% | |
| sde_lib.py | 86.6% | |
| seq_io.py | 100% | |
| shortcuts.py | 100% | |
| sidechain_relax.py | 77.2% | |
| so3_sde.py | 91.7% | |
| steering.py | 90.7% | |
| structure_module.py | 84.3% | |
| utils.py | 65.6% | |

src.bioemu.colabfold_inline - 62%

| Name | Line | Branch |
|---|---|---|
| src.bioemu.colabfold_inline | 62% | |
| `__init__.py` | | |
| features.py | 100% | |
| input_parsing.py | 100% | |
| model_runner.py | 49% | |
| msa_client.py | 60.8% | |

src.bioemu.hpacker_setup - 58.8%

| Name | Line | Branch |
|---|---|---|
| src.bioemu.hpacker_setup | 58.8% | |
| `__init__.py` | | |
| setup_hpacker.py | 58.8% | |

src.bioemu.training - 100%

| Name | Line | Branch |
|---|---|---|
| src.bioemu.training | 100% | |
| foldedness.py | 100% | |
| loss.py | 100% | |
@josejimenezluna josejimenezluna marked this pull request as ready for review March 30, 2026 09:12
@github-actions

Summary

Generated on: 03/30/2026 - 09:57:19
Parser: Cobertura
Line coverage: 85.5% (1836 of 2146), unchanged from the 03/22/2026 report; all per-file coverage figures are identical.

@josejimenezluna josejimenezluna merged commit 2aa054f into main Mar 30, 2026
7 checks passed
@josejimenezluna josejimenezluna deleted the sarahlewis/inline-colabfold branch March 30, 2026 14:36