Commit 3fdf09c

add huggingface/aido.cell
1 parent 1a21f66 commit 3fdf09c

17 files changed

Lines changed: 22463 additions & 11 deletions

.pre-commit-config.yaml

Lines changed: 7 additions & 10 deletions
@@ -4,23 +4,20 @@ repos:
     hooks:
       - id: ruff
         args: [--config, pyproject.toml, --fix]
+        exclude: ^huggingface/
       - id: ruff-format
         args: [--config, pyproject.toml]
+        exclude: ^huggingface/
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v5.0.0
     hooks:
       - id: trailing-whitespace
-        exclude: ^modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)
+        exclude: ^(modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)|huggingface/)
       - id: end-of-file-fixer
-        exclude: ^modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)
+        exclude: ^(modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)|huggingface/)
       - id: check-yaml
-        exclude: ^modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)
+        exclude: ^(modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)|huggingface/)
       - id: debug-statements
-        exclude: ^modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)
+        exclude: ^(modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)|huggingface/)
       - id: check-added-large-files
-        exclude: ^modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)
-  - repo: https://github.com/python-poetry/poetry
-    rev: 2.1.2
-    hooks:
-      - id: poetry-check
-      - id: poetry-lock
+        exclude: ^(modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)|huggingface/)
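pre-commit interprets `exclude` as a Python regular expression and tests it against each file path with `re.search`, so the widened pattern skips everything under `huggingface/` in addition to the listed `modelgenerator/` subpackages. A minimal sketch of that matching follows; the file paths are hypothetical and chosen only for illustration.

```python
import re

# Pattern copied from the updated pre-commit-hooks entries in this diff.
EXCLUDE = re.compile(
    r"^(modelgenerator/(huggingface_models|prot_inv_fold|rna_inv_fold|rna_ss|structure_tokenizer)"
    r"|huggingface/)"
)


def is_excluded(path: str) -> bool:
    """Return True if a hook using this exclude pattern would skip `path`."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, for illustration only.
paths = [
    "huggingface/aido.cell/embed.py",
    "modelgenerator/rna_ss/train.py",
    "modelgenerator/tasks.py",
    "docs/huggingface/notes.md",
]
for p in paths:
    print(f"{p}: {'skipped' if is_excluded(p) else 'checked'}")
```

Because the pattern is anchored with `^`, only a top-level `huggingface/` directory is skipped; a nested directory of the same name is still checked.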

huggingface/aido.cell/.gitignore

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# UV
+# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+uv.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# Pre-commit
+.ruff_cache

huggingface/aido.cell/README.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+# AIDO.Cell
+
+Standalone AIDO.Cell model repo using HuggingFace handles.
+
+## Installation
+
+```bash
+# Barebones installation
+pip install -e .
+
+# FlashAttention2 support
+pip install -e ".[flash_attn]"
+
+# PEFT/LoRA support
+pip install -e ".[peft]"
+
+# All optional dependencies
+pip install -e ".[flash_attn,peft]"
+```
+
+## Quickstart
+
+1. **Edit the configuration** in `embed.py`:
+
+```python
+# CONFIGURATION - Set these variables
+MODEL_NAME = "genbio-ai/AIDO.Cell-3M"  # Or "genbio-ai/AIDO.Cell-100M"
+INPUT_FILE = "temp_adata.h5ad"  # Path to your input file
+OUTPUT_FILE = None  # Auto-generates: input_embeddings.h5ad
+DEVICE = "cuda"  # "cuda" or "cpu"
+BATCH_SIZE = 32
+EMBEDDING_KEY = "X_aido_cell"
+```
+
+2. **Run the script**:
+
+```bash
+python embed.py
+```
+
+## Finetune with LoRA
+
+> **Note**: Fine-tuning requires the `peft` optional dependency. Install it with `pip install -e ".[peft]"`.
+
+1. **Edit the configuration** in `finetune.py`:
+
+```python
+# CONFIGURATION - Set these variables
+MODEL_NAME = "genbio-ai/AIDO.Cell-3M"  # HuggingFace model handle
+NUM_CLASSES = 5  # Number of classification classes
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+BATCH_SIZE = 8
+LEARNING_RATE = 1e-4
+NUM_EPOCHS = 5
+FREEZE_BACKBONE = False  # Set True to freeze AIDO.Cell weights (ignored if USE_LORA=True)
+
+# LoRA/PEFT Configuration
+USE_LORA = True  # Set True to use LoRA for parameter-efficient fine-tuning
+LORA_R = 8  # LoRA rank (higher = more parameters, default: 8)
+LORA_ALPHA = 16  # LoRA alpha (scaling factor, default: 16)
+LORA_DROPOUT = 0.1  # LoRA dropout
+LORA_TARGET_MODULES = ["query", "value"]  # Modules to apply LoRA (query, key, value, dense)
+```
+
+2. **Run the fine-tuning script**:
+
+```bash
+python finetune.py
+```
+
+3. **Load your model**
+
+After training, load your fine-tuned model:
+
+```python
+import torch
+
+from finetune import CellFoundationClassifier
+
+# MODEL_NAME, NUM_CLASSES, FREEZE_BACKBONE, USE_LORA and lora_config must be
+# defined first and must match the values used during training.
+model = CellFoundationClassifier(MODEL_NAME, NUM_CLASSES, FREEZE_BACKBONE, USE_LORA, lora_config)
+checkpoint = torch.load('best_model.pt')
+model.load_state_dict(checkpoint['model_state_dict'])
+```
+
+## Quirks
+
+AIDO.Cell was pre-trained on a fixed set of 19,264 genes using a read-depth-aware objective function.
+All inputs should be processed using the `aido_cell.utils.gene_alignment` and `aido_cell.utils.preprocessing` tools.
+
+1. Gene alignment
+    1. Removes genes in your data that aren't in AIDO.Cell's gene set
+    2. Adds zero-filled entries for genes in AIDO.Cell's set that are missing from your data
+    3. Reorders genes to match AIDO.Cell's expected order
+    4. Creates attention masks so the model knows which genes are actually present
+2. Preprocessing
+    1. Calculates log10 of total counts per cell (minimum 5) for depth tokens
+    2. Normalizes counts to log1p of counts per 10,000 (labeled "CPM" here, although CPM strictly means counts per million)
+    3. Appends two depth tokens (rawcountsidx, inputcountidx) to the sequence.
+       In pretraining these indicated the input and desired output depth, but in this script they are fixed to be equal.
+    4. Clips values at 20
+    5. Converts to bfloat16
+
+## Package Structure
+
+```
+aido.cell/
+├── embed.py                    # Embedding generation script
+├── finetune.py                 # Fine-tuning script with LoRA
+├── pyproject.toml              # Package configuration
+├── aido_cell/                  # Python package
+│   ├── __init__.py
+│   ├── models/                 # CellFoundation model implementations
+│   │   ├── __init__.py
+│   │   ├── configuration_cellfoundation.py
+│   │   ├── modeling_cellfoundation.py
+│   │   └── gene_lists/         # Reference gene set (19,264 genes)
+│   │       └── OS_scRNA_gene_index.19264.tsv
+│   └── utils/                  # Utility functions
+│       ├── __init__.py
+│       ├── gene_alignment.py   # Gene alignment utilities
+│       └── preprocessing.py    # Data normalization (log1p CPM + depth tokens)
+```
+
+## Available Models
+
+AIDO.Cell models on HuggingFace:
+- `genbio-ai/AIDO.Cell-3M`
+- `genbio-ai/AIDO.Cell-10M`
+- `genbio-ai/AIDO.Cell-100M`
+
+Check the [AIDO.Cell HuggingFace page](https://huggingface.co/genbio-ai) for the latest models.
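The README's `LORA_R` comment ("higher = more parameters") follows from how standard LoRA works: each adapted weight matrix of shape `d_out x d_in` gains two trainable low-rank factors, `B` (`d_out x r`) and `A` (`r x d_in`), and the update `B @ A` is scaled by `LORA_ALPHA / LORA_R`. A rough sketch of that arithmetic, assuming a hypothetical hidden size of 128 (the actual AIDO.Cell layer dimensions are not stated in this commit):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def lora_scaling(alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


HIDDEN = 128  # hypothetical hidden size, for illustration only
dense = HIDDEN * HIDDEN  # weights in one dense projection (bias ignored)

for r in (4, 8, 16):
    added = lora_trainable_params(HIDDEN, HIDDEN, r)
    print(f"r={r}: {added} LoRA params vs {dense} dense params, "
          f"scaling={lora_scaling(16, r)}")
```

With `LORA_R = 8` and `LORA_ALPHA = 16` as in the configuration above, the scaling factor is 2.0, and doubling the rank doubles the number of trainable adapter parameters per targeted module.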
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+"""AIDO.Cell: Standalone package for cell foundation models."""
+
+from aido_cell.models import CellFoundationModel, CellFoundationConfig
+from aido_cell.utils.gene_alignment import align_adata
+from aido_cell.utils.preprocessing import preprocess_counts
+
+__version__ = "0.1.0"
+
+__all__ = [
+    "CellFoundationModel",
+    "CellFoundationConfig",
+    "align_adata",
+    "preprocess_counts",
+]
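The package exports `align_adata` and `preprocess_counts`, whose intended behavior is described in the README's Quirks section. Their signatures are not shown in this part of the commit, so the following is only a toy, pure-Python sketch of the described steps for a single cell, not the package's implementation. The function names `align_genes` and `preprocess` are hypothetical. It assumes that the "minimum 5" floor applies to the log10 depth value, that total counts are computed after alignment, and it omits the bfloat16 conversion.

```python
import math


def align_genes(counts: dict[str, float], reference: list[str]) -> tuple[list[float], list[int]]:
    """Reorder one cell's counts to the reference gene order.

    Genes missing from the input are zero-filled; genes absent from the
    reference are dropped because they are never looked up. The mask holds
    1 where the gene was present in the input and 0 where it was filled in.
    """
    values = [float(counts.get(gene, 0.0)) for gene in reference]
    mask = [1 if gene in counts else 0 for gene in reference]
    return values, mask


def preprocess(values: list[float], target_sum: float = 10_000.0, clip: float = 20.0) -> list[float]:
    """Normalize to counts per 10,000, apply log1p, append two depth tokens, clip."""
    total = sum(values)
    # Assumption: the floor of 5 applies to the log10 depth value.
    depth = max(math.log10(total), 5.0) if total > 0 else 5.0
    scale = target_sum / total if total > 0 else 0.0
    normalized = [math.log1p(v * scale) for v in values]
    # Both depth tokens are equal, as the README states for this script.
    tokens = normalized + [depth, depth]
    return [min(t, clip) for t in tokens]


# Hypothetical three-gene reference and one cell with an extra gene "X".
reference = ["A", "B", "C"]
aligned, mask = align_genes({"B": 30, "A": 10, "X": 99}, reference)
tokens = preprocess(aligned)
print(aligned, mask, len(tokens))
```

The output sequence is two entries longer than the reference gene list because of the appended depth tokens.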
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+"""CellFoundation model implementations."""
+
+from aido_cell.models.configuration_cellfoundation import CellFoundationConfig
+from aido_cell.models.modeling_cellfoundation import CellFoundationModel
+
+__all__ = [
+    "CellFoundationConfig",
+    "CellFoundationModel",
+]
