# AIDO.Cell

A standalone repository for the AIDO.Cell models, loaded via their HuggingFace handles.

## Installation

```bash
# Barebones installation
pip install -e .

# FlashAttention2 support
pip install -e ".[flash_attn]"

# PEFT/LoRA support
pip install -e ".[peft]"

# All optional dependencies
pip install -e ".[flash_attn,peft]"
```

## Quickstart

1. **Edit the configuration** in `embed.py`:

```python
# CONFIGURATION - Set these variables
MODEL_NAME = "genbio-ai/AIDO.Cell-3M"  # Or "genbio-ai/AIDO.Cell-100M"
INPUT_FILE = "temp_adata.h5ad"         # Path to your input file
OUTPUT_FILE = None                     # None auto-generates <input>_embeddings.h5ad
DEVICE = "cuda"                        # "cuda" or "cpu"
BATCH_SIZE = 32
EMBEDDING_KEY = "X_aido_cell"
```

2. **Run the script**:

```bash
python embed.py
```
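Leaving `OUTPUT_FILE = None` makes the script derive the output name from `INPUT_FILE`. A minimal sketch of that derivation, assuming the `<stem>_embeddings.<suffix>` pattern described in the config comment (the helper name here is hypothetical; the actual logic lives in `embed.py`):

```python
from pathlib import Path

def derive_output_file(input_file: str) -> str:
    """Append `_embeddings` to the input file's stem, keeping its suffix."""
    p = Path(input_file)
    return str(p.with_name(f"{p.stem}_embeddings{p.suffix}"))

print(derive_output_file("temp_adata.h5ad"))  # temp_adata_embeddings.h5ad
```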

## Fine-tune with LoRA

> **Note**: Fine-tuning requires the `peft` optional dependency. Install it with `pip install -e ".[peft]"`.

1. **Edit the configuration** in `finetune.py`:

```python
# CONFIGURATION - Set these variables
MODEL_NAME = "genbio-ai/AIDO.Cell-3M"  # HuggingFace model handle
NUM_CLASSES = 5                        # Number of classification classes
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 8
LEARNING_RATE = 1e-4
NUM_EPOCHS = 5
FREEZE_BACKBONE = False  # Set True to freeze AIDO.Cell weights (ignored if USE_LORA=True)

# LoRA/PEFT configuration
USE_LORA = True      # Set True for parameter-efficient fine-tuning with LoRA
LORA_R = 8           # LoRA rank (higher = more trainable parameters; default: 8)
LORA_ALPHA = 16      # LoRA alpha (scaling factor; default: 16)
LORA_DROPOUT = 0.1   # LoRA dropout
LORA_TARGET_MODULES = ["query", "value"]  # Modules to apply LoRA to (query, key, value, dense)
```

2. **Run the fine-tuning script**:

```bash
python finetune.py
```

3. **Load your model**

After training, load your fine-tuned model (the configuration variables and `lora_config` are the ones defined in `finetune.py`):

```python
import torch

from finetune import CellFoundationClassifier

model = CellFoundationClassifier(MODEL_NAME, NUM_CLASSES, FREEZE_BACKBONE, USE_LORA, lora_config)
checkpoint = torch.load("best_model.pt")
model.load_state_dict(checkpoint["model_state_dict"])
```
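To see what `LORA_R` and `LORA_ALPHA` buy you: a rank-`r` adapter on a `(d_out, d_in)` weight adds `r * (d_in + d_out)` trainable parameters, and its output is scaled by `alpha / r`. A back-of-the-envelope sketch (the hidden size of 512 is illustrative, not taken from the model config):

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    # Down-projection A (r x d_in) plus up-projection B (d_out x r)
    return r * d_in + d_out * r

d = 512                 # hypothetical hidden size
full = d * d            # parameters in one frozen projection matrix
extra = lora_extra_params(d, d, r=8)
scaling = 16 / 8        # LORA_ALPHA / LORA_R, applied to the adapter output

print(extra, extra / full, scaling)  # 8192 0.03125 2.0
```

Doubling `LORA_R` doubles the adapter's parameter count; with `LORA_ALPHA` fixed, it also halves the scaling, which is why the two are usually tuned together.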

## Quirks

AIDO.Cell was pre-trained on a fixed set of 19,264 genes with a read-depth-aware objective, so all inputs should be processed with the `aido_cell.utils.gene_alignment` and `aido_cell.utils.preprocessing` tools.

1. Gene alignment
   1. Removes genes in your data that aren't in AIDO.Cell's gene set
   2. Adds zero-filled entries for genes in AIDO.Cell's set that are missing from your data
   3. Reorders genes to match AIDO.Cell's expected order
   4. Creates attention masks so the model knows which genes are actually present
2. Preprocessing
   1. Calculates log10 of total counts per cell (minimum 5) for the depth tokens
   2. Normalizes counts to log1p(CPM), where CPM here is counts per 10,000
   3. Appends two depth tokens (rawcountsidx, inputcountidx) to the sequence.
      In pretraining these indicated the input and desired output depth, but in this script they are fixed to be equal.
   4. Clips values at 20
   5. Converts to bfloat16
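The pipeline above can be sketched for a single cell with NumPy (a simplified illustration using a toy three-gene reference; the real utilities in `aido_cell.utils` operate on AnnData objects and the full 19,264-gene list):

```python
import numpy as np

# Toy reference gene order standing in for the 19,264-gene list
REF_GENES = ["GENE_A", "GENE_B", "GENE_C"]

def align_cell(counts: dict) -> tuple:
    """Zero-fill missing genes, drop unknown ones, reorder, and build a mask."""
    values = np.array([counts.get(g, 0.0) for g in REF_GENES])
    mask = np.array([g in counts for g in REF_GENES])  # which genes were observed
    return values, mask

def preprocess_cell(values: np.ndarray) -> np.ndarray:
    total = values.sum()
    depth = np.log10(max(total, 5.0))       # depth token: log10 of total counts, floored at 5
    x = np.log1p(values / total * 1e4)      # log1p of counts scaled to 10,000 per cell
    x = np.append(x, [depth, depth])        # rawcountsidx == inputcountidx in this script
    return np.clip(x, None, 20.0)           # clip at 20 (the model then casts to bfloat16)

values, mask = align_cell({"GENE_B": 30.0, "GENE_X": 5.0})  # GENE_X is not in the set: dropped
x = preprocess_cell(values)
```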

## Package Structure

```
aido.cell/
├── embed.py                    # Embedding generation script
├── finetune.py                 # Fine-tuning script with LoRA
├── pyproject.toml              # Package configuration
└── aido_cell/                  # Python package
    ├── __init__.py
    ├── models/                 # CellFoundation model implementations
    │   ├── __init__.py
    │   ├── configuration_cellfoundation.py
    │   ├── modeling_cellfoundation.py
    │   └── gene_lists/         # Reference gene set (19,264 genes)
    │       └── OS_scRNA_gene_index.19264.tsv
    └── utils/                  # Utility functions
        ├── __init__.py
        ├── gene_alignment.py   # Gene alignment utilities
        └── preprocessing.py    # Data normalization (log1p CPM + depth tokens)
```

## Available Models

AIDO.Cell models on HuggingFace:
- `genbio-ai/AIDO.Cell-3M`
- `genbio-ai/AIDO.Cell-10M`
- `genbio-ai/AIDO.Cell-100M`

Check the [AIDO.Cell HuggingFace page](https://huggingface.co/genbio-ai) for the latest models.