# Proper FP32 Baselines - Implementation Guide

**Date**: 2026-02-14
**Goal**: Match published results (~94% CIFAR-10, ~77% CIFAR-100) with a proper training recipe

## Changes Implemented

### Code Updates (Done)
- ✅ Added `mixup_alpha`, `label_smoothing`, `min_lr` to TrainConfig
- ✅ Implemented mixup augmentation in `train_epoch()`
- ✅ Added label smoothing to CrossEntropyLoss
- ✅ Added `min_lr` to the cosine scheduler
- ✅ Added a `--student-is-fp32` flag to train_kd.py for the FP32+KD control
- ✅ Updated KDLoss to accept label_smoothing

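The mixup step added to `train_epoch()` follows the standard formulation: draw a weight from Beta(alpha, alpha), blend each input with a shuffled partner, and split the loss across both labels. A framework-free sketch (the real implementation operates on batched tensors; `mixup_batch` and its return convention are illustrative, not the actual code):

```python
import random

def mixup_batch(xs, ys, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the batch.

    Returns mixed inputs, both label lists, and the mixing weight lam, so the
    caller can form: lam * CE(pred, ya) + (1 - lam) * CE(pred, yb).
    """
    # lam ~ Beta(alpha, alpha); alpha=0.2 yields mixes mostly near 0 or 1
    lam = random.betavariate(alpha, alpha) if alpha > 0 else 1.0
    perm = list(range(len(xs)))
    random.shuffle(perm)
    mixed = [[lam * a + (1 - lam) * b for a, b in zip(x, xs[j])]
             for x, j in zip(xs, perm)]
    return mixed, ys, [ys[j] for j in perm], lam
```
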
---

## Proper Training Recipe

**Strong recipe (to match published ~94%/~77%)**:
```
--epochs 300
--warmup-epochs 5
--min-lr 1e-5
--mixup-alpha 0.2
--label-smoothing 0.1
```
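These flags imply linear warmup followed by cosine decay with a floor. Spelled out, assuming a base LR of 0.1 (a placeholder; the actual base LR comes from TrainConfig):

```python
import math

def lr_at(epoch, base_lr=0.1, epochs=300, warmup_epochs=5, min_lr=1e-5):
    """Per-epoch LR: linear warmup, then cosine decay from base_lr to min_lr."""
    if epoch < warmup_epochs:
        # ramp up to base_lr over the warmup epochs
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (epochs - warmup_epochs)  # 0 -> ~1
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```
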

**vs. old weak recipe (current 88.88%/62.40%)**:
```
--epochs 200
(no warmup, mixup, or label smoothing)
```

---

## Step 1: Stop All Running Experiments

On the lambda server:
```bash
ssh lambda
cd ~/code/lab-strange-loop/bitnet

# Check what's running ([p] trick keeps grep itself out of the output)
ps aux | grep "[p]ython" | grep train

# Kill all training
pkill -f "python -m experiments"

# Verify nothing is running
ps aux | grep "[p]ython" | grep train
```

---

## Step 2: Backup and Clean Results

```bash
# Archive current results (just in case)
mkdir -p ~/backups
tar -czf ~/backups/results_backup_feb14_$(date +%H%M).tar.gz results/

# Delete all results
rm -rf results/
mkdir -p results/raw results/processed

# Verify clean slate
ls -la results/
```

---

## Step 3: Define Proper Baseline Experiments

### Architectures & Datasets to Cover

**Priority 1** (core paper results):
- ResNet-18: CIFAR-10, CIFAR-100, Tiny ImageNet
- ResNet-50: CIFAR-10, CIFAR-100, Tiny ImageNet

**Priority 2** (architecture extension):
- MobileNetV2: CIFAR-10, CIFAR-100, Tiny ImageNet
- EfficientNet-B0: CIFAR-10, CIFAR-100, Tiny ImageNet
- ConvNeXt-Tiny: CIFAR-10, CIFAR-100, Tiny ImageNet

### Seeds
All experiments: seeds 42, 123, 456 (3 seeds for statistical testing)
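With three seeds per configuration, results can be reported as mean ± sample standard deviation; a tiny helper (the accuracies in the usage line are made up):

```python
import statistics

def summarize(accs):
    """Mean and sample standard deviation across seed runs."""
    return statistics.mean(accs), statistics.stdev(accs)

mean, std = summarize([94.1, 93.8, 94.3])  # hypothetical seed accuracies
```
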

---

## Step 4: Generate Experiment Commands

### 4a. Priority 1: ResNet Baselines

**FP32 baselines** (strong recipe, 300 epochs):
```bash
#!/bin/bash
# save as: scripts/run_fp32_baselines_resnet.sh

DATASETS="cifar10 cifar100 tiny-imagenet"
MODELS="resnet18 resnet50"
SEEDS="42 123 456"

for model in $MODELS; do
  for dataset in $DATASETS; do
    for seed in $SEEDS; do
      uv run python -m experiments.train \
        --model $model --dataset $dataset \
        --epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
        --mixup-alpha 0.2 --label-smoothing 0.1 \
        --seed $seed &
    done
    wait  # Wait for all 3 seeds to finish before the next dataset
  done
done
```

**Cost**: 2 models × 3 datasets × 3 seeds = 18 experiments × ~5 hours = **90 GPU hours**
**Parallelization**: Run all 3 seeds in parallel on 2 GPUs → ~45 hours wall-clock

---

### 4b. FP32+KD Control (after FP32 baselines complete)

**Critical control experiment** (uses an FP32 teacher, trains an FP32 student with KD):
```bash
#!/bin/bash
# save as: scripts/run_fp32_kd_control.sh

DATASETS="cifar10 cifar100 tiny-imagenet"
SEEDS="42 123 456"

for dataset in $DATASETS; do
  # Use the seed-42 teacher (best from phase 4a)
  TEACHER_PATH="results/raw/${dataset}/resnet18/std_s42/best_model.pth"

  for seed in $SEEDS; do
    uv run python -m experiments.train_kd \
      --model resnet18 --dataset $dataset \
      --teacher-path $TEACHER_PATH \
      --student-is-fp32 \
      --epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
      --seed $seed &
  done
  wait
done
```

**Cost**: 3 datasets × 3 seeds = 9 experiments × ~5 hours = **45 GPU hours**
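The objective train_kd.py optimizes is not spelled out in this guide; presumably it is the usual Hinton-style KD loss with label smoothing on the hard-label term. A plain-Python, per-sample sketch (temperature `T=4.0`, weight `alpha=0.9`, and this exact form of KDLoss are assumptions, not read from the code):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9,
            label_smoothing=0.1):
    """alpha * T^2 * KL(teacher_T || student_T) + (1-alpha) * smoothed CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # label-smoothed CE: (1 - eps) on the true class, eps spread uniformly
    n = len(student_logits)
    q = [label_smoothing / n + (1 - label_smoothing) * (1 if i == label else 0)
         for i in range(n)]
    log_p = [math.log(p) for p in softmax(student_logits)]
    ce = -sum(qi * lp for qi, lp in zip(q, log_p))
    return alpha * T * T * kl + (1 - alpha) * ce
```
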

---

### 4c. BitNet Baselines (with strong recipe)

**BitNet with the matched strong recipe**:
```bash
#!/bin/bash
# save as: scripts/run_bitnet_baselines.sh

DATASETS="cifar10 cifar100 tiny-imagenet"
MODELS="resnet18 resnet50"
SEEDS="42 123 456"

for model in $MODELS; do
  for dataset in $DATASETS; do
    for seed in $SEEDS; do
      uv run python -m experiments.train \
        --model $model --dataset $dataset --bit-version \
        --epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
        --mixup-alpha 0.2 --label-smoothing 0.1 \
        --seed $seed &
    done
    wait
  done
done
```

**Cost**: 2 models × 3 datasets × 3 seeds = 18 experiments × ~5 hours = **90 GPU hours**

---

### 4d. BitNet + Recipe (KD + conv1)

**Full recipe with strong training**:
```bash
#!/bin/bash
# save as: scripts/run_bitnet_recipe.sh

DATASETS="cifar10 cifar100 tiny-imagenet"
MODELS="resnet18 resnet50"
SEEDS="42 123 456"

for model in $MODELS; do
  for dataset in $DATASETS; do
    # Use the seed-42 teacher
    TEACHER_PATH="results/raw/${dataset}/${model}/std_s42/best_model.pth"

    for seed in $SEEDS; do
      uv run python -m experiments.train_kd \
        --model $model --dataset $dataset \
        --teacher-path $TEACHER_PATH \
        --ablation keep_conv1 \
        --epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
        --seed $seed &
    done
    wait
  done
done
```

**Cost**: 2 models × 3 datasets × 3 seeds = 18 experiments × ~5 hours = **90 GPU hours**

---

## Step 5: Execution Plan (2 GPUs in Parallel)

### Phase 1: FP32 Baselines (Days 1-2)
**Run**: scripts/run_fp32_baselines_resnet.sh
**Time**: ~45 hours wall-clock (2 GPUs, 3 seeds in parallel)
**Result**: 18 FP32 baselines with the strong recipe

### Phase 2: FP32+KD Control (Day 3)
**Run**: scripts/run_fp32_kd_control.sh
**Time**: ~23 hours wall-clock
**Result**: the critical control experiment (9 runs)

### Phase 3: BitNet Baselines (Days 4-5)
**Run**: scripts/run_bitnet_baselines.sh
**Time**: ~45 hours wall-clock
**Result**: 18 BitNet baselines with the strong recipe

### Phase 4: BitNet + Recipe (Days 6-7)
**Run**: scripts/run_bitnet_recipe.sh
**Time**: ~45 hours wall-clock
**Result**: 18 recipe experiments

**Total wall-clock**: ~7 days with 2 GPUs running continuously

---

## Step 6: Parallel Execution Strategy (Faster)

To finish faster, split the work across the 2 GPUs by model:

**GPU 0**: ResNet-18 experiments
**GPU 1**: ResNet-50 experiments

Modified commands:
```bash
# GPU 0 - ResNet-18
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train \
  --model resnet18 --dataset cifar10 \
  --epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
  --mixup-alpha 0.2 --label-smoothing 0.1 \
  --seed 42

# GPU 1 - ResNet-50
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train \
  --model resnet50 --dataset cifar10 \
  --epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
  --mixup-alpha 0.2 --label-smoothing 0.1 \
  --seed 42
```

With this parallelization:
- Phases 1-4 can finish in **~3.5 days** instead of 7 (note: this requires two concurrent runs per GPU; with one run per GPU, 315 GPU hours on 2 GPUs is a ~6.5-day floor)

---

## Step 7: Priority 2 Architectures (Optional)

After Phases 1-4 complete and the results are verified, optionally run:

**MobileNetV2, EfficientNet-B0, ConvNeXt-Tiny** (same 4-phase structure)
**Note**: These need architecture-specific learning rates:
- MobileNetV2: `--lr 0.01`
- EfficientNet-B0: `--lr 0.01`
- ConvNeXt-Tiny: `--optimizer adamw --lr 0.004`

**Cost**: 3 models × 3 datasets × 3 seeds × 4 phases = 108 experiments × ~5 hours = **~540 GPU hours**

---

## Step 8: Verification

After Phase 1 (FP32 baselines) completes:

```bash
# Check the ResNet-18/CIFAR-10 result
grep final_test_acc results/raw/cifar10/resnet18/std_s42/results.json

# Expected: ~94% (currently 88.88%)
```

After Phase 2 (FP32+KD):
```bash
# Check whether FP32+KD exceeds FP32
grep final_test_acc results/raw/cifar10/resnet18/fp32_kd_s42/results.json

# Expected: +1-2% over the FP32 baseline
```
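Grepping JSON works for a spot check, but parsing it is less brittle; a small helper (the key name `final_test_acc` is taken from the results files above; the rest of the schema is unknown):

```python
import json

def read_final_acc(path):
    """Read the final_test_acc value from a run's results.json."""
    with open(path) as f:
        return json.load(f)["final_test_acc"]
```

Usage: `read_final_acc("results/raw/cifar10/resnet18/std_s42/results.json")`.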

---

## Step 9: Analysis & Decision

After all phases complete, analyze:

```python
# First run the aggregation from the shell; aggregate_results.py loads all
# new experiments into results/processed/:
#   uv run python -m analysis.aggregate_results
import pandas as pd

df = pd.read_csv("results/processed/aggregated.csv")

# Compare old vs. new FP32 baselines.
# NOTE: Step 2 deleted results/, so the old 200-epoch runs only exist in the
# backup archive; restore them (or record their numbers) before relying on this.
old_fp32 = df[(df["model"] == "resnet18") & (df["dataset"] == "cifar100") & (df["version"] == "std") & (df["epochs"] == 200)]
print(f"Old FP32 CIFAR-100: {old_fp32['final_test_acc'].mean():.2f}%")  # Should be 62.40%

new_fp32 = df[(df["model"] == "resnet18") & (df["dataset"] == "cifar100") & (df["version"] == "std") & (df["epochs"] == 300)]
print(f"New FP32 CIFAR-100: {new_fp32['final_test_acc'].mean():.2f}%")  # Target ~77%

# Check whether the recipe still exceeds FP32
```

---

## Expected Outcomes

### Scenario A: Recipe still exceeds strong FP32 ✅
- The paper becomes much stronger
- "Recipe achieves X% with proper training, matching/exceeding well-trained FP32"
- **Acceptance probability: 85-90%**

### Scenario B: Recipe closes the gap but doesn't exceed ⚠️
- The paper is honest and defensible
- "Recipe recovers X% of the gap; augmentation asymmetry explains the training dynamics"
- **Acceptance probability: 75-80%**

### Scenario C: FP32+KD exceeds BitNet+recipe ⚠️
- Must reframe the contribution as "understanding training dynamics," not "deployment"
- **Acceptance probability: 70-75%** (weaker but honest)

---

## Quick Start

1. **Stop experiments**: `pkill -f "python -m experiments"`
2. **Clean results**: `rm -rf results/; mkdir -p results/raw results/processed`
3. **Create scripts**: Copy the Phase 1-4 scripts above into the `scripts/` directory
4. **Run Phase 1**: `bash scripts/run_fp32_baselines_resnet.sh`
5. **Monitor**: `watch -n 10 'ls results/raw/*/resnet18/ | wc -l'`
6. **Verify**: Check the first results after ~5 hours
7. **Continue**: Launch Phases 2-4 sequentially

---

## Notes

- All experiments use the same random seeds (42, 123, 456) for fair comparison
- Warmup is critical for CIFAR-100 (it prevents early divergence)
- Mixup and label smoothing together improve robustness
- The min-LR floor keeps the cosine schedule from collapsing to zero at the end
- The FP32+KD control is *the* most important experiment (all 3 reviewers flagged it)

**Expected total time**: ~3.5-7 days depending on the parallelization strategy
**Expected total GPU hours**: ~315 for Priority 1 (ResNet only)