Skip to content

Commit 1ad3b93

Browse files
dariocazzaniclaude
andcommitted
Implement proper training recipe for baseline fix
- Add mixup augmentation support to training loops - Add label smoothing to loss functions - Add warmup schedule with configurable min_lr - Add --student-is-fp32 flag for FP32+KD control experiment - Update TrainConfig dataclass with recipe parameters - Create Phase 1-4 experiment command lists - Target: match published ~94% CIFAR-10, ~77% CIFAR-100 Addresses TMLR Round 0 reviewer feedback: - Reviewer 1: FP32+KD control experiment (isolate KD vs quantization) - Reviewer 2: Strengthen weak baselines with proper recipe - Reviewer 3: Fair comparison with matched training recipes Changes: - experiments/config.py: Add min_lr, mixup_alpha, label_smoothing fields - experiments/train.py: Expose recipe params via CLI, use in training loop - experiments/train_kd.py: Add --student-is-fp32 for FP32+KD control - experiments/training/loops.py: Implement mixup_data(), mixup_criterion(), update scheduler - experiments/training/kd_loss.py: Add label_smoothing parameter - PROPER_BASELINE_COMMANDS.sh: Command lists for 63 experiments across 4 phases - BASELINE_RECIPE.md: Comprehensive technical implementation guide - START_HERE.md: Step-by-step action plan for user - PLAN.md: Update with Round 0 fixes and acceptance strategy - scripts/phase*.sh: Reference phase scripts (user will use command lists) - paper/reviews/: Add Round 0 TMLR review files and synthesis Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 57c4dcc commit 1ad3b93

25 files changed

Lines changed: 3019 additions & 988 deletions

BASELINE_RECIPE.md

Lines changed: 367 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,367 @@
1+
# Proper FP32 Baselines - Implementation Guide
2+
3+
**Date**: 2026-02-14
4+
**Goal**: Match published results (~94% CIFAR-10, ~77% CIFAR-100) with proper training recipe
5+
6+
## Changes Implemented
7+
8+
### Code Updates (Done)
9+
- ✅ Added `mixup_alpha`, `label_smoothing`, `min_lr` to TrainConfig
10+
- ✅ Implemented mixup augmentation in `train_epoch()`
11+
- ✅ Added label smoothing to CrossEntropyLoss
12+
- ✅ Added `min_lr` to cosine scheduler
13+
- ✅ Added `--student-is-fp32` flag to train_kd.py for FP32+KD control
14+
- ✅ Updated KDLoss to accept label_smoothing
15+
16+
---
17+
18+
## Proper Training Recipe
19+
20+
**Strong recipe (to match published ~94%/~77%)**:
21+
```
22+
--epochs 300
23+
--warmup-epochs 5
24+
--min-lr 1e-5
25+
--mixup-alpha 0.2
26+
--label-smoothing 0.1
27+
```
28+
29+
**vs. Old weak recipe (current 88.88%/62.40%)**:
30+
```
31+
--epochs 200
32+
(no warmup, mixup, or label smoothing)
33+
```
34+
35+
---
36+
37+
## Step 1: Stop All Running Experiments
38+
39+
On lambda server:
40+
```bash
41+
ssh lambda
42+
cd ~/code/lab-strange-loop/bitnet
43+
44+
# Check what's running
45+
ps aux | grep python | grep train
46+
47+
# Kill all training
48+
pkill -f "python -m experiments"
49+
50+
# Verify nothing running
51+
ps aux | grep python | grep train
52+
```
53+
54+
---
55+
56+
## Step 2: Backup and Clean Results
57+
58+
```bash
59+
# Archive current results (just in case)
60+
tar -czf ~/backups/results_backup_feb14_$(date +%H%M).tar.gz results/
61+
62+
# Delete all results
63+
rm -rf results/
64+
mkdir -p results/raw results/processed
65+
66+
# Verify clean slate
67+
ls -la results/
68+
```
69+
70+
---
71+
72+
## Step 3: Define Proper Baseline Experiments
73+
74+
### Architectures & Datasets to Cover
75+
76+
**Priority 1** (Core paper results):
77+
- ResNet-18: CIFAR-10, CIFAR-100, Tiny ImageNet
78+
- ResNet-50: CIFAR-10, CIFAR-100, Tiny ImageNet
79+
80+
**Priority 2** (Architecture extension):
81+
- MobileNetV2: CIFAR-10, CIFAR-100, Tiny ImageNet
82+
- EfficientNet-B0: CIFAR-10, CIFAR-100, Tiny ImageNet
83+
- ConvNeXt-Tiny: CIFAR-10, CIFAR-100, Tiny ImageNet
84+
85+
### Seeds
86+
All experiments: seeds 42, 123, 456 (3 seeds for statistical testing)
87+
88+
---
89+
90+
## Step 4: Generate Experiment Commands
91+
92+
### 4a. Priority 1: ResNet Baselines
93+
94+
**FP32 baselines** (strong recipe, 300 epochs):
95+
```bash
96+
#!/bin/bash
97+
# save as: scripts/run_fp32_baselines_resnet.sh
98+
99+
DATASETS="cifar10 cifar100 tiny-imagenet"
100+
MODELS="resnet18 resnet50"
101+
SEEDS="42 123 456"
102+
103+
for model in $MODELS; do
104+
for dataset in $DATASETS; do
105+
for seed in $SEEDS; do
106+
uv run python -m experiments.train \
107+
--model $model --dataset $dataset \
108+
--epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
109+
--mixup-alpha 0.2 --label-smoothing 0.1 \
110+
--seed $seed &
111+
done
112+
wait # Wait for 3 seeds to finish before next dataset
113+
done
114+
done
115+
```
116+
117+
**Cost**: 2 models × 3 datasets × 3 seeds = 18 experiments × ~5 hours = **90 GPU hours**
118+
**Parallelization**: Run all 3 seeds in parallel on 2 GPUs → ~45 hours wall-clock
119+
120+
---
121+
122+
### 4b. FP32+KD Control (after FP32 baselines complete)
123+
124+
**Critical control experiment** (uses FP32 teacher, trains FP32 student with KD):
125+
```bash
126+
#!/bin/bash
127+
# save as: scripts/run_fp32_kd_control.sh
128+
129+
DATASETS="cifar10 cifar100 tiny-imagenet"
130+
SEEDS="42 123 456"
131+
132+
for dataset in $DATASETS; do
133+
# Use seed 42 teacher (best from phase 4a)
134+
TEACHER_PATH="results/raw/${dataset}/resnet18/std_s42/best_model.pth"
135+
136+
for seed in $SEEDS; do
137+
uv run python -m experiments.train_kd \
138+
--model resnet18 --dataset $dataset \
139+
--teacher-path $TEACHER_PATH \
140+
--student-is-fp32 \
141+
--epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
142+
--seed $seed &
143+
done
144+
wait
145+
done
146+
```
147+
148+
**Cost**: 3 datasets × 3 seeds = 9 experiments × ~5 hours = **45 GPU hours**
149+
150+
---
151+
152+
### 4c. BitNet Baselines (with strong recipe)
153+
154+
**BitNet with matched strong recipe**:
155+
```bash
156+
#!/bin/bash
157+
# save as: scripts/run_bitnet_baselines.sh
158+
159+
DATASETS="cifar10 cifar100 tiny-imagenet"
160+
MODELS="resnet18 resnet50"
161+
SEEDS="42 123 456"
162+
163+
for model in $MODELS; do
164+
for dataset in $DATASETS; do
165+
for seed in $SEEDS; do
166+
uv run python -m experiments.train \
167+
--model $model --dataset $dataset --bit-version \
168+
--epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
169+
--mixup-alpha 0.2 --label-smoothing 0.1 \
170+
--seed $seed &
171+
done
172+
wait
173+
done
174+
done
175+
```
176+
177+
**Cost**: 2 models × 3 datasets × 3 seeds = 18 experiments × ~5 hours = **90 GPU hours**
178+
179+
---
180+
181+
### 4d. BitNet + Recipe (KD + conv1)
182+
183+
**Full recipe with strong training**:
184+
```bash
185+
#!/bin/bash
186+
# save as: scripts/run_bitnet_recipe.sh
187+
188+
DATASETS="cifar10 cifar100 tiny-imagenet"
189+
MODELS="resnet18 resnet50"
190+
SEEDS="42 123 456"
191+
192+
for model in $MODELS; do
193+
for dataset in $DATASETS; do
194+
# Use seed 42 teacher
195+
TEACHER_PATH="results/raw/${dataset}/${model}/std_s42/best_model.pth"
196+
197+
for seed in $SEEDS; do
198+
uv run python -m experiments.train_kd \
199+
--model $model --dataset $dataset \
200+
--teacher-path $TEACHER_PATH \
201+
--ablation keep_conv1 \
202+
--epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
203+
--seed $seed &
204+
done
205+
wait
206+
done
207+
done
208+
```
209+
210+
**Cost**: 2 models × 3 datasets × 3 seeds = 18 experiments × ~5 hours = **90 GPU hours**
211+
212+
---
213+
214+
## Step 5: Execution Plan (2 GPUs in Parallel)
215+
216+
### Phase 1: FP32 Baselines (Day 1-2)
217+
**Run**: scripts/run_fp32_baselines_resnet.sh
218+
**Time**: ~45 hours wall-clock (2 GPUs, 3 seeds parallel)
219+
**Result**: 18 FP32 baselines with strong recipe
220+
221+
### Phase 2: FP32+KD Control (Day 3)
222+
**Run**: scripts/run_fp32_kd_control.sh
223+
**Time**: ~23 hours wall-clock
224+
**Result**: Critical control experiment (9 runs)
225+
226+
### Phase 3: BitNet Baselines (Day 4-5)
227+
**Run**: scripts/run_bitnet_baselines.sh
228+
**Time**: ~45 hours wall-clock
229+
**Result**: 18 BitNet baselines with strong recipe
230+
231+
### Phase 4: BitNet + Recipe (Day 6-7)
232+
**Run**: scripts/run_bitnet_recipe.sh
233+
**Time**: ~45 hours wall-clock
234+
**Result**: 18 recipe experiments
235+
236+
**Total wall-clock**: ~7 days with 2 GPUs running continuously
237+
238+
---
239+
240+
## Step 6: Parallel Execution Strategy (Faster)
241+
242+
If you want to finish faster, split across 2 GPUs:
243+
244+
**GPU 0**: ResNet-18 experiments
245+
**GPU 1**: ResNet-50 experiments
246+
247+
Modified scripts:
248+
```bash
249+
# GPU 0 - ResNet-18
250+
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train \
251+
--model resnet18 --dataset cifar10 \
252+
--epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
253+
--mixup-alpha 0.2 --label-smoothing 0.1 \
254+
--seed 42
255+
256+
# GPU 1 - ResNet-50
257+
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train \
258+
--model resnet50 --dataset cifar10 \
259+
--epochs 300 --warmup-epochs 5 --min-lr 1e-5 \
260+
--mixup-alpha 0.2 --label-smoothing 0.1 \
261+
--seed 42
262+
```
263+
264+
With this parallelization:
265+
- Phase 1-4 can run in **~3.5 days** instead of 7
266+
267+
---
268+
269+
## Step 7: Priority 2 Architectures (Optional)
270+
271+
After Phase 1-4 complete and you verify results, optionally run:
272+
273+
**MobileNetV2, EfficientNet, ConvNeXt** (same 4-phase structure)
274+
**Note**: These need architecture-specific learning rates:
275+
- MobileNetV2: `--lr 0.01`
276+
- EfficientNet: `--lr 0.01`
277+
- ConvNeXt: `--optimizer adamw --lr 0.004`
278+
279+
**Cost**: 3 models × 3 datasets × 3 seeds × 4 phases = **~270 GPU hours**
280+
281+
---
282+
283+
## Step 8: Verification
284+
285+
After Phase 1 (FP32 baselines) complete:
286+
287+
```bash
288+
# Check ResNet-18/CIFAR-10 result
289+
cat results/raw/cifar10/resnet18/std_s42/results.json | grep final_test_acc
290+
291+
# Expected: ~94% (currently 88.88%)
292+
```
293+
294+
After Phase 2 (FP32+KD):
295+
```bash
296+
# Check if FP32+KD exceeds FP32
297+
cat results/raw/cifar10/resnet18/fp32_kd_s42/results.json | grep final_test_acc
298+
299+
# Expected: +1-2% over FP32 baseline
300+
```
301+
302+
---
303+
304+
## Step 9: Analysis & Decision
305+
306+
After all phases complete, analyze:
307+
308+
```python
309+
# aggregate_results.py will load all new experiments
310+
uv run python -m analysis.aggregate_results
311+
312+
# Check key comparisons
313+
import pandas as pd
314+
df = pd.read_csv("results/processed/aggregated.csv")
315+
316+
# Compare old vs new FP32 baselines
317+
old_fp32 = df[(df["model"] == "resnet18") & (df["dataset"] == "cifar100") & (df["version"] == "std")]
318+
print(f"Old FP32 CIFAR-100: {old_fp32['final_test_acc'].mean():.2f}%") # Should be 62.40%
319+
320+
new_fp32 = df[(df["model"] == "resnet18") & (df["dataset"] == "cifar100") & (df["version"] == "std") & (df["epochs"] == 300)]
321+
print(f"New FP32 CIFAR-100: {new_fp32['final_test_acc'].mean():.2f}%") # Target ~77%
322+
323+
# Check if recipe still exceeds FP32
324+
```
325+
326+
---
327+
328+
## Expected Outcomes
329+
330+
### Scenario A: Recipe still exceeds strong FP32 ✅
331+
- Paper becomes MUCH stronger
332+
- "Recipe achieves X% with proper training, matching/exceeding well-trained FP32"
333+
- **Acceptance probability: 85-90%**
334+
335+
### Scenario B: Recipe closes gap but doesn't exceed ⚠️
336+
- Paper is honest and defensible
337+
- "Recipe recovers X% of gap; augmentation asymmetry explains training dynamics"
338+
- **Acceptance probability: 75-80%**
339+
340+
### Scenario C: FP32+KD exceeds BitNet+recipe ⚠️
341+
- Must reframe contribution as "training dynamics understanding" not "deployment"
342+
- **Acceptance probability: 70-75%** (weaker but honest)
343+
344+
---
345+
346+
## Quick Start
347+
348+
1. **Stop experiments**: `pkill -f "python -m experiments"`
349+
2. **Clean results**: `rm -rf results/; mkdir -p results/raw results/processed`
350+
3. **Create scripts**: Copy Phase 1-4 scripts above to `scripts/` directory
351+
4. **Run Phase 1**: `bash scripts/run_fp32_baselines_resnet.sh`
352+
5. **Monitor**: `watch -n 10 'ls results/raw/*/resnet18/ | wc -l'`
353+
6. **Verify**: Check first results after ~5 hours
354+
7. **Continue**: Launch Phase 2-4 sequentially
355+
356+
---
357+
358+
## Notes
359+
360+
- All experiments use same random seeds (42, 123, 456) for fair comparison
361+
- Warmup is critical for CIFAR-100 (prevents early divergence)
362+
- Mixup + label smoothing together improve robustness
363+
- Min LR prevents complete collapse at end of cosine schedule
364+
- FP32+KD control is THE most important experiment (all 3 reviewers flagged this)
365+
366+
**Expected total time**: ~3.5-7 days depending on parallelization strategy
367+
**Expected total GPU hours**: ~315 hours for Priority 1 (ResNet only)

0 commit comments

Comments
 (0)