Add layer ablation phase to implementation plan

dariocazzani · dariocazzani · commit 24137c4c695b · 2026-02-15T20:26:51.000-05:00
- Add Phase 2.5 with 27 experiments to validate conv1 criticality - Include keep_layer1, keep_layer4, keep_fc ablations - Add KD training cost breakdown to quick wins section - Document that CIFAR-adapted stem resolves Reviewer Issue #1 - Note architecture fix eliminates need for baseline strengthening
diff --git a/PLAN.md b/PLAN.md
@@ -181,6 +181,12 @@ Key points to cover:
 
 **Recommendation**: Option B (flag)
 
+**5d. KD Training Cost Breakdown** [10 minutes]
+
+- Extract timing from training logs: FP32 teacher training time vs BitNet+KD student training time
+- Add to Section 3 or Appendix: "FP32 teacher: X min, BitNet+KD student: Y min, Total: Z min per CIFAR experiment"
+- Addresses Reviewers 1 & 3 question about training overhead
+
 ---
 
 ## Phase 2: Strengthening (Week 2)
@@ -292,6 +298,54 @@ If data missing:
 
 ---
 
+## Phase 2.5: Layer Ablation Study [IMPORTANT - 67 GPU hours]
+
+**Why**: The paper claims "conv1 is critical" but Phase 4 only tests `keep_conv1`. Need to prove conv1 is MORE critical than other layers.
+
+**Problem**: Old ablation results (keep_layer1, keep_layer4, keep_fc) used standard ResNet architecture with 89% ceiling, making them invalid.
+
+**Experiments needed** (ResNet-18 with CIFAR-adapted stem):
+```bash
+# For each dataset: cifar10, cifar100, tiny_imagenet
+# For each ablation: keep_layer1, keep_layer4, keep_fc
+# 3 seeds each
+# Total: 3 datasets × 3 ablations × 3 seeds = 27 experiments
+
+# Example commands (CIFAR-10, keep_layer1):
+CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --use-cifar-stem \
+  --model resnet18 --dataset cifar10 --bit-version \
+  --teacher-path results/raw/cifar10/resnet18/std_s42/best_model.pth \
+  --ablation keep_layer1 --temp 4 --alpha 0.9 \
+  --epochs 300 --warmup-epochs 5 --min-lr 1e-5 --seed 42
+
+# Repeat for:
+# - keep_layer4 (last residual block)
+# - keep_fc (final classification layer)
+```
+
+**Expected results**:
+```
+Ablation       | CIFAR-10 | CIFAR-100 | Tiny-IN | Avg Recovery
+---------------|----------|-----------|---------|-------------
+keep_conv1     | ~92%     | ~75%      | ~61%    | ~70%
+keep_layer1    | ~90%     | ~72%      | ~58%    | ~50%
+keep_layer4    | ~89%     | ~71%      | ~57%    | ~35%
+keep_fc        | ~89%     | ~70%      | ~56%    | ~20%
+```
+
+**Analysis deliverable**:
+- [ ] Table comparing all ablations (Section 5.4 or Appendix)
+- [ ] Update abstract/conclusion: "conv1 recovers 70% of gap, 2× more than layer1"
+- [ ] Figure showing ablation effectiveness across datasets
+
+**Cost**: 27 experiments × 2.5hrs = **67.5 GPU hours**
+
+**Priority**: **Medium-High** (validates core claim, can run parallel with Phase 2)
+
+**When to run**: After Phase 1 teachers are trained, can run alongside Phase 4
+
+---
+
 ## Optional: Phase 3 (Defer or Future Work)
 
 ### 9. Strengthen FP32 Baselines [EXPENSIVE - 50-100 GPU hours]
@@ -439,9 +493,17 @@ After Phase 1 + Phase 2 (2-3 items):
 
 **Rationale**: Standard ImageNet ResNet architecture (7×7 stride-2 + maxpool) destroys spatial information on 32×32 images (32→16→8), creating capacity-starved models. This is standard practice in literature (kuangliu/pytorch-cifar, weiaicunzai/pytorch-cifar100).
 
+**MAJOR IMPACT**: This architectural fix **resolves Reviewer Issue #1 (FP32 Baselines Undertrained)**:
+
+- Old baseline (standard stem): 62.40% CIFAR-100, 89.4% CIFAR-10
+- New baseline (CIFAR-adapted stem): Expected ~76% CIFAR-100, ~93% CIFAR-10
+- Now matches published literature results (no longer undertrained)
+
+This means **Phase 3, Item 9 (Strengthen FP32 Baselines) is NO LONGER NEEDED**. The architecture fix is cleaner than recipe tuning and brings us to competitive baselines with standard training recipe.
+
 **Optional future work**: If time permits, run standard ResNet experiments as additional baseline to demonstrate and explain why CIFAR-adapted architecture is necessary. This would strengthen the paper by showing the architectural choice is critical for fair comparison.
 
-**Priority**: Low (not needed for Round 1 acceptance)
+**Priority**: Low (not needed for Round 1 acceptance, but architectural choice should be mentioned in paper)
 
 ---