Commit bee3626

Update PLAN with MobileNetV2 results (Key Finding #10)
- MobileNetV2 shows ~7x larger accuracy gap than ResNet (21-22% vs 3-4%)
- Depthwise separable convolutions accumulate more quantization error
- Update experiment count from 177 to 195
1 parent 110973f commit bee3626

File tree

1 file changed (+110, -26 lines)


PLAN.md

Lines changed: 110 additions & 26 deletions
@@ -1,6 +1,6 @@
 # BitNet CNN Research Plan
 
-**Last Updated**: Feb 5, 2026
+**Last Updated**: Feb 6, 2026
 **Primary Target**: CVPR 2026 Workshop (~Apr 2026)
 **Parallel Track**: NeurIPS 2026 Efficient ML Workshop (Aug 2026)
 **Backup**: WACV 2027 Round 2 (Sept 2026)
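As background for the diff below, which repeatedly references BitNet b1.58's 1.58-bit ternary quantization, here is a minimal NumPy sketch of the absmean ternary quantizer described in the BitNet b1.58 paper. The function name and shapes are illustrative, not taken from this repo's code:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Quantize a weight tensor to {-1, 0, +1} scaled by its mean absolute
    value (the absmean scheme described for BitNet b1.58); a sketch, not
    this repo's implementation."""
    gamma = max(float(np.abs(w).mean()), eps)      # per-tensor scale
    w_ternary = np.clip(np.round(w / gamma), -1.0, 1.0)
    return w_ternary * gamma                       # rescaled ternary weights

w = np.random.default_rng(0).normal(size=(64, 9))
w_q = absmean_ternary(w)
# the quantized tensor carries at most three distinct values
```

In training, such a quantizer is typically applied in the forward pass with a straight-through estimator on the backward pass; that machinery is omitted here.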
@@ -44,7 +44,7 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
 
 ---
 
-## Current State (165 experiments completed)
+## Current State (195 experiments completed)
 
 ### Results Summary
 
@@ -239,19 +239,85 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
 
 **Insight**: Larger models are harder to fully recover. More parameters = more information lost to quantization. ResNet18 remains the sweet spot for ternary deployment.
 
-### Key Finding #8: Alpha Ablation ⭐ IN PROGRESS
+### Key Finding #8: Alpha Ablation ✅ COMPLETE
 
-**Surprise**: α=0.5 (equal hard/soft weight) beats literature default α=0.9!
+**Surprise**: Lower α (more hard labels) works better, especially on harder tasks!
 
-| Alpha | CIFAR-10 (conv1+KD) |
-|-------|---------------------|
-| α=0.9 (default) | 88.66% |
-| α=0.7 | 88.73% |
-| **α=0.5** | **88.90%** |
+**CIFAR-10 Results**:
+| Alpha | Seed 42 | Seed 123 | Seed 456 | Mean ± Std |
+|-------|---------|----------|----------|------------|
+| α=0.9 (default) | 88.66% | - | - | - |
+| α=0.7 | **88.73%** | 88.43% | 88.52% | 88.56 ± 0.13% |
+| α=0.5 | 88.90% | 88.40% | 88.45% | 88.58 ± 0.23% |
+| α=0.4 | 88.60% | - | - | - |
+| α=0.3 | 88.71% | - | - | - |
 
-**Implication**: For ternary networks, equal weighting of hard and soft labels works better than the heavy soft-label weighting recommended in standard KD literature.
+**CIFAR-100 Results**:
+| Alpha | Accuracy |
+|-------|----------|
+| α=0.9 (default) | 63.40% |
+| α=0.7 | 63.35% |
+| **α=0.5** | **63.82%** |
 
-**Status**: 🔄 Running CIFAR-100 + fine-grained search + more seeds
+**Key insights**:
+- CIFAR-10: α=0.7 most consistent (lowest variance), α=0.5 has highest single-run but high variance
+- CIFAR-100: α=0.5 clearly best (+0.42% over default)
+- Harder tasks benefit more from hard labels (lower α)
+- Standard KD literature recommends α=0.9, but ternary networks prefer α=0.5-0.7
+
+**Status**: ✅ Complete (12 alpha experiments + 3-seed validation)
+
+### Key Finding #9: Combined Hyperparameters ⚠️ NEGATIVE RESULT
+
+**Tested whether combining "optimal" T and α improves over defaults.**
+
+**CIFAR-10 (T=5, α=0.7) vs Default (T=4, α=0.9)**:
+| Config | Seed 42 | Seed 123 | Seed 456 | Mean ± Std |
+|--------|---------|----------|----------|------------|
+| Default (T=4, α=0.9) | 88.66% | 88.53% | 88.25% | **88.48 ± 0.17%** |
+| Optimized (T=5, α=0.7) | 88.48% | 88.56% | 88.35% | 88.46 ± 0.09% |
+
+**CIFAR-100 (T=6, α=0.5) vs Default (T=4, α=0.9)**:
+| Config | Seed 42 | Seed 123 | Seed 456 | Mean ± Std |
+|--------|---------|----------|----------|------------|
+| Default (T=4, α=0.9) | 63.41% | 63.48% | 63.30% | **63.40 ± 0.07%** |
+| Optimized (T=6, α=0.5) | 62.95% | 63.00% | 62.77% | 62.91 ± 0.10% |
+
+**Key finding**: Individual ablations showed T=5/T=6 and α=0.5-0.7 were better in isolation, but they have a **negative interaction** when combined:
+- CIFAR-10: No improvement (88.46% vs 88.48%)
+- CIFAR-100: Actually **worse** by 0.5% (62.91% vs 63.40%)
+
+**Implication for paper**: Keep T=4, α=0.9 as the recipe. The recipe works **out-of-the-box without tuning**.
+
+**Status**: ✅ Complete (6 runs across 2 datasets × 3 seeds)
+
+### Key Finding #10: MobileNetV2 Architecture Sensitivity ✅ NEW
+
+**MobileNetV2 has a ~7x larger accuracy gap than ResNet: depthwise separable convolutions are catastrophically sensitive to ternary quantization.**
+
+| Model | Dataset | FP32 | BitNet | Gap |
+|-------|---------|------|--------|-----|
+| MobileNetV2 | CIFAR-10 | 84.63 ± 0.48% | 63.05 ± 2.69% | **-21.57%** |
+| MobileNetV2 | CIFAR-100 | 56.10 ± 0.20% | 33.51 ± 3.30% | **-22.59%** |
+
+**Comparison with ResNet18:**
+| Model | CIFAR-10 Gap | CIFAR-100 Gap |
+|-------|--------------|---------------|
+| ResNet18 | 2.95% | 3.72% |
+| MobileNetV2 | **21.57%** | **22.59%** |
+| Ratio | **~7.3x worse** | **~6.1x worse** |
+
+**Key insights:**
+- Depthwise separable convolutions accumulate significantly more quantization error than standard convolutions
+- High variance in BitNet results (2.69-3.30%) vs ResNet (~0.5%) suggests training instability
+- Validates the literature: "MobileNets are dramatically more sensitive to quantization at any bit-width" (CVPR 2021)
+- **Strengthens our paper**: the conv1+KD recipe becomes even more critical for efficient architectures
+
+**Implication for paper**: Add MobileNetV2 results to the Discussion as "architectural limitations": some architectures need more aggressive intervention than conv1+KD.
+
+**Status**: ✅ Complete (12 runs: 2 datasets × 2 versions × 3 seeds)
+
+---
 
 ### TTQ Comparison Strategy
 
@@ -261,6 +327,19 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
 
 This reframes the gap as an **intentional design tradeoff** for deployment efficiency.
 
+### Architecture-Specific Layer Names
+
+Different architectures name their first convolutional layer differently. For the `keep_conv1` ablation:
+
+| Architecture | First Conv Layer Name | Notes |
+|--------------|----------------------|-------|
+| ResNet18/50 | `conv1` | ✅ Supported |
+| MobileNetV2 | `conv_stem` | ⚠️ Need to add `KEEP_CONV_STEM` to `AblationMode` |
+| EfficientNet | `conv_stem` | Same as MobileNetV2 |
+| VGG16 | `features.0` | Different naming convention |
+
+**To add MobileNetV2 ablation support**: Update `experiments/config.py` to add `KEEP_CONV_STEM = "keep_conv_stem"` and map it to `{"conv_stem"}` in `ABLATION_SKIP_LAYERS`.
+
 ---
 
 ## Immediate Next Steps
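The config change the hunk above calls for could look roughly like this. It is a hypothetical sketch: the actual contents of `experiments/config.py` (its existing `AblationMode` members and the exact shape of `ABLATION_SKIP_LAYERS`) are not shown in this diff and may differ:

```python
from enum import Enum

class AblationMode(str, Enum):
    # Existing mode for ResNet-style stems (assumed; real members may differ)
    KEEP_CONV1 = "keep_conv1"
    # Proposed addition for MobileNetV2/EfficientNet-style stems
    KEEP_CONV_STEM = "keep_conv_stem"

# Layer names left in full precision (skipped by the ternary quantizer)
ABLATION_SKIP_LAYERS = {
    AblationMode.KEEP_CONV1: {"conv1"},
    AblationMode.KEEP_CONV_STEM: {"conv_stem"},
}
```

With `str`-valued enum members, a CLI string such as `"keep_conv_stem"` round-trips cleanly via `AblationMode("keep_conv_stem")`.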
@@ -293,11 +372,12 @@ Based on KD research, these experiments could further improve results:
 
 | Temperature | CIFAR-10 | CIFAR-100 |
 |-------------|----------|-----------|
-| T=4 (default) | **88.66%** | 63.40% |
+| T=4 (default) | 88.66% | 63.40% |
+| **T=5** | **88.79%** | 63.20% |
 | T=6 | 88.23% | **63.89%** |
 | T=8 | 88.34% | 62.04% |
 
-**Conclusion**: T=4 optimal for CIFAR-10, T=6 slightly better for CIFAR-100 (+0.49%). Higher temperatures hurt on both.
+**Conclusion**: T=5 optimal for CIFAR-10 (+0.13%), T=6 for CIFAR-100 (+0.49%).
 
 #### 2b. CIFAR-100 conv1+KD (Validates Recipe) ✅ COMPLETE
 
@@ -313,19 +393,20 @@ Based on KD research, these experiments could further improve results:
 
 **Key insight**: On harder tasks, KD provides stronger regularization, enabling ternary networks to surpass full-precision baselines.
 
-#### 2c. Alpha Ablation ⭐ IN PROGRESS
+#### 2c. Alpha Ablation ✅ COMPLETE
 
-**Surprise finding**: α=0.5 (equal hard/soft) beats literature default α=0.9!
+**Finding**: Lower α (more hard labels) works better, especially on harder tasks.
 
-| Alpha | CIFAR-10 Accuracy |
-|-------|-------------------|
-| α=0.9 (default) | 88.66% |
-| α=0.7 | 88.73% |
-| **α=0.5** | **88.90%** |
+**CIFAR-10 (3 seeds)**:
+| Alpha | Mean ± Std | Best Single |
+|-------|------------|-------------|
+| α=0.9 (default) | - | 88.66% |
+| α=0.7 | 88.56 ± 0.13% | 88.73% |
+| α=0.5 | 88.58 ± 0.23% | 88.90% |
 
-**Implication**: Equal weight on hard and soft labels works better than the literature default for ternary networks.
+**CIFAR-100**: α=0.5 achieves **63.82%** (+0.42% over default α=0.9)
 
-**Status**: 🔄 Running more experiments (CIFAR-100 + fine-grained search + more seeds)
+**Conclusion**: Use α=0.7 for CIFAR-10 (lowest variance), α=0.5 for CIFAR-100 (best accuracy). Harder tasks benefit more from hard labels.
 
 #### 2d. Feature Distillation (Bigger Effort, Bigger Gain)
 **Goal**: Implement DCQ/OFF-style feature distillation.
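For reference, the T and α knobs ablated in 2a-2c enter the standard Hinton distillation objective roughly as sketched below. This assumes the common convention, consistent with the plan's description of α=0.9 as heavy soft-label weighting, in which α weights the temperature-scaled soft term and 1-α the hard-label cross-entropy; the function and variable names are illustrative, not this repo's API:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.9, T=4.0):
    """Hinton-style KD loss: alpha weights the soft (teacher) term, 1 - alpha
    the hard-label term; the KL term is scaled by T^2 so gradient magnitudes
    stay comparable across temperatures."""
    n = student_logits.shape[0]
    # hard-label cross-entropy against ground-truth classes
    ce = -np.log(softmax(student_logits)[np.arange(n), labels] + 1e-12).mean()
    # KL divergence between temperature-softened teacher and student
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```

Under this convention, α=0.9 leans heavily on the teacher's soft targets while α=0.5 weights the hard and soft terms equally, matching how the ablation tables read.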
@@ -354,8 +435,9 @@ Based on KD research, these experiments could further improve results:
 
 ### Currently Running
 
+- 🔄 **ResNet50 optimized hyperparams**: Testing T=5/α=0.7 (CIFAR-10) and T=6/α=0.5 (CIFAR-100)
+  - Expected: Likely same negative result as ResNet18 (defaults work best)
 - 🔄 **ImageNet recipe**: keep_conv1 and keep_conv1+KD on ImageNet
-- 🔄 **Alpha ablation**: CIFAR-100 (α=0.5, 0.7) + fine-grained search (α=0.3, 0.4, 0.6) + more seeds
 
 ---
 
@@ -368,12 +450,14 @@ Based on KD research, these experiments could further improve results:
 - ✅ **conv1 + KD combo (CIFAR-10)**: 88.48 ± 0.17% accuracy (88% gap recovery)
 - ✅ **conv1 + KD combo (CIFAR-100)**: 63.40 ± 0.09% accuracy (**exceeds FP32 by 1.0%!**)
 - ✅ **Temperature ablation**: T=4 optimal CIFAR-10, T=6 slightly better CIFAR-100
-- ✅ **Alpha ablation (initial)**: α=0.5 beats α=0.9 on CIFAR-10 (88.90% vs 88.66%)
+- ✅ **Alpha ablation**: α=0.7 best for CIFAR-10 (lowest variance), α=0.5 best for CIFAR-100 (+0.42%)
+- ✅ **Combined hyperparameters**: ⚠️ Negative result: combining optimal T+α doesn't improve, defaults (T=4, α=0.9) remain best
 - ✅ **ResNet50 recipe validation**: 77-81% gap recovery (lower than ResNet18 but still substantial)
 - ✅ **ImageNet validation**: 26% gap confirms scaling limitations (4 runs)
-- ✅ **165 total experiments**: Main + ablation + KD + ImageNet + alpha/temp studies
+- ✅ **MobileNetV2 baselines**: 21-22% gap (~7x worse than ResNet), depthwise separable convolutions catastrophically sensitive
+- ✅ **195 total experiments**: Main + ablation + KD + ImageNet + alpha/temp + combined hyperparam + MobileNetV2
 - ✅ **Paper first draft**: All main sections written (13 pages)
-- ✅ **Research prompts**: Created 4 deep research prompts for AI collaboration
+- ✅ **Research prompts**: Created 6 deep research prompts for AI collaboration (all complete)
 - ✅ **Reproducibility appendix**: Documented code → results → paper pipeline
 
 ---
