Commit bee3626

Update PLAN with MobileNetV2 results (Key Finding #10)
- MobileNetV2 shows ~7x larger accuracy gap than ResNet (21-22% vs 3-4%)
- Depthwise separable convolutions accumulate more quantization error
- Update experiment count from 177 to 195
1 parent 110973f commit bee3626

File tree

1 file changed (+110, -26 lines)


PLAN.md

Lines changed: 110 additions & 26 deletions
@@ -1,6 +1,6 @@
 # BitNet CNN Research Plan
 
-**Last Updated**: Feb 5, 2026
+**Last Updated**: Feb 6, 2026
 **Primary Target**: CVPR 2026 Workshop (~Apr 2026)
 **Parallel Track**: NeurIPS 2026 Efficient ML Workshop (Aug 2026)
 **Backup**: WACV 2027 Round 2 (Sept 2026)
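As background for the diff below, which repeatedly references BitNet b1.58's 1.58-bit ternary quantization, here is a minimal NumPy sketch of the absmean ternary quantizer described in the BitNet b1.58 paper. The function name and shapes are illustrative, not taken from this repo's code:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Quantize a weight tensor to {-1, 0, +1} scaled by its mean absolute
    value (the absmean scheme described for BitNet b1.58); a sketch, not
    this repo's implementation."""
    gamma = max(float(np.abs(w).mean()), eps)      # per-tensor scale
    w_ternary = np.clip(np.round(w / gamma), -1.0, 1.0)
    return w_ternary * gamma                       # rescaled ternary weights

w = np.random.default_rng(0).normal(size=(64, 9))
w_q = absmean_ternary(w)
# the quantized tensor carries at most three distinct values
```

In training, such a quantizer is typically applied in the forward pass with a straight-through estimator on the backward pass; that machinery is omitted here.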
@@ -44,7 +44,7 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
 
 ---
 
-## Current State (165 experiments completed)
+## Current State (195 experiments completed)
 
 ### Results Summary
 
@@ -239,19 +239,85 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
 
 **Insight**: Larger models are harder to fully recover. More parameters = more information lost to quantization. ResNet18 remains the sweet spot for ternary deployment.
 
-### Key Finding #8: Alpha Ablation ⭐ IN PROGRESS
+### Key Finding #8: Alpha Ablation ✅ COMPLETE
 
-**Surprise**: α=0.5 (equal hard/soft weight) beats literature default α=0.9!
+**Surprise**: Lower α (more hard labels) works better, especially on harder tasks!
 
-| Alpha | CIFAR-10 (conv1+KD) |
-|-------|---------------------|
-| α=0.9 (default) | 88.66% |
-| α=0.7 | 88.73% |
-| **α=0.5** | **88.90%** |
+**CIFAR-10 Results**:
+| Alpha | Seed 42 | Seed 123 | Seed 456 | Mean ± Std |
+|-------|---------|----------|----------|------------|
+| α=0.9 (default) | 88.66% | - | - | - |
+| α=0.7 | **88.73%** | 88.43% | 88.52% | 88.56 ± 0.13% |
+| α=0.5 | 88.90% | 88.40% | 88.45% | 88.58 ± 0.23% |
+| α=0.4 | 88.60% | - | - | - |
+| α=0.3 | 88.71% | - | - | - |
 
-**Implication**: For ternary networks, equal weighting of hard and soft labels works better than the heavy soft-label weighting recommended in standard KD literature.
+**CIFAR-100 Results**:
+| Alpha | Accuracy |
+|-------|----------|
+| α=0.9 (default) | 63.40% |
+| α=0.7 | 63.35% |
+| **α=0.5** | **63.82%** |
 
-**Status**: 🔄 Running CIFAR-100 + fine-grained search + more seeds
+**Key insights**:
+- CIFAR-10: α=0.7 most consistent (lowest variance), α=0.5 has highest single-run but high variance
+- CIFAR-100: α=0.5 clearly best (+0.42% over default)
+- Harder tasks benefit more from hard labels (lower α)
+- Standard KD literature recommends α=0.9, but ternary networks prefer α=0.5-0.7
+
+**Status**: ✅ Complete (12 alpha experiments + 3-seed validation)
+
+### Key Finding #9: Combined Hyperparameters ⚠️ NEGATIVE RESULT
+
+**Tested whether combining "optimal" T and α improves over defaults.**
+
+**CIFAR-10 (T=5, α=0.7) vs Default (T=4, α=0.9)**:
+| Config | Seed 42 | Seed 123 | Seed 456 | Mean ± Std |
+|--------|---------|----------|----------|------------|
+| Default (T=4, α=0.9) | 88.66% | 88.53% | 88.25% | **88.48 ± 0.17%** |
+| Optimized (T=5, α=0.7) | 88.48% | 88.56% | 88.35% | 88.46 ± 0.09% |
+
+**CIFAR-100 (T=6, α=0.5) vs Default (T=4, α=0.9)**:
+| Config | Seed 42 | Seed 123 | Seed 456 | Mean ± Std |
+|--------|---------|----------|----------|------------|
+| Default (T=4, α=0.9) | 63.41% | 63.48% | 63.30% | **63.40 ± 0.07%** |
+| Optimized (T=6, α=0.5) | 62.95% | 63.00% | 62.77% | 62.91 ± 0.10% |
+
+**Key finding**: Individual ablations showed T=5/T=6 and α=0.5-0.7 were better in isolation, but they have a **negative interaction** when combined:
+- CIFAR-10: No improvement (88.46% vs 88.48%)
+- CIFAR-100: Actually **worse** by 0.5% (62.91% vs 63.40%)
+
+**Implication for paper**: Keep T=4, α=0.9 as the recipe. The recipe works **out-of-the-box without tuning**.
+
+**Status**: ✅ Complete (6 runs across 2 datasets × 3 seeds)
+
+### Key Finding #10: MobileNetV2 Architecture Sensitivity ✅ NEW
+
+**MobileNetV2 has a ~7x larger accuracy gap than ResNet: depthwise separable convolutions are catastrophically sensitive to ternary quantization.**
+
+| Model | Dataset | FP32 | BitNet | Gap |
+|-------|---------|------|--------|-----|
+| MobileNetV2 | CIFAR-10 | 84.63 ± 0.48% | 63.05 ± 2.69% | **-21.57%** |
+| MobileNetV2 | CIFAR-100 | 56.10 ± 0.20% | 33.51 ± 3.30% | **-22.59%** |
+
+**Comparison with ResNet18:**
+| Model | CIFAR-10 Gap | CIFAR-100 Gap |
+|-------|--------------|---------------|
+| ResNet18 | 2.95% | 3.72% |
+| MobileNetV2 | **21.57%** | **22.59%** |
+| Ratio | **~7.3x worse** | **~6.1x worse** |
+
+**Key insights:**
+- Depthwise separable convolutions accumulate significantly more quantization error than standard convolutions
+- High variance in BitNet results (2.69-3.30%) vs ResNet (~0.5%) suggests training instability
+- Validates the literature: "MobileNets are dramatically more sensitive to quantization at any bit-width" (CVPR 2021)
+- **Strengthens our paper**: the conv1+KD recipe becomes even more critical for efficient architectures
+
+**Implication for paper**: Add MobileNetV2 results to the Discussion as "architectural limitations": some architectures need more aggressive intervention than conv1+KD.
+
+**Status**: ✅ Complete (12 runs: 2 datasets × 2 versions × 3 seeds)
+
+---
 
 ### TTQ Comparison Strategy
 
@@ -261,6 +327,19 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
 
 This reframes the gap as an **intentional design tradeoff** for deployment efficiency.
 
+### Architecture-Specific Layer Names
+
+Different architectures name their first convolutional layer differently. For the `keep_conv1` ablation:
+
+| Architecture | First Conv Layer Name | Notes |
+|--------------|----------------------|-------|
+| ResNet18/50 | `conv1` | ✅ Supported |
+| MobileNetV2 | `conv_stem` | ⚠️ Need to add `KEEP_CONV_STEM` to `AblationMode` |
+| EfficientNet | `conv_stem` | Same as MobileNetV2 |
+| VGG16 | `features.0` | Different naming convention |
+
+**To add MobileNetV2 ablation support**: Update `experiments/config.py` to add `KEEP_CONV_STEM = "keep_conv_stem"` and map it to `{"conv_stem"}` in `ABLATION_SKIP_LAYERS`.
+
 ---
 
 ## Immediate Next Steps
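The config change the hunk above calls for could look roughly like this. It is a hypothetical sketch: the actual contents of `experiments/config.py` (its existing `AblationMode` members and the exact shape of `ABLATION_SKIP_LAYERS`) are not shown in this diff and may differ:

```python
from enum import Enum

class AblationMode(str, Enum):
    # Existing mode for ResNet-style stems (assumed; real members may differ)
    KEEP_CONV1 = "keep_conv1"
    # Proposed addition for MobileNetV2/EfficientNet-style stems
    KEEP_CONV_STEM = "keep_conv_stem"

# Layer names left in full precision (skipped by the ternary quantizer)
ABLATION_SKIP_LAYERS = {
    AblationMode.KEEP_CONV1: {"conv1"},
    AblationMode.KEEP_CONV_STEM: {"conv_stem"},
}
```

With `str`-valued enum members, a CLI string such as `"keep_conv_stem"` round-trips cleanly via `AblationMode("keep_conv_stem")`.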
@@ -293,11 +372,12 @@ Based on KD research, these experiments could further improve results:
 
 | Temperature | CIFAR-10 | CIFAR-100 |
 |-------------|----------|-----------|
-| T=4 (default) | **88.66%** | 63.40% |
+| T=4 (default) | 88.66% | 63.40% |
+| **T=5** | **88.79%** | 63.20% |
 | T=6 | 88.23% | **63.89%** |
 | T=8 | 88.34% | 62.04% |
 
-**Conclusion**: T=4 optimal for CIFAR-10, T=6 slightly better for CIFAR-100 (+0.49%). Higher temperatures hurt on both.
+**Conclusion**: T=5 optimal for CIFAR-10 (+0.13%), T=6 for CIFAR-100 (+0.49%).
 
 #### 2b. CIFAR-100 conv1+KD (Validates Recipe) ✅ COMPLETE
 
@@ -313,19 +393,20 @@ Based on KD research, these experiments could further improve results:
 
 **Key insight**: On harder tasks, KD provides stronger regularization, enabling ternary networks to surpass full-precision baselines.
 
-#### 2c. Alpha Ablation ⭐ IN PROGRESS
+#### 2c. Alpha Ablation ✅ COMPLETE
 
-**Surprise finding**: α=0.5 (equal hard/soft) beats literature default α=0.9!
+**Finding**: Lower α (more hard labels) works better, especially on harder tasks.
 
-| Alpha | CIFAR-10 Accuracy |
-|-------|-------------------|
-| α=0.9 (default) | 88.66% |
-| α=0.7 | 88.73% |
-| **α=0.5** | **88.90%** |
+**CIFAR-10 (3 seeds)**:
+| Alpha | Mean ± Std | Best Single |
+|-------|------------|-------------|
+| α=0.9 (default) | - | 88.66% |
+| α=0.7 | 88.56 ± 0.13% | 88.73% |
+| α=0.5 | 88.58 ± 0.23% | 88.90% |
 
-**Implication**: Equal weight on hard and soft labels works better than the literature default for ternary networks.
+**CIFAR-100**: α=0.5 achieves **63.82%** (+0.42% over default α=0.9)
 
-**Status**: 🔄 Running more experiments (CIFAR-100 + fine-grained search + more seeds)
+**Conclusion**: Use α=0.7 for CIFAR-10 (lowest variance), α=0.5 for CIFAR-100 (best accuracy). Harder tasks benefit more from hard labels.
 
 #### 2d. Feature Distillation (Bigger Effort, Bigger Gain)
 **Goal**: Implement DCQ/OFF-style feature distillation.
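For reference, the T and α knobs ablated in 2a-2c enter the standard Hinton distillation objective roughly as sketched below. This assumes the common convention, consistent with the plan's description of α=0.9 as heavy soft-label weighting, in which α weights the temperature-scaled soft term and 1-α the hard-label cross-entropy; the function and variable names are illustrative, not this repo's API:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.9, T=4.0):
    """Hinton-style KD loss: alpha weights the soft (teacher) term, 1 - alpha
    the hard-label term; the KL term is scaled by T^2 so gradient magnitudes
    stay comparable across temperatures."""
    n = student_logits.shape[0]
    # hard-label cross-entropy against ground-truth classes
    ce = -np.log(softmax(student_logits)[np.arange(n), labels] + 1e-12).mean()
    # KL divergence between temperature-softened teacher and student
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```

Under this convention, α=0.9 leans heavily on the teacher's soft targets while α=0.5 weights the hard and soft terms equally, matching how the ablation tables read.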
@@ -354,8 +435,9 @@ Based on KD research, these experiments could further improve results:
 
 ### Currently Running
 
+- 🔄 **ResNet50 optimized hyperparams**: Testing T=5/α=0.7 (CIFAR-10) and T=6/α=0.5 (CIFAR-100)
+  - Expected: Likely same negative result as ResNet18 (defaults work best)
 - 🔄 **ImageNet recipe**: keep_conv1 and keep_conv1+KD on ImageNet
-- 🔄 **Alpha ablation**: CIFAR-100 (α=0.5, 0.7) + fine-grained search (α=0.3, 0.4, 0.6) + more seeds
 
 ---
 
@@ -368,12 +450,14 @@ Based on KD research, these experiments could further improve results:
 - ✅ **conv1 + KD combo (CIFAR-10)**: 88.48 ± 0.17% accuracy (88% gap recovery)
 - ✅ **conv1 + KD combo (CIFAR-100)**: 63.40 ± 0.09% accuracy (**exceeds FP32 by 1.0%!**)
 - ✅ **Temperature ablation**: T=4 optimal CIFAR-10, T=6 slightly better CIFAR-100
-- ✅ **Alpha ablation (initial)**: α=0.5 beats α=0.9 on CIFAR-10 (88.90% vs 88.66%)
+- ✅ **Alpha ablation**: α=0.7 best for CIFAR-10 (lowest variance), α=0.5 best for CIFAR-100 (+0.42%)
+- ✅ **Combined hyperparameters**: ⚠️ Negative result: combining optimal T+α doesn't improve, defaults (T=4, α=0.9) remain best
 - ✅ **ResNet50 recipe validation**: 77-81% gap recovery (lower than ResNet18 but still substantial)
 - ✅ **ImageNet validation**: 26% gap confirms scaling limitations (4 runs)
-- ✅ **165 total experiments**: Main + ablation + KD + ImageNet + alpha/temp studies
+- ✅ **MobileNetV2 baselines**: 21-22% gap (~7x worse than ResNet), depthwise separable convolutions catastrophically sensitive
+- ✅ **195 total experiments**: Main + ablation + KD + ImageNet + alpha/temp + combined hyperparam + MobileNetV2
 - ✅ **Paper first draft**: All main sections written (13 pages)
-- ✅ **Research prompts**: Created 4 deep research prompts for AI collaboration
+- ✅ **Research prompts**: Created 6 deep research prompts for AI collaboration (all complete)
 - ✅ **Reproducibility appendix**: Documented code → results → paper pipeline
 
 ---
