Update PLAN with MobileNetV2 results (Key Finding #10)

- MobileNetV2 shows ~7x larger accuracy gap than ResNet (21-22% vs 3-4%)
- Depthwise separable convolutions accumulate more quantization error
- Update experiment count from 177 to 195

**Parallel Track**: NeurIPS 2026 Efficient ML Workshop (Aug 2026)
**Backup**: WACV 2027 Round 2 (Sept 2026)
@@ -44,7 +44,7 @@ Systematic study of BitNet b1.58 (1.58-bit ternary quantization) applied to stan
---
## Current State (195 experiments completed)
### Results Summary
**Insight**: Larger models are harder to fully recover. More parameters = more information lost to quantization. ResNet18 remains the sweet spot for ternary deployment.
### Key Finding #8: Alpha Ablation ✅ COMPLETE
**Surprise**: Lower α (more hard labels) works better, especially on harder tasks!
**Implication**: For ternary networks, equal weighting of hard and soft labels works better than the heavy soft-label weighting recommended in standard KD literature.

**CIFAR-100 Results**:

| Alpha | Accuracy |
|-------|----------|
| α=0.9 (default) | 63.40% |
| α=0.7 | 63.35% |
| **α=0.5** | **63.82%** |

**Key insights**:

- CIFAR-10: α=0.7 is the most consistent (lowest variance); α=0.5 has the highest single run but also high variance
- CIFAR-100: α=0.5 is clearly best (+0.42% over the default)
- Harder tasks benefit more from hard labels (lower α)
- Standard KD literature recommends α=0.9, but ternary networks prefer α=0.5-0.7
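For reference, the α values above follow the convention where α weights the soft (teacher) term and 1-α the hard-label cross-entropy, so lower α means more hard-label signal. A minimal NumPy sketch of that mixing, illustrative only (the `kd_loss` name and signature are not from the repo):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """alpha weights the soft KL term (scaled by T^2, as in standard KD);
    1 - alpha weights the hard-label cross-entropy at T=1."""
    log_ps = log_softmax(student_logits / T)      # softened student
    pt = np.exp(log_softmax(teacher_logits / T))  # softened teacher
    soft = float((pt * (np.log(pt) - log_ps)).sum()) * T * T
    hard = float(-log_softmax(student_logits)[label])
    return alpha * soft + (1.0 - alpha) * hard
```

With identical student and teacher logits the KL term vanishes, so α=1.0 gives zero loss; α=0.0 recovers plain cross-entropy.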
- Depthwise separable convolutions accumulate significantly more quantization error than standard convolutions
- High variance in BitNet results (2.69-3.30%) vs ResNet (~0.5%) suggests training instability
- Validates literature: "MobileNets are dramatically more sensitive to quantization at any bit-width" (CVPR 2021)
- **Strengthens our paper**: the conv1+KD recipe becomes even more critical for efficient architectures

**Implication for paper**: Add MobileNetV2 results to the Discussion as "architectural limitations" - some architectures need more aggressive intervention than conv1+KD.
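The error-accumulation point can be illustrated with BitNet b1.58's absmean quantizer (scale by mean |W|, then round each weight to {-1, 0, +1}). A sketch, not the project's code; the filter shapes are illustrative:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization: W -> gamma * RoundClip(W / gamma)."""
    gamma = np.abs(w).mean() + eps
    codes = np.clip(np.round(w / gamma), -1, 1)  # values in {-1, 0, +1}
    return codes * gamma

rng = np.random.default_rng(0)
# Each depthwise output channel sums only 9 quantized weights (one 3x3
# kernel), so per-weight rounding error averages out far less than in a
# dense 3x3x64 filter that sums hundreds of weights per output.
depthwise = rng.normal(size=9)
dense = rng.normal(size=9 * 64)
err_depthwise = np.abs(depthwise - ternary_quantize(depthwise)).mean()
err_dense = np.abs(dense - ternary_quantize(dense)).mean()
```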
This reframes the gap as an **intentional design tradeoff** for deployment efficiency.

### Architecture-Specific Layer Names

Different architectures name their first convolutional layer differently. For the `keep_conv1` ablation:

| Architecture | First Conv Layer Name | Notes |
|--------------|-----------------------|-------|
| ResNet18/50 | `conv1` | ✅ Supported |
| MobileNetV2 | `conv_stem` | ⚠️ Need to add `KEEP_CONV_STEM` to `AblationMode` |
| EfficientNet | `conv_stem` | Same as MobileNetV2 |
| VGG16 | `features.0` | Different naming convention |

**To add MobileNetV2 ablation support**: Update `experiments/config.py` to add `KEEP_CONV_STEM = "keep_conv_stem"` and map it to `{"conv_stem"}` in `ABLATION_SKIP_LAYERS`.
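A sketch of that change, assuming `AblationMode` is a string-valued Enum and `ABLATION_SKIP_LAYERS` maps each mode to a set of layer names kept in full precision (we have only the two identifiers quoted above; everything else here, including `KEEP_CONV1` and `should_skip`, is an assumption about `experiments/config.py`):

```python
from enum import Enum

class AblationMode(str, Enum):
    # existing mode (assumed name, per the keep_conv1 ablation)
    KEEP_CONV1 = "keep_conv1"
    # new mode for conv_stem architectures (MobileNetV2, EfficientNet)
    KEEP_CONV_STEM = "keep_conv_stem"

# layer names left in full precision under each ablation mode
ABLATION_SKIP_LAYERS = {
    AblationMode.KEEP_CONV1: {"conv1"},
    AblationMode.KEEP_CONV_STEM: {"conv_stem"},
}

def should_skip(mode: AblationMode, layer_name: str) -> bool:
    """True if `layer_name` should stay full-precision under `mode`."""
    return layer_name in ABLATION_SKIP_LAYERS.get(mode, set())
```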
---

## Immediate Next Steps

Based on KD research, these experiments could further improve results:

| Temperature | CIFAR-10 | CIFAR-100 |
|-------------|----------|-----------|
| T=4 (default) | 88.66% | 63.40% |
| **T=5** | **88.79%** | 63.20% |
| T=6 | 88.23% | **63.89%** |
| T=8 | 88.34% | 62.04% |

**Conclusion**: T=5 optimal for CIFAR-10 (+0.13%), T=6 for CIFAR-100 (+0.49%).
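The temperatures in this sweep control how much the teacher distribution is flattened before distillation. A minimal sketch of the standard softening, not tied to this codebase:

```python
import numpy as np

def soft_targets(logits, T):
    """Softmax of logits / T; larger T flattens the distribution,
    spreading probability mass onto the non-argmax classes."""
    z = logits / T
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

teacher = np.array([8.0, 2.0, 1.0])  # illustrative teacher logits
p_low = soft_targets(teacher, 4.0)   # sweep default
p_high = soft_targets(teacher, 8.0)  # flatter targets
```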