# BitNet CNN Research Plan

**Last Updated**: Feb 10, 2026
**Primary Target**: CVPR 2026 Workshop (~Apr 2026)
**Parallel Track**: NeurIPS 2026 Efficient ML Workshop (Aug 2026)
**Backup**: WACV 2027 Round 2 (Sept 2026)

**Title**: "When Augmentation Fails: Knowledge Distillation for Ternary CNNs"

**Paper file**: `paper/main.tex` (14 pages, builds successfully)

### Sections Completed

- ✅ Discussion
- ✅ Conclusion
- ✅ Reproducibility Appendix (code → results → paper pipeline)
- ✅ Information Theory Appendix (theoretical grounding via DPI, channel capacity)

### Still Needed

---

| BitNet + keep_conv1 | 35.60% | 35.60% | Worse than baseline (39.64%) |
| BitNet + keep_conv1 + KD | 11.27% | 11.27% | Near-random (catastrophic) |

**Key insight**: The recipe that works on CIFAR catastrophically fails on ImageNet. **Root cause identified via research (07_imagenet_kd_failure.md)**:

1. **Optimizer mismatch**: we used SGD (the CIFAR default), while ReActNet/BNext use Adam with lr=1e-3
2. **Insufficient epochs**: 90 epochs vs. the 256-512 typically needed for binary/ternary convergence
3. **Wrong KD hyperparameters**: α=0.9 overwhelms ternary capacity at 1000 classes (need α=0.1-0.5); see the loss sketch below
4. **Missing techniques**: no progressive quantization, no learned activations (RPReLU)

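The α issue in item 3 refers to the usual Hinton-style KD objective, where α weights the soft (teacher) term and 1-α the hard labels. A minimal sketch of that loss follows; the function name and arguments are illustrative, not the actual signature in `experiments/train_kd.py`. With α=0.9, only 10% of the gradient comes from hard labels, so a 1000-class ternary student is pushed almost entirely toward a teacher distribution it lacks the capacity to match.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.9, temperature=4.0):
    """Hinton-style KD: alpha weights the soft (teacher) term, 1 - alpha the hard labels."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)  # T^2 keeps soft-term gradients comparable across temperatures
    return alpha * soft + (1.0 - alpha) * hard
```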
**This is NOT a fundamental limitation** - it's a training recipe problem. BNext achieves 80.57% on ImageNet with proper training.

**Decision**: Pivoting to **Tiny ImageNet** (200 classes, 64x64 images) for faster iteration. ImageNet-scale success requires adopting the ReActNet/BNext training recipes (future work).

**Status**: ⚠️ Baselines complete, recipe failed - root cause understood, pivoting to Tiny ImageNet

### Key Finding #6b: Tiny ImageNet Validation ✅ COMPLETE

---

### 🎯 PRIORITY 1: Deep Literature Research ✅ COMPLETE

**Status**: All 9 research prompts completed and integrated into the paper.

| File | Topic | Key Finding |
|------|-------|-------------|
| `01_ttq_comparison.md` | TTQ vs BitNet gap | TTQ uses learned asymmetric scales (W^p, W^n) and keeps conv1/FC in FP32. Our gap is explained by the simpler formulation. |
| `02_layer_sensitivity_literature.md` | Layer sensitivity | conv1 FP32 is standard since 2016. **Our contribution: precise quantification (54-74%)**. |
| `03_kd_for_quantization.md` | KD literature | T=4 may be suboptimal (T=6-8 better). **Feature distillation could add 5-20% more recovery**. |
| `04_bitnet_cnn_prior_work.md` | Prior BitNet CNN work | Novelty confirmed: first ResNet study with full training + augmentation analysis. |
| `05_alpha_hard_label_preference.md` | Hard label preference | Ternary networks prefer lower α (0.5-0.7) due to limited capacity to mimic soft distributions. |
| `06_capacity_gap_scaling.md` | Gap scaling with complexity | Information bottleneck explains why the gap grows with task complexity (3.5% → 4.3% → 5.8% → 26%). |
| `07_imagenet_kd_failure.md` | ImageNet KD failure | **Root cause: training recipe mismatch** (SGD → Adam, 90 → 256-512 epochs, α=0.9 too high for 1000 classes). Not a fundamental limitation. |
| `08_bnext_reactnet_techniques.md` | BNext/ReActNet techniques | 6 techniques explain 80%+ accuracy: Adam optimizer (+8-12%), longer training (+5-8%), progressive quantization, learned activations (RPReLU, sketched below). |
| `09_information_theory_appendix.md` | Theoretical grounding | DPI explains conv1 criticality; channel capacity bounds explain gap scaling. Publication-ready appendix. |
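For reference, the "learned activations" in the `08_bnext_reactnet_techniques.md` row are ReActNet's RPReLU. Below is a minimal PyTorch sketch under the standard formulation (a learned input shift, a per-channel PReLU, and a learned output shift); this module is not part of our current codebase.

```python
import torch
import torch.nn as nn

class RPReLU(nn.Module):
    """ReActNet RPReLU: PReLU with learnable per-channel shifts before and after the activation."""

    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))  # input shift
        self.zeta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # output shift
        self.prelu = nn.PReLU(num_parameters=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expects NCHW feature maps; PReLU applies its learned slope per channel (dim 1).
        return self.prelu(x - self.gamma) + self.zeta
```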
**Paper updates made**:
- ✅ Related Work rewritten with proper framing and 13 new citations
- ✅ Introduction updated to acknowledge conv1 FP32 as established practice
- ✅ Contributions list refined ("quantified layer sensitivity" not "discovery")
- ✅ Discussion added TTQ comparison paragraph (design tradeoff)
- ✅ Future Work expanded with feature distillation, temperature tuning, ReActNet/BNext techniques
- ✅ Limitations section updated with ImageNet failure root cause analysis
- ✅ Information Theory Appendix (Appendix B) added with DPI and channel capacity analysis

### 🎯 PRIORITY 2: New Experiments from Research Findings

---

- [ ] Final proofread
- [ ] Complete Tiny ImageNet validation (replacement for ImageNet recipe)

### 🎯 PRIORITY 4: Information Theory Appendix ✅ COMPLETE

**Status**: Appendix B added to the paper with full theoretical grounding.

**Contents**:
- Why 1.58 Bits: entropy calculation H(W) = log₂(3) ≈ 1.585 bits
- Channel Capacity View: quantization as a capacity-limited channel
- Why conv1 Matters: the Data Processing Inequality explains the irrecoverable bottleneck
- Gap Scaling: output entropy H(Y) = log₂(C) explains scaling with task complexity

**Key insight**: The DPI formally proves that information lost at conv1 cannot be recovered by later layers, providing theoretical justification for our empirical finding that conv1 accounts for 54-74% of the accuracy gap.

**Citations added**: tishby2015deep (Information Bottleneck)
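The entropy and compression numbers in the appendix are easy to sanity-check numerically. The sketch below is illustrative only (not part of the experiment code); the last line prints the output entropies H(Y) = log₂(C) for 10, 100, and 1000 classes, the quantity the gap-scaling argument compares against the fixed ternary weight capacity.

```python
import numpy as np

def weight_entropy_bits(weights: np.ndarray) -> float:
    """Empirical entropy in bits per weight for a ternary tensor with values in {-1, 0, +1}."""
    _, counts = np.unique(np.sign(weights).astype(int), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

uniform = np.random.choice([-1, 0, 1], size=100_000)  # uniform ternary weights
print(weight_entropy_bits(uniform))   # ≈ 1.585 bits = log2(3), the ternary ceiling
print(32 / np.log2(3))                # ≈ 20.2x ideal compression vs. FP32 weights
print(np.log2([10, 100, 1000]))       # H(Y) for CIFAR-10, CIFAR-100, ImageNet
```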
---

### Currently Running (Wave 2 - Feb 11, 2026)

566- \subsection{Channel Capacity View}
567- - Neural network layer as noisy channel
568- - Quantization reduces channel capacity: C_ternary << C_FP32
569- - Bounds mutual information I(X; Y) between input and output
570- - Explains why complex tasks (high H(Y) = log₂(classes)) suffer more
574+ ** GPU 0: EfficientNet-B0 KD** (6 experiments)
575+ - EfficientNet-B0 KD CIFAR-10 seeds 42, 123, 456
576+ - EfficientNet-B0 KD CIFAR-100 seeds 42, 123, 456
571577
572- \subsection{Why conv1 Matters: Information Bottleneck}
573- - Data Processing Inequality: I(X; T₁) ≥ I(X; T₂) ≥ ... ≥ I(X; Y)
574- - Quantizing conv1 aggressively reduces I(X; T₁)
575- - Creates irrecoverable bottleneck - later layers can't recover lost information
576- - Theoretical justification for keeping conv1 in FP32
578+ ** GPU 1: Hyperparameter Debug** (8 experiments)
579+ - ConvNeXt-Tiny FP32/BitNet with lr=0.01, lr=0.004
580+ - MobileNetV2 BitNet with lr=0.002, lr=0.005, lr=0.01 (seeds 789, 999)
577581
### Wave 1 Results (Feb 11, 2026)

| Model | Dataset | FP32 | BitNet | Gap | Status |
|-------|---------|------|--------|-----|--------|
| **EfficientNet-B0** | CIFAR-10 | 84.91% | 79.29% | 5.6% | ✅ Ready for KD |
| **EfficientNet-B0** | CIFAR-100 | 56.92% | 46.19% | 10.7% | ✅ Ready for KD |
| **ConvNeXt-Tiny** | CIFAR-10 | 67.22% | 71.38% | **-4.2%** | ⚠️ BROKEN (BitNet > FP32?!) |
| **ConvNeXt-Tiny** | CIFAR-100 | 36.95% | 41.65% | **-4.7%** | ⚠️ BROKEN |
| **MobileNetV2** | CIFAR-10 | 84.63% | 67.00% | 17.6% | ⚠️ High variance (std=14.76%) |
| **MobileNetV2** | CIFAR-100 | 56.10% | 34.68% | 21.4% | ⚠️ Unstable |

**Issues identified:**
1. **ConvNeXt**: lr=0.1 is too high → FP32 reaches only 67% (should be ~85-90%). Testing lr=0.01 and lr=0.004.
2. **MobileNetV2 BitNet**: huge variance at lr=0.01. Testing lr=0.002 and lr=0.005.

### Wave 2 Commands

**GPU 0 - EfficientNet KD:**
```bash
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --model efficientnet_b0 --dataset cifar10 --ablation keep_conv_stem --lr 0.01 --seed 42 --teacher-path results/raw/cifar10/efficientnet_b0/std_lr0.01_s42/best_model.pth
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --model efficientnet_b0 --dataset cifar10 --ablation keep_conv_stem --lr 0.01 --seed 123 --teacher-path results/raw/cifar10/efficientnet_b0/std_lr0.01_s123/best_model.pth
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --model efficientnet_b0 --dataset cifar10 --ablation keep_conv_stem --lr 0.01 --seed 456 --teacher-path results/raw/cifar10/efficientnet_b0/std_lr0.01_s456/best_model.pth
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --model efficientnet_b0 --dataset cifar100 --ablation keep_conv_stem --lr 0.01 --seed 42 --teacher-path results/raw/cifar100/efficientnet_b0/std_lr0.01_s42/best_model.pth
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --model efficientnet_b0 --dataset cifar100 --ablation keep_conv_stem --lr 0.01 --seed 123 --teacher-path results/raw/cifar100/efficientnet_b0/std_lr0.01_s123/best_model.pth
CUDA_VISIBLE_DEVICES=0 uv run python -m experiments.train_kd --model efficientnet_b0 --dataset cifar100 --ablation keep_conv_stem --lr 0.01 --seed 456 --teacher-path results/raw/cifar100/efficientnet_b0/std_lr0.01_s456/best_model.pth
```
**GPU 1 - Hyperparameter Debug:**
```bash
# ConvNeXt LR sweep
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model convnext_tiny --dataset cifar10 --lr 0.01 --seed 42
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model convnext_tiny --dataset cifar10 --lr 0.004 --seed 42
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model convnext_tiny --dataset cifar10 --bit-version --lr 0.01 --seed 42
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model convnext_tiny --dataset cifar10 --bit-version --lr 0.004 --seed 42

# MobileNetV2 BitNet LR sweep
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model mobilenetv2_100 --dataset cifar10 --bit-version --lr 0.002 --seed 42
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model mobilenetv2_100 --dataset cifar10 --bit-version --lr 0.005 --seed 42
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model mobilenetv2_100 --dataset cifar10 --bit-version --lr 0.01 --seed 789
CUDA_VISIBLE_DEVICES=1 uv run python -m experiments.train --model mobilenetv2_100 --dataset cifar10 --bit-version --lr 0.01 --seed 999
```
### Wave 3 (After Debug Results)

Once we identify a working LR for ConvNeXt/MobileNetV2:
1. Re-run the full baselines (3 seeds × 2 datasets) with the correct LR
2. Run the KD experiments with the correct LR

### Recently Completed

- ✅ **Wave 1 baselines** (Feb 11): EfficientNet-B0 ✅, ConvNeXt ⚠️, MobileNetV2 ⚠️
- ✅ **MobileNetV2 3-seed validation** (lr=0.01): Complete - CIFAR-10 stable, CIFAR-100 still unstable
- ✅ **MobileNetV2 KD + keep_conv_stem** (lr=0.01): Complete - 79.63% mean (39% gap recovery)
- ✅ **Tiny ImageNet baselines** (3 seeds): Complete - FP32 54.85%, BitNet 49.04% (5.81% gap)
- ✅ **Tiny ImageNet recipe** (KD + keep_conv1): Complete - **56.15% (122% recovery, exceeds FP32!)**
- ✅ **Information Theory Appendix**: Added to paper (Appendix B)
- ✅ **Research prompts 07-09**: ImageNet failure, BNext/ReActNet techniques, information theory

---

- ✅ **ImageNet validation**: 26% gap confirms scaling limitations (4 runs)
- ✅ **MobileNetV2 baselines**: 21-22% gap (~7x worse than ResNet) - depthwise separable convolutions catastrophically sensitive
- ✅ **195 total experiments**: Main + ablation + KD + ImageNet + alpha/temp + combined hyperparam + MobileNetV2
- ✅ **Paper first draft**: All main sections written (14 pages with appendices)
- ✅ **Research prompts**: Created 9 deep research prompts for AI collaboration (all complete)
- ✅ **Reproducibility appendix**: Documented code → results → paper pipeline
- ✅ **Information theory appendix**: DPI and channel capacity analysis (Appendix B)

---

| Month | Milestones |
|-------|------------|
| **Jan** | ✅ Layer-wise ablation, efficiency metrics, KD experiment |
| **Feb (now)** | ✅ conv1+KD combo, ✅ Paper first draft, 🔄 ImageNet, 📝 CIFAR-100 conv1+KD |
| **Feb-Mar** | Architecture extension (EfficientNet, ConvNeXt), figures, final polish |
| **Mar** | Polish, internal review |
| **~Apr** | Submit to **CVPR 2026 Workshop** (primary target) |
| **Aug** | Submit to NeurIPS 2026 Efficient ML Workshop (if needed) |