We conduct a comprehensive empirical study of ternary quantization for CNNs, comprising 153 controlled experiments across three datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet), two architectures (ResNet-18, ResNet-50), and multiple training configurations. Crucially, we include proper baseline controls requested by the research community: \textbf{FP32+KD baselines} that isolate the quantization penalty from knowledge distillation benefits. The experiments uncover three surprising patterns:
\textbf{1. Layer Sensitivity is Highly Skewed.} Through systematic layer-wise ablation (keeping specific layers in FP32 while quantizing others), we find that the first convolutional layer (\texttt{conv1}) accounts for 30-74\% of recoverable accuracy loss—2.5× more than any other single layer. This asymmetry is consistent across datasets and architectures, pointing to \texttt{conv1} as an information bottleneck where quantization has outsized impact.
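The ternary scheme under study fits in a few lines. Below is a minimal, illustrative sketch of absmean quantization in the style of BitNet b1.58 (the function name and plain-list weight representation are ours for illustration, not the paper's training code):

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58 style):
    scale by the mean absolute weight, round, clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma

# gamma = mean(|0.9|, |-0.05|, |0.4|, |-1.2|) = 0.6375
q, gamma = ternarize([0.9, -0.05, 0.4, -1.2])  # q == [1, 0, 1, -1]
```

Every quantized layer thus carries only the ternary codes plus one scale, which is what makes \texttt{conv1}'s disproportionate sensitivity actionable for mixed-precision deployment.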
This work makes three contributions. First, we establish reproducible methodology where every table, figure, and claim is programmatically generated from documented experiments. Our infrastructure includes complete documentation of 153 experiments with exact commands, deterministic training with seed control, automated analysis pipelines that transform raw results into publication-ready tables, and public code and data for community use. Beyond our specific findings on ternary quantization, we demonstrate best practices for rigorous empirical ML research.
Second, our systematic investigation quantifies layer-wise sensitivity (\texttt{conv1} accounts for 30-74\% of recoverable accuracy, identifying a clear target for mixed-precision deployment) and documents knowledge distillation's surprising failure mode: KD degrades ternary networks (-0.9\% to -3.1\%) while benefiting FP32 students, revealing capacity-dependent effectiveness. The degradation scales with task complexity (CIFAR-10: -0.9\%, Tiny-ImageNet: -3.1\%), suggesting soft labels overwhelm limited ternary representational capacity—a finding that opens testable future work on lower $\alpha$ values and specialized KD techniques.
File: paper/sections/02_related_work.tex (+1, -1), subsection "BitNet for Vision"
For vision, Nielsen and Schneider-Kamp~\cite{nielsen2024bitnetreloaded} applied BitNet b1.58 to small custom CNNs (100K-2.2M parameters) on CIFAR-10, achieving 71.47\% accuracy after only 10 epochs of training. BD-Net~\cite{kim2025bdnet} applied 1.58-bit quantization specifically to MobileNet's depthwise convolutions, reporting improvements over binary baselines.
However, no prior work has systematically studied BitNet b1.58 on standard CNN architectures (ResNet) with full training schedules, proper baseline controls (FP32+KD), and investigation of training dynamics (augmentation effects). Our work fills this gap with 153 controlled experiments and complete reproducibility.
Our research infrastructure ensures end-to-end reproducibility through automated pipelines with no manual intervention. The workflow follows five stages: (1) Training code (\texttt{experiments/*.py}) executes deterministic experiments, (2) 153 experiments produce raw JSON results (\texttt{results/raw*/}), (3) aggregation scripts (\texttt{aggregate\_results.py}) extract data to CSV, (4) analysis scripts generate LaTeX tables and PNG figures programmatically, (5) paper artifacts are automatically generated with no manual transcription. Every number in this paper is programmatically verified from raw experimental data.
\textbf{Pipeline stages:}
\begin{enumerate}
\item \textbf{Training execution:} All 153 experiments documented in \texttt{EXPERIMENTS\_REFERENCE.sh} with exact commands, hyperparameters, and seeds. Each run produces:
\begin{itemize}
\item \texttt{results.json}: Final metrics (best accuracy, test accuracy)
\item \texttt{config.json}: Complete training configuration
\end{itemize}
\end{enumerate}
We use $p < 0.05$ as significance threshold but emphasize \textit{effect sizes (Cohen's d)} and \textit{practical significance} (absolute accuracy gap) over p-values. With n=3 seeds, statistical power is limited; we focus on large, consistent effects (d > 0.8) as practically meaningful rather than marginal p-values.
\textbf{Multiple comparisons:} While we conduct 153 experiments total, these are not 153 independent hypothesis tests—most are exploratory ablations investigating the same core phenomena (layer sensitivity, KD effects) across different architectures and datasets. We report \textit{uncorrected} p-values for transparency and focus statistical testing on key confirmatory comparisons: (1) BitNet vs FP32 (quantization gap), (2) BitNet vs BitNet+KD (KD failure), (3) BitNet+Recipe vs FP32+KD (recipe effectiveness). For exploratory ablations (layer sensitivity, architecture variations), we emphasize effect sizes and replication consistency over p-values.
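The effect-size computation we rely on is standard. A minimal sketch of Cohen's d with a pooled standard deviation follows (the accuracy values are hypothetical, not drawn from our results):

```python
import math

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (sample variance, n-1)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# three seeds per configuration, as in the paper (numbers hypothetical)
fp32   = [94.1, 94.3, 94.2]
bitnet = [92.0, 92.4, 92.2]
d = cohens_d(fp32, bitnet)  # well above the d > 0.8 "large effect" bar
```

With n=3 seeds, only effects of this magnitude survive the low statistical power, which is why the analysis foregrounds d over marginal p-values.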
File: paper/sections/04_architecture.tex (+1, -1), subsection "Implementation and Implications for Quantization"
\textbf{Generalization to Tiny-ImageNet (64×64):} We apply the same principle to Tiny-ImageNet: 3×3 conv stride 1, no maxpool, preserving 64×64 → 64×64 spatial resolution. This is justified by image resolution: 64×64 is closer to CIFAR's 32×32 than ImageNet's 224×224. Our FP32 baselines (67.83\% ResNet-18, 71.77\% ResNet-50) align with published results~\cite{tiny_imagenet_benchmark}, validating our architectural choice.
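The arithmetic behind this stem choice follows directly from the standard convolution output-size formula; a small illustrative sketch:

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((H + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Standard ImageNet stem on a 64x64 input: 7x7 conv stride 2 (pad 3),
# then 3x3 maxpool stride 2 (pad 1) -- three quarters of the resolution gone.
h = conv_out(64, 7, 2, 3)   # -> 32
h = conv_out(h, 3, 2, 1)    # -> 16

# CIFAR-style stem: 3x3 conv stride 1 (pad 1), no maxpool -- resolution kept.
assert conv_out(64, 3, 1, 1) == 64
```

Preserving the 64×64 grid leaves the residual stages with the same relative downsampling budget they have on CIFAR's 32×32 inputs.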
\textbf{Summary:} CIFAR-adapted stems are (1) standard practice in CIFAR literature, (2) empirically validated (our FP32 baselines match published results), (3) necessary for fair comparison (all models use matched architectures), and (4) applied consistently (all 153 experiments use \texttt{--use-cifar-stem}). This architectural foundation ensures our quantization findings reflect true ternary penalties, not architectural mismatches.
Three patterns emerge from our 153 controlled experiments. First, quantization sensitivity is highly asymmetric: conv1 alone accounts for 30-74\% of recoverable accuracy loss despite representing only 0.08\% of parameters—2.5× more effective than any other layer. This asymmetry reveals that information bottlenecks at network entry are more critical for ternary quantization than commonly assumed.
Second, knowledge distillation exhibits a counterintuitive failure mode in extreme quantization. While KD benefits full-precision students on complex tasks (+0.9\% to +1.6\%), it actively degrades ternary networks (-0.9\% to -3.1\%), with degradation scaling by task complexity. This reveals capacity constraints unique to ternary weights \{-1, 0, +1\}, where soft labels overwhelm limited representational capacity.
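For reference, the distillation objective in question is the standard Hinton-style loss combining hard-label cross-entropy with a temperature-scaled KL term; a minimal sketch (hyperparameter values are illustrative, and this is not our training code):

```python
import math

def softmax(z, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    e = [math.exp(v / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """Hinton-style distillation loss:
    (1 - alpha) * CE(hard label) + alpha * T^2 * KL(teacher_soft || student_soft)."""
    ce = -math.log(softmax(student_logits)[label])
    q_t = softmax(teacher_logits, T)
    q_s = softmax(student_logits, T)
    kl = sum(t * math.log(t / s) for t, s in zip(q_t, q_s))
    return (1 - alpha) * ce + alpha * T * T * kl
```

The $\alpha$-weighted soft-label term is exactly the component hypothesized to overwhelm ternary capacity, which is why lowering $\alpha$ is a natural follow-up experiment.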
File: paper/sections/06_discussion.tex (+1, -1), subsection "Limitations"
For example, ResNet-50 CIFAR-10 shows 0.35\% gap with p=0.066 (non-significant). We address this limitation by: (1) reporting Cohen's d effect sizes alongside p-values, (2) focusing on large, consistent effects (d > 0.8), and (3) emphasizing practical significance over marginal p-values. Increasing to n=10 seeds would require re-running 450 experiments (~2,700 additional GPU-hours), which we leave for future work.
\textbf{3. Multiple comparisons:} We report uncorrected p-values for 153 experiments without Bonferroni or FDR correction. While most experiments are exploratory ablations rather than independent hypothesis tests, formal correction (Bonferroni $\alpha$/153 = 0.00033) would render many marginally significant results (p $\sim$ 0.01-0.05) non-significant. We mitigate this by: (1) focusing on large effect sizes (Cohen's d > 0.8) rather than marginal p-values, (2) emphasizing replication across datasets and architectures, and (3) clearly distinguishing confirmatory tests (core findings) from exploratory analyses (ablations). Future work should pre-register key hypotheses and apply appropriate corrections.
\textbf{4. CIFAR-scale optimization:} Our training recipe (200 epochs, SGD, basic augmentation) is validated on CIFAR-10, CIFAR-100, and Tiny-ImageNet but not optimized for ImageNet-scale deployment. Modern ternary methods achieve strong ImageNet results (ReActNet: 69.4\%~\cite{liu2020reactnet}, BNext: 80.6\%~\cite{guo2022bnext}) through specialized techniques: