@@ -89,7 +89,7 @@ Unlike fixed ternary {-1, 0, +1}, TTQ learns three FP32 parameters per layer:
 3. **NaN losses:** Parameters could go negative without constraints
    - Fixed: Softplus enforcement
 
-4. **CRITICAL: Activation quantization incompatibility** (final fix)
+4. **CRITICAL: Activation quantization incompatibility**
    - Bug: Mixing BitNet's activation quant/dequant with TTQ weights caused training to fail (stuck at 10%)
    - Root cause: TTQ weights are pre-scaled {-wn, 0, +wp} but BitNet's dequant expects unscaled {-1, 0, +1}
    - Tested 4 configs:
@@ -99,6 +99,22 @@ Unlike fixed ternary {-1, 0, +1}, TTQ learns three FP32 parameters per layer:
      - D: Pure TTQ (no activation quant) → **WORKS** (49% accuracy in 2 epochs!)
    - Solution: Use pure TTQ as in the original paper (ternary weights, FP32 activations)
 
+5. **CRITICAL: Zero gradients to TTQ parameters** (final fix)
+   - Bug: After fixing activation quantization, training still stuck at 10% on server
+   - Root cause: PyTorch indexing assignment breaks gradient flow to scalar parameters
+
+   ```python
+   quantized[pos_mask] = wp_pos  # ❌ Breaks gradients to wp_pos
+   ```
+
+   - Diagnosis: Gradient verification showed 0/63 TTQ parameters had gradients
+   - Additional bug: Softplus initialization caused delta to be 5-7x too large after transformation
+   - Solution:
+     - Implement a custom `TTQQuantizeFunction` with an explicit backward pass
+     - Gradients: `grad_wp = (grad_output * pos_mask).sum()`
+     - Initialize parameters with the inverse softplus to get the correct values after transformation
+   - Verified: wp.grad = 0.047, wn.grad = 0.046 (proper gradient flow)
+
 ## Expected Behavior
 
 - **Accuracy:** Should achieve ~0.5-1.5% higher accuracy than BitNet+Recipe (based on literature)
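A note on the root cause in item 4: because TTQ stores the scale inside the ternary weight itself, running BitNet-style dequant on top applies a second scale. A minimal numeric sketch with made-up values (all names and numbers here are illustrative, not from the repo):

```python
# BitNet convention (as described in item 4): store unscaled {-1, 0, +1}
# codes and multiply a per-layer scale back in at dequant time.
bitnet_scale = 0.8           # hypothetical per-layer scale
code = 1.0                   # unscaled ternary code
effective = code * bitnet_scale          # intended effective weight: 0.8

# TTQ convention: the weight is already pre-scaled to {-wn, 0, +wp}.
wp = 0.8                     # learned positive scale
prescaled = wp               # this value IS the effective weight
double_scaled = prescaled * bitnet_scale # wrongly scaled again: ≈ 0.64
```

Every forward pass shrinks the effective weights this way, which is consistent with training collapsing to chance accuracy.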
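The explicit backward pass in item 5 can be sketched outside PyTorch. The NumPy mock below uses hypothetical values; only `grad_wp = (grad_output * pos_mask).sum()` comes from the notes above, and the sign convention for `grad_wn` is my assumption. It shows why masked sums recover the scalar-parameter gradients that in-place indexed assignment loses:

```python
import numpy as np

# Hypothetical layer weights and learned TTQ scalars (illustrative values).
w = np.array([0.9, -0.7, 0.1, -0.05, 0.6])
wp, wn, delta = 0.8, 0.75, 0.3        # positive/negative scales, threshold

# Forward: ternarize to {-wn, 0, +wp}.
pos_mask = w > delta
neg_mask = w < -delta
quantized = np.where(pos_mask, wp, np.where(neg_mask, -wn, 0.0))

# Backward: explicit gradients for the scalar scales -- what a custom
# autograd Function computes instead of relying on indexed assignment.
grad_output = np.array([0.1, -0.2, 0.3, 0.4, 0.5])  # made-up upstream grads
grad_wp = (grad_output * pos_mask).sum()    # entries that became +wp
grad_wn = -(grad_output * neg_mask).sum()   # entries that became -wn (sign assumed)
grad_w = grad_output                        # straight-through estimate for w
```

Because `wp` and `wn` enter the output only through their masked buckets, each scalar's gradient is the masked sum of the upstream gradient, which is exactly what the indexed-assignment version failed to propagate.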
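The inverse-softplus initialization from item 5's solution can be sketched with the stdlib (the target value is made up): if the effective parameter is `delta = softplus(raw)` to enforce positivity, then `raw` must be initialized at `softplus⁻¹(target)`, not at `target` itself, or the post-transformation value overshoots.

```python
import math

def softplus(x: float) -> float:
    return math.log1p(math.exp(x))

def inv_softplus(y: float) -> float:
    # Solve softplus(x) = y for x:  x = log(exp(y) - 1), valid for y > 0.
    return math.log(math.expm1(y))

target_delta = 0.05                 # hypothetical intended threshold
naive = softplus(target_delta)      # initializing raw = target overshoots
raw = inv_softplus(target_delta)    # correct raw initialization
recovered = softplus(raw)           # lands back on target_delta
```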