Commit c56cc11

Document TTQ gradient flow bug and fix
1 parent: 93247b5

1 file changed: 17 additions, 1 deletion

File tree

TTQ_VERIFICATION.md
@@ -89,7 +89,7 @@ Unlike fixed ternary {-1, 0, +1}, TTQ learns three FP32 parameters per layer:

 3. **NaN losses:** Parameters could go negative without constraints
    - Fixed: Softplus enforcement

-4. **CRITICAL: Activation quantization incompatibility** (final fix)
+4. **CRITICAL: Activation quantization incompatibility**
    - Bug: Mixing BitNet's activation quant/dequant with TTQ weights caused training to fail (stuck at 10%)
    - Root cause: TTQ weights are pre-scaled {-wn, 0, +wp} but BitNet's dequant expects unscaled {-1, 0, +1}
    - Tested 4 configs:
@@ -99,6 +99,22 @@ Unlike fixed ternary {-1, 0, +1}, TTQ learns three FP32 parameters per layer:
    - D: Pure TTQ (no activation quant) → **WORKS** (49% accuracy in 2 epochs!)
    - Solution: Use pure TTQ as in original paper (ternary weights, FP32 activations)
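The winning config D (pure TTQ: learned ternary weights, FP32 activations) can be sketched as below. This is a minimal illustration, not the repo's actual code; the function name, the threshold rule `delta = t * max|W|`, and the example values are assumptions.

```python
import torch

def ttq_quantize(w, wp, wn, t=0.05):
    """Sketch of pure-TTQ weight quantization: weights map to the learned
    ternary set {-wn, 0, +wp}; activations are left in FP32 (config D)."""
    delta = t * w.abs().max()          # assumed per-layer threshold rule
    pos_mask = (w > delta).float()
    neg_mask = (w < -delta).float()
    return wp * pos_mask - wn * neg_mask

w = torch.tensor([[0.9, -0.7, 0.01],
                  [0.3, -0.02, -0.5]])
q = ttq_quantize(w, wp=torch.tensor(1.2), wn=torch.tensor(0.8))
# q contains only values from {-0.8, 0.0, +1.2}
```

Because the quantized weights already carry the learned magnitudes wp and wn, no separate dequant step (and hence none of BitNet's activation dequant machinery) is needed.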

+5. **CRITICAL: Zero gradients to TTQ parameters** (final fix)
+   - Bug: After fixing activation quantization, training still stuck at 10% on server
+   - Root cause: PyTorch indexing assignment breaks gradient flow to scalar parameters
+     ```python
+     quantized[pos_mask] = wp_pos  # ❌ Breaks gradients to wp_pos
+     ```
+   - Diagnosis: Gradient verification showed 0/63 TTQ parameters had gradients
+   - Additional bug: Softplus initialization caused delta to be 5-7x too large after transformation
+   - Solution:
+     - Implement custom `TTQQuantizeFunction` with explicit backward pass
+     - Gradients: `grad_wp = (grad_output * pos_mask).sum()`
+     - Initialize parameters with inverse softplus to get correct values after transformation
+   - Verified: wp.grad=0.047, wn.grad=0.046 (proper gradient flow)
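The item-5 fix can be sketched as below: a custom autograd Function whose backward explicitly routes gradients to the scalar parameters, plus inverse-softplus initialization. Only the `grad_wp` formula above comes from the doc; the `grad_wn` sign, the per-region straight-through rule for the latent weights (taken from the TTQ paper), and all names and values are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

class TTQQuantizeFunction(torch.autograd.Function):
    """Ternary quantize with an explicit backward pass, so the scalar
    wp/wn parameters actually receive gradients."""

    @staticmethod
    def forward(ctx, w, wp, wn, delta):
        pos_mask = (w > delta).to(w.dtype)
        neg_mask = (w < -delta).to(w.dtype)
        ctx.save_for_backward(pos_mask, neg_mask, wp, wn)
        return wp * pos_mask - wn * neg_mask   # values in {-wn, 0, +wp}

    @staticmethod
    def backward(ctx, grad_output):
        pos_mask, neg_mask, wp, wn = ctx.saved_tensors
        mid_mask = 1.0 - pos_mask - neg_mask
        # Explicit gradients to the scalar parameters:
        grad_wp = (grad_output * pos_mask).sum()
        grad_wn = -(grad_output * neg_mask).sum()
        # Straight-through-style gradient to the latent weights, scaled
        # per region (wp / 1 / wn) as in the TTQ paper:
        grad_w = grad_output * (wp * pos_mask + mid_mask + wn * neg_mask)
        return grad_w, grad_wp, grad_wn, None

def inverse_softplus(y):
    # Pick raw so that softplus(raw) == y; expm1 keeps it stable for small y
    return torch.log(torch.expm1(torch.as_tensor(y, dtype=torch.float32)))

torch.manual_seed(0)
raw_wp = torch.nn.Parameter(inverse_softplus(1.0))  # softplus(raw_wp) == 1.0
raw_wn = torch.nn.Parameter(inverse_softplus(1.0))
w = torch.nn.Parameter(torch.randn(8, 8))
delta = (0.05 * w.abs().max()).detach()
q = TTQQuantizeFunction.apply(w, F.softplus(raw_wp), F.softplus(raw_wn), delta)
q.sum().backward()
print(raw_wp.grad is not None, raw_wn.grad is not None)  # gradients now flow
```

Because forward builds the output from mask multiplications rather than indexing assignment, and backward is written by hand, the 0/63 zero-gradient failure mode cannot recur silently.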

 ## Expected Behavior

 - **Accuracy:** Should achieve ~0.5-1.5% better than BitNet+Recipe (based on literature)
