Commit 90f58ae

Fix critical TTQ double-scaling bug
Bug: TTQ quantized weights are ALREADY scaled by wp/wn:

    w_quant = {-wn_pos, 0, +wp_pos}

But we were applying beta = (wp_pos + wn_pos) / 2 in dequantization, scaling AGAIN. This is double-scaling.

BitNet comparison:
- BitNet: quantize to {-1, 0, +1}, then scale with beta
- TTQ: quantize to {-wn, 0, +wp} (already scaled!), so beta should be 1.0

Fix: Set beta = 1.0, since TTQ weights are pre-scaled. This was causing the model to be stuck at 10% accuracy.
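The double-scaling can be seen in a minimal sketch. The `ttq_quantize` below is a hypothetical stand-in for the repo's helper, assuming the standard TTQ scheme (threshold on |w|, learned per-layer scales wp/wn baked into the output):

```python
import torch

def ttq_quantize(w, wp, wn, delta):
    # Hypothetical sketch: weights above the threshold become +wp,
    # below -threshold become -wn, the rest zero. The output already
    # carries the learned scales wp/wn -- it is NOT in {-1, 0, +1}.
    threshold = delta * w.abs().max()
    w_quant = torch.where(w > threshold, wp,
              torch.where(w < -threshold, -wn, torch.zeros_like(w)))
    return w_quant, wp, wn

w = torch.tensor([0.9, -0.8, 0.05])
wp = torch.tensor(0.5)
wn = torch.tensor(0.4)
w_quant, wp_pos, wn_pos = ttq_quantize(w, wp, wn, delta=0.5)

# w_quant is already {+0.5, -0.4, 0.0}; multiplying by
# beta = (wp + wn) / 2 = 0.45 afterwards scales a second time.
double_scaled = w_quant * (wp_pos + wn_pos) / 2
```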
1 parent 28594ae commit 90f58ae

2 files changed

Lines changed: 8 additions & 6 deletions


bitnet/nn/ttq_conv2d.py

Lines changed: 4 additions & 3 deletions
@@ -32,11 +32,12 @@ def forward(self, x: Tensor) -> Tensor:
         x = f.layer_norm(x, x.shape[1:])
         x_quant, gamma = quantize_activations(x, self.num_bits)
 
-        # TTQ weight quantization with learned scales
+        # TTQ weight quantization with learned scales (already scaled!)
         w_quant, wp_pos, wn_pos = ttq_quantize(self.weight, self.wp, self.wn, self.delta)
 
-        # Use average of positive scales as beta for dequantization
-        beta = (wp_pos + wn_pos) / 2
+        # Beta = 1.0 because quantized weights are already scaled by wp/wn
+        # Unlike BitNet, which scales {-1, 0, +1} with beta in dequant, TTQ pre-scales
+        beta = torch.ones_like(wp_pos)
 
         out = f.conv2d(x_quant, w_quant, self.bias, self.stride, self.padding, self.dilation, self.groups)
         return dequantize(out, gamma, beta, self.num_bits)

bitnet/nn/ttq_linear.py

Lines changed: 4 additions & 3 deletions
@@ -38,11 +38,12 @@ def forward(self, x: Tensor) -> Tensor:
         x = f.layer_norm(x, x.shape[1:])
         x_quant, gamma = quantize_activations(x, self.num_bits)
 
-        # TTQ weight quantization with learned scales
+        # TTQ weight quantization with learned scales (already scaled!)
         w_quant, wp_pos, wn_pos = ttq_quantize(self.weight, self.wp, self.wn, self.delta)
 
-        # Use average of positive scales as beta for dequantization
-        beta = (wp_pos + wn_pos) / 2
+        # Beta = 1.0 because quantized weights are already scaled by wp/wn
+        # Unlike BitNet, which scales {-1, 0, +1} with beta in dequant, TTQ pre-scales
+        beta = torch.ones_like(wp_pos)
 
         out = f.linear(x_quant, w_quant, self.bias)
         return dequantize(out, gamma, beta, self.num_bits)
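A quick numeric check of the fix, under the assumption (consistent with the diffs above) that `dequantize` multiplies the layer output by beta:

```python
import torch

# Toy pre-scaled TTQ weights: already in {-wn, 0, +wp}
wp, wn = 0.5, 0.4
w_quant = torch.tensor([wp, -wn, 0.0])
x = torch.tensor([1.0, 1.0, 1.0])

# 0.5 - 0.4 + 0.0 = 0.1, already at full-precision scale
out = x @ w_quant

beta_old = (wp + wn) / 2  # 0.45: scales the output a second time
beta_new = 1.0            # fix: weights are pre-scaled
wrong = out * beta_old    # 0.045 -> double-scaled
right = out * beta_new    # 0.1   -> correct
```

With the old beta, every layer's output is shrunk by roughly a factor of (wp + wn) / 2, which compounds across layers and plausibly explains the model collapsing to chance-level (10%) accuracy.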
