Non-record: CoDA-GQA Differential Attention — First Differential Attention Submission (val_bpb=1.1580)#932
Conversation
First differential attention submission in the competition. CoDA-GQA (Maio, 2026): orthogonal rotation produces the noise query from the signal query (no second W_q), dual SDPA over the same KV, input-dependent lambda gating, smooth on-ramp initialization. ~48K extra params (0.2% overhead). Compatible with FA3. Answers OpenAI's interest in novel attention mechanisms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
q layout is [B, T, H, D], not [B, H, T, D]. cos_t broadcast: [None, :, None, :] → [None, None, :, :]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Controlled A/B:
- Baseline (CoDA OFF): 1.1459 bpb, 5732 steps, 104.7 ms/step
- CoDA ON: 1.1580 bpb, 4404 steps, 136.3 ms/step
- Delta: +0.0121 bpb (worse), driven by the ~30% per-step overhead

CoDA trains stably, but the dual-SDPA cost outweighs the noise-cancellation benefit within 600 seconds. It needs the unlimited compute track for the differential mechanism to fully activate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new non-record Parameter Golf submission that implements CoDA-GQA differential attention (signal attention minus gated “noise” attention from an orthogonally rotated query), along with the training script snapshot and submission metadata.
Changes:
- Introduces CoDA-GQA differential attention into the model’s self-attention path (rotated noise query + learned lambda gate).
- Adds/updates evaluation utilities (incl. sliding-window eval and optional n-gram backoff eval) and export/quantization flow for the submission artifact.
- Adds submission metadata + README explaining the method, plus an ablation baseline log.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/train_gpt.py | Main training/eval/export script for the CoDA-GQA submission |
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/submission.json | Leaderboard metadata for the non-record submission |
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/ablation_baseline_8xH100.log | Baseline run log for the ablation comparison |
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/README.md | Method + reproduction notes for the submission |
```python
log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
```
Similarly, `final_int6_roundtrip_exact` is printed again with the stride-64 sliding-window metrics (`sw64_val_*`), which can overwrite or confuse the true roundtrip metric in downstream parsing. Recommend removing this line or renaming it to a distinct sliding-window tag.
```python
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]
```
`eval_val_sliding` includes window starts all the way to `total_tokens`, which creates many tail windows where `end = min(ws + seq_len, total_tokens)` clamps to `total_tokens`. With the later rule of scoring only the last `stride` tokens per window, this ends up scoring the final `stride` tokens multiple times across those tail windows and biases the metric. Consider generating only full-length windows (plus at most one final window for the remainder) and choosing the scored region for the final window so that each token contributes exactly once.
Suggested change (replace the list comprehension with an explicit construction):

```python
# Construct window starts so that we have only full-length windows
# (of length seq_len), plus at most one final window that reaches
# the end of the sequence. This avoids creating many tail windows
# that all clamp their end to total_tokens and would otherwise
# score the final stride tokens multiple times.
if total_tokens <= seq_len:
    window_starts = [0] if total_tokens > 0 else []
else:
    last_full_start = total_tokens - seq_len
    # Full windows starting every `stride` tokens
    window_starts = list(range(0, last_full_start + 1, stride))
    # If the last full window start is not aligned with `stride`,
    # add one extra window to cover the tail once.
    if window_starts and window_starts[-1] != last_full_start:
        window_starts.append(last_full_start)
```
```python
window_starts = sorted([ws for ws in range(0, total_tokens, stride)
                        if min(ws + seq_len, total_tokens) - ws >= 1])
```
`eval_val_ngram_backoff` builds `window_starts` up to `total_tokens` (including tail windows whose end clamps to `total_tokens`). Because scoring later is done on `s = max(wlen - stride, 0)`, the last `stride` tokens can be scored repeatedly across multiple tail windows, biasing the reported metric and cache statistics. Recommend using the same “score each token once” window construction as the sliding-window evaluator (full windows plus one remainder window with a shorter scored suffix).
Suggested change (replace the sorted comprehension with an explicit construction):

```python
# Build evaluation windows so that each token is scored at most once:
# full windows advanced by `stride`, plus at most one remainder window.
window_starts: list[int] = []
if total_tokens > 0:
    # Number of targets to score is `total_tokens` (positions [0, total_tokens - 1]).
    # If we have fewer tokens than a full sequence length, just use a single window.
    if total_tokens <= seq_len:
        window_starts = [0]
    else:
        # Full windows: ws such that ws + seq_len <= total_tokens, stepped by `stride`.
        last_full_start = total_tokens - seq_len
        full_starts = list(range(0, last_full_start + 1, stride))
        window_starts.extend(full_starts)
        last_full_end = full_starts[-1] + seq_len if full_starts else 0
        # If there is a tail after the last full window, add a single remainder window.
        if last_full_end < total_tokens:
            remainder_start = max(total_tokens - seq_len, 0)
            if not full_starts or remainder_start > full_starts[-1]:
                window_starts.append(remainder_start)
```
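The invariant behind both suggestions — every target token scored exactly once — can be checked with a small standalone sketch. The helper names below (`make_window_starts`, `scored_counts`) are illustrative, not taken from the submission; `scored_counts` models the scoring rule where each window scores only the tokens not already covered by the previous window (the first window scores everything, the remainder window a shorter suffix):

```python
def make_window_starts(total_tokens: int, seq_len: int, stride: int) -> list[int]:
    """Full-length windows stepped by `stride`, plus at most one
    remainder window that reaches the end of the sequence."""
    if total_tokens <= seq_len:
        return [0] if total_tokens > 0 else []
    last_full_start = total_tokens - seq_len
    starts = list(range(0, last_full_start + 1, stride))
    if starts[-1] != last_full_start:
        starts.append(last_full_start)  # one extra window to cover the tail once
    return starts

def scored_counts(total_tokens: int, seq_len: int, stride: int) -> list[int]:
    """How many times each token position gets scored when every window
    scores only the positions not covered by the previous window."""
    counts = [0] * total_tokens
    prev_end = 0
    for ws in make_window_starts(total_tokens, seq_len, stride):
        end = min(ws + seq_len, total_tokens)
        for t in range(max(ws, prev_end), end):
            counts[t] += 1
        prev_end = end
    return counts
```

For example, with `total_tokens=1000, seq_len=128, stride=64` every position ends up with a count of exactly 1, whereas tail windows generated by `range(0, total_tokens, stride)` would re-score the final tokens several times.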
```python
log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
if args.eval_stride != 64 and 64 < sw_seq_len:
```
The log tag `final_int6_roundtrip_exact` is printed a second time using the sliding-window metrics (`sw_val_*`). This makes the logs ambiguous for anyone parsing metrics (and it is inconsistent with the earlier `final_int6_roundtrip_exact` that reports the actual roundtrip eval). Consider removing this duplicate line or renaming it to a sliding-window-specific tag.
Non-Record Submission: CoDA-GQA Differential Attention
First application of differential attention to Parameter Golf. CoDA-GQA (Maio, 2026) sharpens attention by subtracting a gated inhibitory noise stream, where the noise query is produced via learnable orthogonal rotation — eliminating the need for a second W_q matrix.
Controlled Ablation (8×H100 SXM, 600s, seed=1337)

| Config | val_bpb | Steps completed | ms/step |
|---|---|---|---|
| Baseline (CoDA OFF) | 1.1459 | 5732 | 104.7 |
| CoDA ON | 1.1580 | 4404 | 136.3 |
Finding
CoDA trains stably from scratch without model collapse (the lambda gates open smoothly from sigmoid(-6) ≈ 0.0025). However, the ~30% per-step overhead from the dual SDPA outweighs the noise-cancellation benefit within a 600-second budget, leaving the result 0.012 bpb worse than baseline.
The architecture likely needs the unlimited compute track (4+ hours) where the differential mechanism has time to fully activate — consistent with the original CoDA-GQA-L paper's 2-phase training protocol (2000 steps differential + 600 steps memory).
How CoDA Works
Parameter overhead: ~48K params (0.2% of 27M model). No second W_q matrix needed.
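As a rough single-head NumPy illustration of the mechanism described above (a sketch, not the submission's actual implementation): `rot` stands in for the learnable orthogonal rotation applied to the signal query, and `lam_logit` for the gate logit, which the real method makes input-dependent but is a scalar here for brevity. Both names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coda_attention(q, k, v, rot, lam_logit):
    """Differential attention sketch: signal stream minus a gated noise stream.

    q, k, v: [T, D] single-head tensors; rot: [D, D] orthogonal matrix
    (plays the role of the learnable rotation, so no second W_q is needed);
    lam_logit: gate logit (scalar here; input-dependent in the method).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    signal = softmax(q @ k.T * scale) @ v       # standard SDPA
    q_noise = q @ rot                           # noise query rotated from the signal query
    noise = softmax(q_noise @ k.T * scale) @ v  # second SDPA reuses the same K/V
    lam = 1.0 / (1.0 + np.exp(-lam_logit))      # sigmoid gate
    return signal - lam * noise
```

At the smooth on-ramp initialization (`lam_logit = -6`, so sigmoid(-6) ≈ 0.0025) the gate is nearly closed and the output is almost exactly standard attention, which is why training starts without collapse.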
Credits