
Non-record: CoDA-GQA Differential Attention — First Differential Attention Submission (val_bpb=1.1580)#932

Open
anthony-maio wants to merge 3 commits into openai:main from anthony-maio:submission/coda-gqa-parameter-golf

Conversation

@anthony-maio

Non-Record Submission: CoDA-GQA Differential Attention

First application of differential attention to Parameter Golf. CoDA-GQA (Maio, 2026) sharpens attention by subtracting a gated inhibitory noise stream, where the noise query is produced via learnable orthogonal rotation — eliminating the need for a second W_q matrix.

Controlled Ablation (8×H100 SXM, 600s, seed=1337)

Metric               Baseline (CoDA OFF)   CoDA ON    Delta
step_avg             104.7ms               136.3ms    +30.1%
Steps                5,732                 4,404      -23.2%
Sliding window bpb   1.1459                1.1580     +0.0121

Finding

CoDA trains stably from scratch without model collapse (the lambda gates activate smoothly from an initial sigmoid(-6) ≈ 0.0025). However, within the 600-second budget the ~30% per-step overhead of the dual SDPA outweighs the noise-cancellation benefit, leaving the final result 0.0121 bpb worse than baseline.

The architecture likely needs the unlimited compute track (4+ hours) where the differential mechanism has time to fully activate — consistent with the original CoDA-GQA-L paper's 2-phase training protocol (2000 steps differential + 600 steps memory).

How CoDA Works

q_noise   = PairwiseRotate(q, theta)        # orthogonal rotation (~0 params)
out_sig   = Attn(q, k, v)                   # signal stream
out_noise = Attn(q_noise, k, v)             # noise stream (same KV)
lambda    = sigmoid(Linear(x; bias=-6))     # input-dependent gate
out       = RMSNorm(out_sig - lambda * out_noise)

Parameter overhead: ~48K params (0.2% of 27M model). No second W_q matrix needed.
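For concreteness, the forward pass above can be sketched in plain NumPy (a minimal single-head illustration with hypothetical shapes; the actual submission runs dual SDPA kernels and applies a causal mask, both omitted here for brevity):

```python
import numpy as np

def pairwise_rotate(q, theta):
    """Rotate adjacent dim pairs of q by per-pair angles theta.
    Orthogonal, so it adds ~0 params and preserves query norms."""
    q1, q2 = q[..., 0::2], q[..., 1::2]
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    out = np.empty_like(q)
    out[..., 0::2] = q1 * cos_t - q2 * sin_t
    out[..., 1::2] = q1 * sin_t + q2 * cos_t
    return out

def attn(q, k, v):
    """Plain softmax attention over [T, D] tensors (no causal mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coda_attention(x, q, k, v, w_gate, theta, gate_bias=-6.0):
    q_noise = pairwise_rotate(q, theta)      # noise query from rotation, no second W_q
    out_sig = attn(q, k, v)                  # signal stream
    out_noise = attn(q_noise, k, v)          # inhibitory stream, same K/V
    lam = sigmoid(x @ w_gate + gate_bias)    # input-dependent gate, starts near 0.0025
    return rmsnorm(out_sig - lam * out_noise)
```

With `w_gate` initialized to zero the gate sits at sigmoid(-6) ≈ 0.0025, so training begins essentially as standard attention and the differential term ramps in smoothly.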

Credits

anthony-maio and others added 3 commits March 26, 2026 23:16
First differential attention submission in the competition.
CoDA-GQA (Maio, 2026): orthogonal rotation produces noise query
from signal query (no second W_q), dual SDPA with same KV,
input-dependent lambda gating, smooth on-ramp initialization.

~48K extra params (0.2% overhead). Compatible with FA3.
Answers OpenAI's interest in novel attention mechanisms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
q layout is [B, T, H, D], not [B, H, T, D].
cos_t broadcast: [None, :, None, :] → [None, None, :, :]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
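The layout fix in this commit can be illustrated with a small NumPy sketch (shapes chosen for illustration only): with q laid out as [B, T, H, D], the per-head rotation angles must broadcast over the batch and time axes, i.e. indexed `[None, None, :, :]` rather than `[None, :, None, :]`:

```python
import numpy as np

B, T, H, D = 2, 4, 3, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(B, T, H, D))     # [B, T, H, D] layout, not [B, H, T, D]
theta = rng.normal(size=(H, D // 2))  # per-head, per-pair rotation angles

# Broadcast over batch (axis 0) and time (axis 1), not batch and heads:
cos_t = np.cos(theta)[None, None, :, :]   # -> shape [1, 1, H, D/2]
sin_t = np.sin(theta)[None, None, :, :]

q1, q2 = q[..., 0::2], q[..., 1::2]
q_rot = np.empty_like(q)
q_rot[..., 0::2] = q1 * cos_t - q2 * sin_t
q_rot[..., 1::2] = q1 * sin_t + q2 * cos_t
```

Under the buggy `[None, :, None, :]` indexing, the head axis of theta would have been aligned against the time axis of q, which only fails loudly when T != H.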
Controlled A/B:
- Baseline (CoDA OFF): 1.1459 bpb, 5732 steps, 104.7ms/step
- CoDA ON: 1.1580 bpb, 4404 steps, 136.3ms/step
- Delta: +0.012 bpb (worse) due to 30% step overhead

CoDA trains stably but the dual-SDPA cost outweighs the
noise-cancellation benefit within 600 seconds. Needs unlimited
compute track for the differential mechanism to fully activate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 27, 2026 04:33
Contributor

Copilot AI left a comment


Pull request overview

Adds a new non-record Parameter Golf submission that implements CoDA-GQA differential attention (signal attention minus gated “noise” attention from an orthogonally rotated query), along with the training script snapshot and submission metadata.

Changes:

  • Introduces CoDA-GQA differential attention into the model’s self-attention path (rotated noise query + learned lambda gate).
  • Adds/updates evaluation utilities (incl. sliding-window eval and optional n-gram backoff eval) and export/quantization flow for the submission artifact.
  • Adds submission metadata + README explaining the method, plus an ablation baseline log.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.

File Description
records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/train_gpt.py Main training/eval/export script for the CoDA-GQA submission
records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/submission.json Leaderboard metadata for the non-record submission
records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/ablation_baseline_8xH100.log Baseline run log for the ablation comparison
records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/README.md Method + reproduction notes for the submission


Comment on lines +1604 to +1605
log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")

Copilot AI Mar 27, 2026


Similarly, final_int6_roundtrip_exact is printed again with the stride-64 sliding-window metrics (sw64_val_*), which can overwrite/confuse the true roundtrip metric in downstream parsing. Recommend removing or renaming this line to a distinct sliding-window tag.

Comment on lines +892 to +893
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]

Copilot AI Mar 27, 2026


eval_val_sliding includes window starts all the way to total_tokens, which creates many tail windows where end = min(ws + seq_len, total_tokens) clamps to total_tokens. With the later rule of scoring only the last stride tokens per window, this ends up scoring the final stride tokens multiple times across those tail windows and biases the metric. Consider generating only full-length windows (plus at most one final window for the remainder) and choosing the scored region for the final window so each token contributes exactly once.

Suggested change

window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]

# Construct window starts so that we have only full-length windows
# (of length seq_len), plus at most one final window that reaches
# the end of the sequence. This avoids creating many tail windows
# that all clamp their end to total_tokens and would otherwise
# score the final stride tokens multiple times.
if total_tokens <= seq_len:
    window_starts = [0] if total_tokens > 0 else []
else:
    last_full_start = total_tokens - seq_len
    # Full windows starting every `stride` tokens
    window_starts = list(range(0, last_full_start + 1, stride))
    # If the last full window start is not aligned with `stride`,
    # add one extra window to cover the tail once.
    if window_starts and window_starts[-1] != last_full_start:
        window_starts.append(last_full_start)

Comment on lines +1001 to +1002
window_starts = sorted([ws for ws in range(0, total_tokens, stride)
                        if min(ws + seq_len, total_tokens) - ws >= 1])

Copilot AI Mar 27, 2026


eval_val_ngram_backoff builds window_starts up to total_tokens (including tail windows whose end clamps to total_tokens). Because scoring later is done on s = max(wlen - stride, 0), the last stride tokens can be scored repeatedly across multiple tail windows, biasing the reported metric and cache statistics. Recommend using the same “score each token once” window construction as the sliding-window evaluator (full windows + one remainder window with a shorter scored suffix).

Suggested change

window_starts = sorted([ws for ws in range(0, total_tokens, stride)
                        if min(ws + seq_len, total_tokens) - ws >= 1])

# Build evaluation windows so that each token is scored at most once:
# full windows advanced by `stride`, plus at most one remainder window.
window_starts: list[int] = []
if total_tokens > 0:
    # Number of targets to score is `total_tokens` (positions [0, total_tokens - 1]).
    # If we have fewer tokens than a full sequence length, just use a single window.
    if total_tokens <= seq_len:
        window_starts = [0]
    else:
        # Full windows: ws such that ws + seq_len <= total_tokens, stepped by `stride`.
        last_full_start = total_tokens - seq_len
        full_starts = list(range(0, last_full_start + 1, stride))
        window_starts.extend(full_starts)
        last_full_end = full_starts[-1] + seq_len if full_starts else 0
        # If there is a tail after the last full window, add a single remainder window.
        if last_full_end < total_tokens:
            remainder_start = max(total_tokens - seq_len, 0)
            if not full_starts or remainder_start > full_starts[-1]:
                window_starts.append(remainder_start)

Comment on lines +1588 to +1590
log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
if args.eval_stride != 64 and 64 < sw_seq_len:

Copilot AI Mar 27, 2026


The log tag final_int6_roundtrip_exact is printed a second time using the sliding-window metrics (sw_val_*). This makes the logs ambiguous for anyone parsing metrics (and it’s inconsistent with the earlier final_int6_roundtrip_exact that reports the actual roundtrip eval). Consider removing this duplicate line or renaming it to a sliding-window-specific tag.
