Non-record: CoDA-GQA Differential Attention — First Differential Attention Submission (val_bpb=1.1580)#932
Conversation
First differential attention submission in the competition. CoDA-GQA (Maio, 2026): orthogonal rotation produces the noise query from the signal query (no second W_q), dual SDPA over the same KV, input-dependent lambda gating, smooth on-ramp initialization. ~48K extra params (0.2% overhead). Compatible with FA3. Answers OpenAI's interest in novel attention mechanisms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
q layout is [B, T, H, D], not [B, H, T, D]. cos_t broadcast: [None, :, None, :] → [None, None, :, :]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Controlled A/B:
- Baseline (CoDA OFF): 1.1459 bpb, 5732 steps, 104.7 ms/step
- CoDA ON: 1.1580 bpb, 4404 steps, 136.3 ms/step
- Delta: +0.0121 bpb (worse), driven by the ~30% per-step overhead

CoDA trains stably, but the dual-SDPA cost outweighs the noise-cancellation benefit within 600 seconds. It needs the unlimited compute track for the differential mechanism to fully activate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new non-record Parameter Golf submission that implements CoDA-GQA differential attention (signal attention minus gated “noise” attention from an orthogonally rotated query), along with the training script snapshot and submission metadata.
Changes:
- Introduces CoDA-GQA differential attention into the model’s self-attention path (rotated noise query + learned lambda gate).
- Adds/updates evaluation utilities (incl. sliding-window eval and optional n-gram backoff eval) and export/quantization flow for the submission artifact.
- Adds submission metadata + README explaining the method, plus an ablation baseline log.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/train_gpt.py | Main training/eval/export script for the CoDA-GQA submission |
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/submission.json | Leaderboard metadata for the non-record submission |
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/ablation_baseline_8xH100.log | Baseline run log for the ablation comparison |
| records/track_non_record_16mb/2026-03-26_CoDA_GQA_DifferentialAttention/README.md | Method + reproduction notes for the submission |
```python
log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
```
Similarly, `final_int6_roundtrip_exact` is printed again with the stride-64 sliding-window metrics (`sw64_val_*`), which can overwrite or confuse the true roundtrip metric in downstream parsing. Recommend removing this line or renaming it to a distinct sliding-window tag.
```python
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]
```
`eval_val_sliding` includes window starts all the way to `total_tokens`, which creates many tail windows where `end = min(ws + seq_len, total_tokens)` clamps to `total_tokens`. With the later rule of scoring only the last `stride` tokens per window, this ends up scoring the final `stride` tokens multiple times across those tail windows and biases the metric. Consider generating only full-length windows (plus at most one final window for the remainder) and choosing the scored region for the final window so that each token contributes exactly once.
Suggested change (replace the list comprehension with an explicit construction):

```python
# Construct window starts so that we have only full-length windows
# (of length seq_len), plus at most one final window that reaches
# the end of the sequence. This avoids creating many tail windows
# that all clamp their end to total_tokens and would otherwise
# score the final stride tokens multiple times.
if total_tokens <= seq_len:
    window_starts = [0] if total_tokens > 0 else []
else:
    last_full_start = total_tokens - seq_len
    # Full windows starting every `stride` tokens
    window_starts = list(range(0, last_full_start + 1, stride))
    # If the last full window start is not aligned with `stride`,
    # add one extra window to cover the tail once.
    if window_starts and window_starts[-1] != last_full_start:
        window_starts.append(last_full_start)
```
```python
window_starts = sorted([ws for ws in range(0, total_tokens, stride)
                        if min(ws + seq_len, total_tokens) - ws >= 1])
```
`eval_val_ngram_backoff` builds `window_starts` up to `total_tokens` (including tail windows whose end clamps to `total_tokens`). Because scoring later is done on `s = max(wlen - stride, 0)`, the last `stride` tokens can be scored repeatedly across multiple tail windows, biasing the reported metric and cache statistics. Recommend using the same “score each token once” window construction as the sliding-window evaluator (full windows plus one remainder window with a shorter scored suffix).
Suggested change (replace the sorted comprehension with an explicit construction):

```python
# Build evaluation windows so that each token is scored at most once:
# full windows advanced by `stride`, plus at most one remainder window.
window_starts: list[int] = []
if total_tokens > 0:
    # Number of targets to score is `total_tokens` (positions [0, total_tokens - 1]).
    # If we have fewer tokens than a full sequence length, just use a single window.
    if total_tokens <= seq_len:
        window_starts = [0]
    else:
        # Full windows: ws such that ws + seq_len <= total_tokens, stepped by `stride`.
        last_full_start = total_tokens - seq_len
        full_starts = list(range(0, last_full_start + 1, stride))
        window_starts.extend(full_starts)
        last_full_end = full_starts[-1] + seq_len if full_starts else 0
        # If there is a tail after the last full window, add a single remainder window.
        if last_full_end < total_tokens:
            remainder_start = max(total_tokens - seq_len, 0)
            if not full_starts or remainder_start > full_starts[-1]:
                window_starts.append(remainder_start)
```
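The invariant behind both suggestions — every target token scored exactly once — can be checked with a small standalone sketch. The helper names below (`make_window_starts`, `scored_counts`) are illustrative, not taken from the submission; `scored_counts` models the scoring rule where each window scores only the tokens not already covered by the previous window (the first window scores everything, the remainder window a shorter suffix):

```python
def make_window_starts(total_tokens: int, seq_len: int, stride: int) -> list[int]:
    """Full-length windows stepped by `stride`, plus at most one
    remainder window that reaches the end of the sequence."""
    if total_tokens <= seq_len:
        return [0] if total_tokens > 0 else []
    last_full_start = total_tokens - seq_len
    starts = list(range(0, last_full_start + 1, stride))
    if starts[-1] != last_full_start:
        starts.append(last_full_start)  # one extra window to cover the tail once
    return starts

def scored_counts(total_tokens: int, seq_len: int, stride: int) -> list[int]:
    """How many times each token position gets scored when every window
    scores only the positions not covered by the previous window."""
    counts = [0] * total_tokens
    prev_end = 0
    for ws in make_window_starts(total_tokens, seq_len, stride):
        end = min(ws + seq_len, total_tokens)
        for t in range(max(ws, prev_end), end):
            counts[t] += 1
        prev_end = end
    return counts
```

For example, with `total_tokens=1000, seq_len=128, stride=64` every position ends up with a count of exactly 1, whereas tail windows generated by `range(0, total_tokens, stride)` would re-score the final tokens several times.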
```python
log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
if args.eval_stride != 64 and 64 < sw_seq_len:
```
The log tag `final_int6_roundtrip_exact` is printed a second time using the sliding-window metrics (`sw_val_*`). This makes the logs ambiguous for anyone parsing metrics (and it is inconsistent with the earlier `final_int6_roundtrip_exact` that reports the actual roundtrip eval). Consider removing this duplicate line or renaming it to a sliding-window-specific tag.
Non-Record Submission: CoDA-GQA Differential Attention
First application of differential attention to Parameter Golf. CoDA-GQA (Maio, 2026) sharpens attention by subtracting a gated inhibitory noise stream, where the noise query is produced via learnable orthogonal rotation — eliminating the need for a second W_q matrix.
Controlled Ablation (8×H100 SXM, 600s, seed=1337)

| Config | val_bpb | Steps completed | ms/step |
|---|---|---|---|
| Baseline (CoDA OFF) | 1.1459 | 5732 | 104.7 |
| CoDA ON | 1.1580 | 4404 | 136.3 |
Finding
CoDA trains stably from scratch without model collapse (the lambda gates open smoothly from sigmoid(-6) ≈ 0.0025). However, the ~30% per-step overhead from the dual SDPA outweighs the noise-cancellation benefit within a 600-second budget, leaving the result 0.012 bpb worse than baseline.
The architecture likely needs the unlimited compute track (4+ hours) where the differential mechanism has time to fully activate — consistent with the original CoDA-GQA-L paper's 2-phase training protocol (2000 steps differential + 600 steps memory).
How CoDA Works
Parameter overhead: ~48K params (0.2% of 27M model). No second W_q matrix needed.
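As a rough single-head NumPy illustration of the mechanism described above (a sketch, not the submission's actual implementation): `rot` stands in for the learnable orthogonal rotation applied to the signal query, and `lam_logit` for the gate logit, which the real method makes input-dependent but is a scalar here for brevity. Both names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coda_attention(q, k, v, rot, lam_logit):
    """Differential attention sketch: signal stream minus a gated noise stream.

    q, k, v: [T, D] single-head tensors; rot: [D, D] orthogonal matrix
    (plays the role of the learnable rotation, so no second W_q is needed);
    lam_logit: gate logit (scalar here; input-dependent in the method).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    signal = softmax(q @ k.T * scale) @ v       # standard SDPA
    q_noise = q @ rot                           # noise query rotated from the signal query
    noise = softmax(q_noise @ k.T * scale) @ v  # second SDPA reuses the same K/V
    lam = 1.0 / (1.0 + np.exp(-lam_logit))      # sigmoid gate
    return signal - lam * noise
```

At the smooth on-ramp initialization (`lam_logit = -6`, so sigmoid(-6) ≈ 0.0025) the gate is nearly closed and the output is almost exactly standard attention, which is why training starts without collapse.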
Credits