
16L XSA-all + INT4 MLP QAT + GPTQ + EMA + Partial RoPE + 30ep Cosine TTT #951

Open
Bharath-970 wants to merge 10 commits into openai:main from Bharath-970:int4-mlp-11l-qat

Conversation

@Bharath-970

Submission: 16L XSA-all + INT4 MLP QAT + GPTQ-lite + EMA + Partial RoPE + 30ep Cosine TTT

val_bpb: pending 8×H100 run (expected ~1.02–1.05)

Techniques

| Technique | Detail |
| --- | --- |
| XSA-all | All 16 layers share a single KV set from layer 0; saves ~1.3MB vs XSA6, funding 2 extra layers |
| INT4 nibble MLP + QAT | MLP weights packed to 4-bit with STE quantization-aware training |
| INT6 attention | Attention weights quantized to INT6 |
| GPTQ-lite clip search | Per-channel clip search for better quantization |
| EMA | Exponential moving average of weights, decay=0.999 |
| Partial RoPE | RoPE applied to 25% of head dims only |
| Bigram hash | 20480-entry bigram hash embedding (dim=128) |
| Trigram hash | 10240-entry trigram hash embedding (dim=64) |
| LeakyReLU(0.5)² | MLP activation: leaky_relu(x, 0.5)² |
| SmearGate | Residual mix gating |
| Muon + WD | Muon optimizer with weight decay=0.04 |
| U-Net skips | Skip connections between encoder/decoder layers |
| 30ep Cosine TTT | Full-model AdamW on val tokens for 30 epochs, cosine LR decay from 5e-4, per-layer LR (3× MLP-out, 0.5× MLP-in) |
| Sliding window eval | stride=32 after TTT |
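
To make the INT4 nibble packing concrete, here is a minimal sketch (not the submission's actual `train_gpt.py` code) of symmetric INT4 quantization and two-nibbles-per-byte packing; a real QAT setup would use per-channel scales and a straight-through estimator in the backward pass:

```python
import numpy as np

def quantize_int4(w, scale):
    # During QAT, round() is wrapped in a straight-through estimator:
    # forward rounds, backward passes gradients through unchanged.
    return np.clip(np.round(w / scale), -8, 7).astype(np.int8)

def pack_nibbles(q):
    # Two INT4 values per byte: shift to unsigned [0, 15], pack low/high nibble.
    u = (q.reshape(-1, 2) + 8).astype(np.uint8)
    return u[:, 0] | (u[:, 1] << 4)

def unpack_nibbles(packed):
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    return np.stack([lo, hi], axis=1).reshape(-1)

w = np.array([0.30, -0.11, 0.07, 0.52], dtype=np.float32)
scale = np.abs(w).max() / 7          # per-tensor scale, for illustration only
q = quantize_int4(w, scale)
packed = pack_nibbles(q)             # 4 weights -> 2 bytes
w_hat = unpack_nibbles(packed) * scale
```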

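The bigram hash embedding from the table can be sketched as follows; the hash multiplier and padding convention are illustrative assumptions, not the submission's exact scheme:

```python
import numpy as np

BIGRAM_BUCKETS = 20480   # table size from the PR description
EMB_DIM = 128

bigram_table = np.zeros((BIGRAM_BUCKETS, EMB_DIM), dtype=np.float32)

def bigram_bucket(prev_tok, tok):
    # Cheap multiplicative hash of the (prev, current) token pair into a
    # fixed number of buckets; collisions are simply tolerated by training.
    return (prev_tok * 1000003 + tok) % BIGRAM_BUCKETS

def bigram_embed(tokens):
    # Hashed bigram embedding per position (position 0 pairs with pad token 0).
    prev = np.concatenate([[0], tokens[:-1]])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]
```

The trigram variant is the same idea with three tokens hashed into 10240 buckets of dim 64.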
Architecture

  • 16 layers, 512 dim, 8 heads / 4 KV heads, MLP mult=3.5
  • Warmdown 5000 iters, 20k total steps, train seq len=2048
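
A sketch of the partial-RoPE idea, assuming the interleaved pair layout and head_dim=64 (so 25% means the first 16 dims are rotated, the rest passed through); this is illustrative, not the submission's exact kernel:

```python
import numpy as np

def partial_rope(x, positions, rope_frac=0.25, base=10000.0):
    """Apply RoPE to the first rope_frac of head dims; leave the rest as-is.

    x: (seq_len, head_dim) query or key slice for one head.
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rope_frac)        # 16 of 64 dims at 25%
    assert rot_dim % 2 == 0
    inv_freq = base ** (-np.arange(0, rot_dim, 2) / rot_dim)
    angles = np.outer(positions, inv_freq)     # (seq_len, rot_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]
    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # 2D rotation per dim pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)
```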

Files

  • records/track_10min_16mb/2026-03-25_16L_XSAall_GPTQ_EMA_PartialRoPE_TTT/train_gpt.py
  • records/track_10min_16mb/2026-03-25_16L_XSAall_GPTQ_EMA_PartialRoPE_TTT/submission.json
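
The GPTQ-lite clip search (per-row clip-ratio grid over [0.80, 1.00] minimizing MSE, per the commit notes) can be sketched like this for symmetric INT4; the grid resolution is an assumption:

```python
import numpy as np

def clip_search_int4(W, ratios=np.linspace(0.80, 1.00, 21)):
    """Per-row clip-ratio grid search: for each output row, pick the
    clipped scale that minimizes INT4 quantization MSE."""
    W_q = np.empty_like(W)
    for i, row in enumerate(W):
        amax = np.abs(row).max()
        if amax == 0.0:
            W_q[i] = 0.0
            continue
        best_err = np.inf
        for r in ratios:
            scale = (r * amax) / 7   # symmetric INT4: levels -8..7
            deq = np.clip(np.round(row / scale), -8, 7) * scale
            err = np.mean((deq - row) ** 2)
            if err < best_err:
                best_err, W_q[i] = err, deq
    return W_q
```

Since ratio 1.00 is in the grid, the search can never do worse than plain max-abs scaling; clipping trades a few saturated outliers for finer resolution on the bulk of the row.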

Commits

- XSA4: last 4 layers share K,V projections (layers 8-11 share from 4-7)
- GPTQ-lite: per-row clip ratio grid search [0.80..1.00] minimizing MSE
- EMA decay=0.999 weight averaging
- Partial RoPE: apply RoPE to only 25% of head dims
- Bigram20480 + Trigram5120 hash embeddings
- warmdown_iters=3500
- 13 layers (vs 12), mlp_mult=3.5 (vs 3.0), warmdown=4000
- XSA4: last 4 layers share K,V projections
- GPTQ-lite clip search, EMA decay=0.999, Partial RoPE 25%
- Bigram20480 + Trigram5120, still well within 16MB limit
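
The EMA weight averaging used here (decay=0.999) is a one-liner per step; a minimal sketch over a dict of arrays:

```python
import numpy as np

def ema_update(ema, params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    for k, v in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * v
    return ema

# After n steps tracking a fixed target, the EMA has closed a
# (1 - decay**n) fraction of the gap, so decay=0.999 effectively
# averages over roughly the last 1/(1 - 0.999) = 1000 steps.
```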
Score-first TTT LoRA (rank=8) on top of 14L XSA6 base:
LeakyReLU(0.5)^2, Bigram20480, Trigram10240, warmdown5000.
Update run_modal.py to point to new submission.
…ssion

Swap score-first LoRA TTT for the simpler and more effective cosine TTT
approach from PR openai#672 (1.0781 BPB): fine-tune all model weights on val
data for 30 epochs with cosine LR decay and per-layer LR groups (3x
MLP-out, 0.5x MLP-in), followed by sliding-window stride=64 eval.
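
The cosine TTT schedule described above (decay from 5e-4 with per-layer LR multipliers) can be sketched as follows; the group names are illustrative, not the submission's actual parameter names:

```python
import math

TOTAL_EPOCHS = 30
BASE_LR = 5e-4

# Per-parameter-group LR multipliers from the PR description.
LR_MULT = {"mlp_out": 3.0, "mlp_in": 0.5, "default": 1.0}

def cosine_lr(epoch, total_epochs=TOTAL_EPOCHS, base_lr=BASE_LR):
    """Cosine decay from base_lr at epoch 0 to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

def group_lr(epoch, group):
    """LR for one parameter group at the given TTT epoch."""
    return cosine_lr(epoch) * LR_MULT.get(group, 1.0)
```

In practice each multiplier maps to one AdamW parameter group whose `lr` is reset every epoch from this schedule.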
