
16L XSA-all + INT4 MLP QAT + GPTQ + EMA + Partial RoPE + 30ep Cosine TTT #951

Open
Bharath-970 wants to merge 10 commits into openai:main from Bharath-970:int4-mlp-11l-qat

Conversation

@Bharath-970

Submission: 16L XSA-all + INT4 MLP QAT + GPTQ-lite + EMA + Partial RoPE + 30ep Cosine TTT

val_bpb: pending 8×H100 run (expected ~1.02–1.05)

Techniques

| Technique | Detail |
| --- | --- |
| XSA-all | All 16 layers share a single KV set from layer 0; saves ~1.3MB vs XSA6, funding 2 extra layers |
| INT4 nibble MLP + QAT | MLP weights packed to 4-bit with STE quantization-aware training |
| INT6 attention | Attention weights quantized to INT6 |
| GPTQ-lite clip search | Per-channel clip search for better quantization |
| EMA | Exponential moving average of weights, decay=0.999 |
| Partial RoPE | RoPE applied to 25% of head dims only |
| Bigram hash | 20480-entry bigram hash embedding (dim=128) |
| Trigram hash | 10240-entry trigram hash embedding (dim=64) |
| LeakyReLU(0.5)² | MLP activation: leaky_relu(x, 0.5)² |
| SmearGate | Residual mix gating |
| Muon + WD | Muon optimizer with weight decay=0.04 |
| U-Net skips | Skip connections between encoder/decoder layers |
| 30ep Cosine TTT | Full-model AdamW on val tokens for 30 epochs, cosine LR decay from 5e-4, per-layer LR (3× MLP-out, 0.5× MLP-in) |
| Sliding window eval | stride=32 after TTT |
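
To make the INT4 nibble packing concrete, here is a minimal sketch (not the submission's actual `train_gpt.py` code) of symmetric INT4 quantization and two-nibbles-per-byte packing; a real QAT setup would use per-channel scales and a straight-through estimator in the backward pass:

```python
import numpy as np

def quantize_int4(w, scale):
    # During QAT, round() is wrapped in a straight-through estimator:
    # forward rounds, backward passes gradients through unchanged.
    return np.clip(np.round(w / scale), -8, 7).astype(np.int8)

def pack_nibbles(q):
    # Two INT4 values per byte: shift to unsigned [0, 15], pack low/high nibble.
    u = (q.reshape(-1, 2) + 8).astype(np.uint8)
    return u[:, 0] | (u[:, 1] << 4)

def unpack_nibbles(packed):
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    return np.stack([lo, hi], axis=1).reshape(-1)

w = np.array([0.30, -0.11, 0.07, 0.52], dtype=np.float32)
scale = np.abs(w).max() / 7          # per-tensor scale, for illustration only
q = quantize_int4(w, scale)
packed = pack_nibbles(q)             # 4 weights -> 2 bytes
w_hat = unpack_nibbles(packed) * scale
```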

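The bigram hash embedding from the table can be sketched as follows; the hash multiplier and padding convention are illustrative assumptions, not the submission's exact scheme:

```python
import numpy as np

BIGRAM_BUCKETS = 20480   # table size from the PR description
EMB_DIM = 128

bigram_table = np.zeros((BIGRAM_BUCKETS, EMB_DIM), dtype=np.float32)

def bigram_bucket(prev_tok, tok):
    # Cheap multiplicative hash of the (prev, current) token pair into a
    # fixed number of buckets; collisions are simply tolerated by training.
    return (prev_tok * 1000003 + tok) % BIGRAM_BUCKETS

def bigram_embed(tokens):
    # Hashed bigram embedding per position (position 0 pairs with pad token 0).
    prev = np.concatenate([[0], tokens[:-1]])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]
```

The trigram variant is the same idea with three tokens hashed into 10240 buckets of dim 64.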
Architecture

  • 16 layers, 512 dim, 8 heads / 4 KV heads, MLP mult=3.5
  • Warmdown 5000 iters, 20k total steps, train seq len=2048
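
A sketch of the partial-RoPE idea, assuming the interleaved pair layout and head_dim=64 (so 25% means the first 16 dims are rotated, the rest passed through); this is illustrative, not the submission's exact kernel:

```python
import numpy as np

def partial_rope(x, positions, rope_frac=0.25, base=10000.0):
    """Apply RoPE to the first rope_frac of head dims; leave the rest as-is.

    x: (seq_len, head_dim) query or key slice for one head.
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rope_frac)        # 16 of 64 dims at 25%
    assert rot_dim % 2 == 0
    inv_freq = base ** (-np.arange(0, rot_dim, 2) / rot_dim)
    angles = np.outer(positions, inv_freq)     # (seq_len, rot_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]
    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # 2D rotation per dim pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)
```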

Files

  • records/track_10min_16mb/2026-03-25_16L_XSAall_GPTQ_EMA_PartialRoPE_TTT/train_gpt.py
  • records/track_10min_16mb/2026-03-25_16L_XSAall_GPTQ_EMA_PartialRoPE_TTT/submission.json
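
The GPTQ-lite clip search (per-row clip-ratio grid over [0.80, 1.00] minimizing MSE, per the commit notes) can be sketched like this for symmetric INT4; the grid resolution is an assumption:

```python
import numpy as np

def clip_search_int4(W, ratios=np.linspace(0.80, 1.00, 21)):
    """Per-row clip-ratio grid search: for each output row, pick the
    clipped scale that minimizes INT4 quantization MSE."""
    W_q = np.empty_like(W)
    for i, row in enumerate(W):
        amax = np.abs(row).max()
        if amax == 0.0:
            W_q[i] = 0.0
            continue
        best_err = np.inf
        for r in ratios:
            scale = (r * amax) / 7   # symmetric INT4: levels -8..7
            deq = np.clip(np.round(row / scale), -8, 7) * scale
            err = np.mean((deq - row) ** 2)
            if err < best_err:
                best_err, W_q[i] = err, deq
    return W_q
```

Since ratio 1.00 is in the grid, the search can never do worse than plain max-abs scaling; clipping trades a few saturated outliers for finer resolution on the bulk of the row.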

Commits

- XSA4: last 4 layers share K,V projections (layers 8-11 share from 4-7)
- GPTQ-lite: per-row clip ratio grid search [0.80..1.00] minimizing MSE
- EMA decay=0.999 weight averaging
- Partial RoPE: apply RoPE to only 25% of head dims
- Bigram20480 + Trigram5120 hash embeddings
- warmdown_iters=3500
- 13 layers (vs 12), mlp_mult=3.5 (vs 3.0), warmdown=4000
- XSA4: last 4 layers share K,V projections
- GPTQ-lite clip search, EMA decay=0.999, Partial RoPE 25%
- Bigram20480 + Trigram5120, still well within 16MB limit
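
The EMA weight averaging used here (decay=0.999) is a one-liner per step; a minimal sketch over a dict of arrays:

```python
import numpy as np

def ema_update(ema, params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    for k, v in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * v
    return ema

# After n steps tracking a fixed target, the EMA has closed a
# (1 - decay**n) fraction of the gap, so decay=0.999 effectively
# averages over roughly the last 1/(1 - 0.999) = 1000 steps.
```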
Score-first TTT LoRA (rank=8) on top of 14L XSA6 base:
LeakyReLU(0.5)^2, Bigram20480, Trigram10240, warmdown5000.
Update run_modal.py to point to new submission.
…ssion

Swap score-first LoRA TTT for the simpler and more effective cosine TTT
approach from PR openai#672 (1.0781 BPB): fine-tune all model weights on val
data for 30 epochs with cosine LR decay and per-layer LR groups (3x
MLP-out, 0.5x MLP-in), followed by sliding-window stride=64 eval.
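
The cosine TTT schedule described above (decay from 5e-4 with per-layer LR multipliers) can be sketched as follows; the group names are illustrative, not the submission's actual parameter names:

```python
import math

TOTAL_EPOCHS = 30
BASE_LR = 5e-4

# Per-parameter-group LR multipliers from the PR description.
LR_MULT = {"mlp_out": 3.0, "mlp_in": 0.5, "default": 1.0}

def cosine_lr(epoch, total_epochs=TOTAL_EPOCHS, base_lr=BASE_LR):
    """Cosine decay from base_lr at epoch 0 to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

def group_lr(epoch, group):
    """LR for one parameter group at the given TTT epoch."""
    return cosine_lr(epoch) * LR_MULT.get(group, 1.0)
```

In practice each multiplier maps to one AdamW parameter group whose `lr` is reset every epoch from this schedule.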
