[Notable Non-Record Submission] 1.1090 BPB - 74.3M Ternary U-Net Transformer (100k steps/3h)#923
…Record Leaderboard.
Really interesting scaling data — seeing the ternary architecture go from 1.1535 (10 min) to 1.1090 (100K steps) is exactly the kind of research this competition needs more of. The BF16 vs FP16 scale storage finding is a great catch too — that 0.039 BPB roundtrip gap at 150K steps with FP16 would've been brutal to debug without this data. The zero_frac drop (0.236 → 0.181) with extended training is fascinating — the model actively learning to use more of its ternary capacity over time. Curious whether you've looked at whether that trend continues past 100K or if it plateaus. One thing worth noting for others reading: the 1.1090 number is from a 3-hour unconstrained run, not the 10-minute track. The valid 10-min submission is #920 at 1.1539. Still, the architecture itself is one of the more creative entries in the competition. Ternary weights with U-Net skips is a direction nobody else is exploring. Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
@MatoTeziTanka yep, the 100k steps/3h is part of the title, and the "notable non-record" at the beginning of the title indicates that this is for the other leaderboard, not the main one. Thanks for your reply!
…_8192BPE_YaRN_NeoMuon_v2 directory as it is part of another branch/PR.
…10L_UNet_INT4FP8QAT_Brotli directory as it is part of another branch/PR.
Notable: 1.1090 BPB - 74.3M Ternary U-Net Transformer (100k steps, unconstrained)
Extended training of #640 / #641 / #920 config with SmearGate enabled
val_bpb: 1.1090 (sliding, stride=16, T=0.90) | 15.95 MB artifact | 8xH100 SXM, ~3h
Results
Extended training reduces zero_frac (0.236 -> 0.181) as the model utilises more of its ternary weight capacity. RT gap grows slightly (0.0006 -> 0.0022) due to the shrinkage correction amplification at longer training, but remains well-controlled with BF16 scale storage.
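The zero_frac metric can be made concrete with a small sketch (illustrative, not the submission's code): it is the fraction of ternary weights quantized to exactly zero. The helper name is my own.

```python
import numpy as np

def zero_frac(w_ternary: np.ndarray) -> float:
    """Fraction of ternary {-1, 0, +1} weights that are exactly zero."""
    return float(np.mean(w_ternary == 0))

# Toy 8-element ternary tensor with two zeros -> zero_frac = 0.25
w = np.array([-1, 0, 1, 1, 0, -1, 1, -1])
print(zero_frac(w))  # 0.25
```

A falling zero_frac means fewer weights are snapped to zero, i.e. more of the {-1, +1} capacity is actually in use.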
Why BF16 scales matter for extended training
Ternary dequantization applies a shrinkage correction `1/(1 - zero_frac)` to compensate for zeros reducing the group mean. FP16 scale storage introduces rounding error that gets multiplied by this factor. As training progresses and zero_frac changes, the amplification grows. The practical impact of FP16 vs BF16 scale storage at different training lengths:
Without the changes applied, this extended run would have produced a 0.03+ BPB roundtrip gap, making the artifact unusable. The changes cost zero bytes and keep the gap at 0.0022 even at 100k steps.
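A minimal numeric sketch of why the storage format matters, under one assumed mechanism (the PR doesn't spell it out): per-group scales can be tiny, and FP16's narrow 5-bit exponent pushes them into its subnormal range, where relative rounding error blows up, while BF16 keeps FP32's 8-bit exponent. The shrinkage correction then multiplies whatever error storage introduced. The `1e-7` scale value and the `to_bf16` helper are illustrative assumptions.

```python
import numpy as np

def to_bf16(x) -> np.ndarray:
    # Emulate BF16 storage by truncating the low 16 mantissa bits of FP32.
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

scale = np.float32(1e-7)           # a hypothetical tiny per-group scale
amp = 1.0 / (1.0 - 0.181)          # shrinkage factor at zero_frac = 0.181

# Roundtrip relative error of each storage format, amplified by the
# shrinkage correction applied at dequantization time.
err_fp16 = abs(float(np.float16(scale)) - float(scale)) / float(scale) * amp
err_bf16 = abs(float(to_bf16(scale)[0]) - float(scale)) / float(scale) * amp
print(f"amplified rel. error  fp16={err_fp16:.3f}  bf16={err_bf16:.5f}")
# FP16 lands in its subnormal range here, so its error is orders of
# magnitude larger than BF16's.
```

For mid-range scales FP16's extra mantissa bits would actually win; the dynamic-range failure mode sketched here is one plausible reading of why BF16 is the safer choice for scale storage.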
Changes from #940
- SmearGate enabled (`SMEAR=1`): learnable per-block gating for residual smoothing. Adds minimal params, provides small quality benefit at extended training.
- Unconstrained runtime (`MAX_WALLCLOCK_SECONDS=0`)

Architecture, quantisation, compression, and all other hyperparameters identical to #940.
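Since the SmearGate code isn't shown in this PR, here is a hypothetical sketch of what "learnable per-block gating for residual smoothing" could look like: one learnable scalar `g` per block, squashed through a sigmoid and used to scale the residual branch. The function names and signature are my assumptions, not the submission's implementation.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate_residual(x: np.ndarray, block_out: np.ndarray, g: float) -> np.ndarray:
    # Hypothetical gated residual: g is a learned per-block scalar;
    # sigmoid(g) smooths how much of the block's output enters the stream.
    return x + sigmoid(g) * block_out

x = np.ones(4)            # residual stream
h = np.full(4, 2.0)       # block output
print(smear_gate_residual(x, h, 0.0))  # g=0 -> gate 0.5 -> x + 0.5*h
```

A scheme like this adds only one parameter per block, consistent with the "minimal params" claim above.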
Setup and Run