Skip to content

Record: Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed)#948

Open
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:submission/nathanmaine-dirichlet-ngram
Open

Record: Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed)#948
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:submission/nathanmaine-dirichlet-ngram

Conversation

@dentity007
Copy link
Copy Markdown

Two-Level Dirichlet Posterior + Per-Order OBCL + Phrase Cache

val_bpb: 0.11556 (3-seed mean, std 0.0000057) | ~15.1 MB | 8xH100 SXM

3-seed validation

Seed Val BPB Eval Time Artifact bytes
1337 0.11555061 419s 15,077,877
42 0.11556435 370s 15,077,877
2025 0.11555875 359s 15,077,877
Mean 0.11556 (std 0.0000057)

Techniques

  • Two-level Dirichlet-Multinomial posterior mixing (neural → n-gram → phrase)
  • Per-order OBCL concentrations: [50.0, 50.0, 6.95, 2.98, 2.05, 2.05, 2.05, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86]
  • Phrase suffix matching at probe lengths [20, 16] with Dirichlet concentration 1.0
  • 15-gram backoff (orders 2-15, 4M hash buckets)
  • Complementary training (alpha=0.50, orders 2-5)
  • EBLS architecture (3 shared x 3 loops + 2 unique = 11L)
  • GPTQ int6 + LZMA compression
  • EMA 0.997 + SWA weight averaging

Compliance

  • Training: 560s on 8xH100 (within 600s)
  • Eval: 419s worst case (within 600s)
  • Artifact: 15,077,877 bytes (within 16,000,000)
  • All caches strictly backward-looking (causal)
  • Score-first evaluation
  • No training data accessed during evaluation

Credits

Built on the community's work:

@dentity007 dentity007 changed the title Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed) Record: Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed) Mar 27, 2026
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Record: Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed)

BPB: 0.11556 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA 4be498f52598, file records/track_10min_16mb/2026-03-27_Dirichlet_Ngram_Phrase_Cache/train_gpt.py):

The n-gram lookup key at line 875 is constructed by XOR-ing the target token into the hash:

line 875: full_key = <hash> ^ (tgt_np * ng_primes[...]) & mask

This matches the full_key = ((ctx_hash ^ (target * primes[k])) & mask) construction that @valerio-oai ruled disallowed on PR #779 (comment 4145781641, 2026-03-27). Per the mechanism explanation, hashing the target token into the lookup key only reweights the correct token — in the hash-collision limit this drives P(correct) → 1 regardless of the data, which inflates the reported BPB without producing real compression.

Per Issue #1017 condition 1, p_t may depend only on the artifact and x_1...x_{t-1}. Because the lookup key at line 875 is a function of the target token, the count read at scoring position t depends on x_t itself — which is the core violation the #779 ruling targets.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=87629 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.


Reviewed by @MatoTeziTankaThe Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=87629 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Proactive compliance documentation while awaiting maintainer ruling on
hash-based eval-time n-gram caches per Issue openai#402, Issue openai#677, and PR openai#886.

No code changes. Just README documenting:
- The open dispute (valerio-oai leaning legal, abaybektursun openai#886 disputing
  via hash collision density, Robert-Sneiderman openai#900 defending Dirichlet
  formula validity)
- What this submission does (backward-looking causal n-gram cache with
  Dirichlet-Multinomial smoothing)
- What it does NOT do (no training on val_tokens, no backward passes,
  model frozen during eval)
- Explicit statement that I asked on Issue openai#402 on April 2 and will
  retract if ruled invalid

Distinct from the TTT-on-val class of violations I retracted in PR openai#1193,
PR openai#406, and PR openai#1127.
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Same approach as PR openai#948 compliance note. This submission extends openai#948
with order-20 backoff but uses the same eval-time hash n-gram cache
architecture under the same community dispute (Issue openai#402, Issue openai#677,
PR openai#886, PR openai#900).

No code changes. README documents:
- The open dispute and relevant threads
- What this submission does (causal backward-looking cache, Dirichlet
  smoothing, model frozen)
- What it does NOT do (no training on val_tokens, no backward passes)
- Distinct from the TTT-on-val class I retracted in openai#1193, openai#406, openai#1127
- Will retract if maintainers rule the class invalid
@dentity007
Copy link
Copy Markdown
Author

Compliance note added (April 13, 2026)

Pushed a README update (commit 2694ae5) with a proactive compliance section documenting the open dispute around hash-based eval-time n-gram caches.

Short version: this submission is NOT in the TTT-on-val class (@MatoTeziTanka just flagged my PR #1193, #406, and #1127 for that and I have retracted all three). This is a different class where the neural model stays frozen but a hash-based n-gram cache is built causally from already-scored tokens and blended with the model softmax via Dirichlet-Multinomial smoothing.

The dispute is still open across Issue #402, Issue #677, PR #886 (@abaybektursun arguing hash collisions invalidate the input counts) and PR #900 (@Robert-Sneiderman arguing the Dirichlet formula normalizes regardless). @valerio-oai said 'leaning toward legal' on 2026-03-26 but no final ruling.

I asked about this submission specifically on Issue #402 on 2026-04-02 and there has been no maintainer response since. I am leaving this PR open pending an official ruling. If the class is ruled invalid, I will retract and close.

See the updated README for the full writeup of what this submission does and does not do, cross-referenced with the dispute threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants