feat: extend vocabulary from 36,495 to 38,360 tokens (+5.1%) #3
Open
claylo wants to merge 1 commit into rohangpta:main from
Conversation
Three-phase sandwich-method re-probing of the Claude count_tokens API using async HTTP/2 at ~2K RPM (Tier 2):

- Phase 1: Re-checked 275,351 previously-tested candidates, recovering 748 tokens misclassified by the pre-sandwich baseline
- Phase 2: Generated 79,535 new candidates from case variants, space-prefix variants, and tiktoken cl100k/o200k cross-reference; found 1,038 new tokens
- Phase 3: Probed 130,234 candidates from TextMate grammar keywords, Unicode block sweeps, and emoji sequences; found 79 new tokens

Greedy longest-match accuracy: 98.5% across 7 test files (24.6K API tokens). Undercounting (>100%) occurs only on markdown table syntax due to BPE merge-order divergence, not a vocabulary issue.

Includes Python tooling for probing, candidate generation, and accuracy measurement. All 484,544 probed candidates are tracked in the vocab.json checked list to prevent redundant API calls.
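The sandwich probe described above can be sketched roughly as follows. Here `count_tokens` is a toy greedy stand-in for the real Claude count_tokens API (the real call goes over HTTP), and `TOY_VOCAB` plus the helper names are illustrative assumptions, not the repo's actual tooling:

```python
# Toy stand-in for the Claude count_tokens API: a greedy longest-match
# tokenizer over a tiny invented vocabulary (illustration only).
TOY_VOCAB = {"§", "hello", "hel", "lo"}

def count_tokens(text: str) -> int:
    count, i = 0, 0
    while i < len(text):
        # Take the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                count, i = count + 1, j
                break
        else:
            count, i = count + 1, i + 1  # unknown character: one token
    return count

def sandwich_count(candidate: str) -> int:
    # count("§" + text + "§") - count("§§") isolates the candidate from
    # token merges across its boundaries with neighbouring text.
    return count_tokens("§" + candidate + "§") - count_tokens("§§")

def is_single_token(candidate: str) -> bool:
    """A candidate is in the vocabulary iff it sandwich-counts to 1."""
    return sandwich_count(candidate) == 1
```

The point of the "§" bread slices is that a bare probe like `count(candidate)` can under-report when the candidate merges with an implicit boundary; subtracting `count("§§")` cancels the bread out.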
rohangpta added a commit that referenced this pull request on Feb 28, 2026

Owner

Thank you for the patch! I cherry-picked the vocab into main: 5aec316
eordano added a commit to eordano/tokencount that referenced this pull request on Feb 28, 2026
Source: rohangpta/ctoc#3 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Methodology
Three phases of probing against the count_tokens API using the sandwich method (count("§" + text + "§") - count("§§")), with async HTTP/2 at ~2K RPM. All probed candidates are recorded in the checked list.

Accuracy
Greedy longest-match accuracy measured across 7 files (24.6K API tokens): 98.5%
Overcounting (the safe direction for budget enforcement) on all file types except markdown tables, where BPE merge-order divergence causes slight undercounting (101.1% on table-heavy content). This is an inherent property of greedy-vs-BPE matching, not a vocabulary issue; it is documented in REPORT-ADDENDUM.md with a mitigation strategy for downstream consumers.
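The merge-order divergence can be reproduced with a tiny invented vocabulary and merge table (illustrative only, not the actual Claude BPE): greedy longest-match collapses "abc" to one token, while BPE's rank order merges "bc" first and stops at two tokens, so the greedy estimate undercounts.

```python
# Invented toy vocabulary and merge ranks (not the real Claude tokenizer).
VOCAB = {"a", "b", "c", "ab", "bc", "abc"}
MERGES = [("b", "c"), ("a", "b"), ("ab", "c")]  # lower index = higher priority

def greedy_count(text: str) -> int:
    """Longest-match count, as used for the accuracy estimate."""
    count, i = 0, 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                count, i = count + 1, j
                break
        else:
            count, i = count + 1, i + 1
    return count

def bpe_count(text: str) -> int:
    """Apply merges strictly in rank order, as real BPE does."""
    toks = list(text)
    while True:
        best = min(
            ((MERGES.index(p), i)
             for i, p in enumerate(zip(toks, toks[1:]))
             if p in MERGES),
            default=None,
        )
        if best is None:
            return len(toks)
        i = best[1]
        toks[i:i + 2] = [toks[i] + toks[i + 1]]
```

On "abc", `greedy_count` returns 1 (it sees "abc" in the vocabulary) while `bpe_count` returns 2 ("bc" merges first, and no "a"+"bc" merge exists), which is the undercount direction reported for table-heavy markdown.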
Test plan
All 484,544 probed candidates are tracked in the vocab.json checked list to prevent redundant probing
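A rough sketch of how Phase 2-style variants could be generated and screened against the checked list before spending API calls. The vocab.json layout ({"checked": [...]}) and every function name here are assumptions for illustration, not the repo's actual schema or tooling:

```python
import json

def case_and_space_variants(token: str) -> set[str]:
    """Phase 2-style candidates: case variants plus space-prefixed forms."""
    forms = {token, token.lower(), token.upper(), token.capitalize()}
    forms |= {" " + f for f in list(forms)}  # space-prefix variants
    forms.discard(token)  # keep only *new* candidates
    return forms

def load_checked(path: str = "vocab.json") -> set[str]:
    # Assumed layout: {"checked": ["tok1", "tok2", ...]}
    with open(path) as fh:
        return set(json.load(fh).get("checked", []))

def unprobed(candidates, checked):
    """Skip anything already sent to the API (no redundant calls)."""
    return sorted(c for c in candidates if c not in checked)
```

Filtering against the checked set is what keeps re-runs cheap: of the 484,544 candidates ever probed, only genuinely new strings trigger fresh count_tokens requests.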