
feat: extend vocabulary from 36,495 to 38,360 tokens (+5.1%)#3

Open
claylo wants to merge 1 commit into rohangpta:main from claylo:extended-vocab

Conversation

claylo commented Feb 12, 2026

Summary

  • Extends the Claude 3+ token vocabulary from 36,495 → 38,360 (+5.1%) using three phases of sandwich-method API probing
  • Adds Python tooling for vocabulary probing, candidate generation, and accuracy measurement
  • Documents methodology, findings, and a key accuracy insight in REPORT-ADDENDUM.md

Methodology

Three phases of probing against the count_tokens API using the sandwich method (count("§" + text + "§") - count("§§")), with async HTTP/2 at ~2K RPM:

| Phase | Candidates | New tokens | Hit rate | Time |
| --- | --- | --- | --- | --- |
| Re-check checked list | 275,351 | 748 | 0.27% | 141 min |
| Case/space/tiktoken xref | 79,535 | 1,038 | 1.3% | 41 min |
| Keywords/Unicode/emoji | 130,234 | 79 | 0.06% | 67 min |
| **Total** | 485,120 | 1,865 | 0.38% | 249 min |
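The sandwich probe described above can be sketched as follows. This is a minimal illustration, not the PR's actual tooling: `count_tokens` stands in for a call to the count_tokens API, and the wrapper names are hypothetical.

```python
from typing import Callable

SENTINEL = "§"

def sandwich_token_count(text: str, count_tokens: Callable[[str], int]) -> int:
    """Count the tokens in `text` via the sandwich method.

    Wrapping the probe in sentinel characters isolates it from merges at
    the prompt boundary; subtracting count("§§") removes the sentinels'
    own contribution, leaving just the token count of `text`.
    """
    return count_tokens(SENTINEL + text + SENTINEL) - count_tokens(SENTINEL + SENTINEL)

def is_single_token(text: str, count_tokens: Callable[[str], int]) -> bool:
    """A candidate belongs in the vocabulary iff its sandwiched count is 1."""
    return sandwich_token_count(text, count_tokens) == 1
```

In the PR these probes are batched over async HTTP/2 at roughly 2K requests per minute; the sketch above shows only the per-candidate arithmetic.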

Accuracy

Greedy longest-match accuracy measured across 7 files (24.6K API tokens): 98.5%

The greedy tokenizer overcounts (the safe direction for budget enforcement) on all file types except markdown tables, where BPE merge-order divergence causes slight undercounting (101.1% on table-heavy content). This is an inherent property of greedy-vs-BPE matching, not a vocabulary issue; it is documented in REPORT-ADDENDUM.md along with a mitigation strategy for downstream consumers.
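Greedy longest-match counting can be sketched like this (a simplified illustration, not the PR's measurement tooling). Divergence from the API arises because greedy matching always takes the longest vocabulary hit, whereas the real BPE tokenizer follows merge order and may split the same span differently.

```python
def greedy_count(text: str, vocab: set[str], max_len: int = 16) -> int:
    """Approximate a BPE token count by repeated longest-prefix matching.

    At each position, take the longest vocabulary entry that matches;
    any character with no match is counted as one fallback token. This
    usually matches or slightly exceeds the true BPE count, but can
    diverge where merge order beats longest-match.
    """
    count = 0
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                i = j
                break
        else:
            i += 1  # no vocabulary match: fall back to a single character
        count += 1
    return count
```

Accuracy in the PR is then the ratio of this greedy count to the count reported by the API over the same files.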

Test plan

  • All 38,360 tokens individually verified as single tokens via sandwich method
  • Accuracy measured on C++ (99.0%), Markdown (98.5%), and Python (96.7-97.4%) files
  • Non-hit candidates added to the checked list to prevent redundant probing
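Phase 2's candidate generation (case and space-prefix variants of known tokens) can be sketched roughly as below. The function name is hypothetical, and the tiktoken cl100k/o200k cross-referencing the PR also performs is omitted here.

```python
def expand_candidates(token: str) -> set[str]:
    """Generate case and space-prefix variants of a known token.

    If "hello" is confirmed as a single token, then " hello", "Hello",
    "HELLO", etc. are plausible vocabulary members worth probing, since
    BPE vocabularies commonly contain space-prefixed and cased forms.
    """
    base = token.lstrip(" ")
    forms = {base, base.lower(), base.upper(), base.capitalize()}
    variants = set()
    for form in forms:
        variants.add(form)
        variants.add(" " + form)  # space-prefixed variant, common in BPE vocabs
    variants.discard(token)  # the original token needs no re-probe
    return variants
```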

Three-phase sandwich-method re-probing of the Claude count_tokens API
using async HTTP/2 at ~2K RPM (Tier 2):

- Phase 1: Re-checked 275,351 previously-tested candidates, recovering
  748 tokens misclassified by the pre-sandwich baseline
- Phase 2: Generated 79,535 new candidates from case variants,
  space-prefix variants, and tiktoken cl100k/o200k cross-reference;
  found 1,038 new tokens
- Phase 3: Probed 130,234 candidates from TextMate grammar keywords,
  Unicode block sweeps, and emoji sequences; found 79 new tokens

Greedy longest-match accuracy: 98.5% across 7 test files (24.6K API
tokens). Undercounting (>100%) occurs only on markdown table syntax
due to BPE merge-order divergence — not a vocabulary issue.

Includes Python tooling for probing, candidate generation, and
accuracy measurement. All 484,544 probed candidates tracked in
vocab.json checked list to prevent redundant API calls.
rohangpta added a commit that referenced this pull request Feb 28, 2026
rohangpta (Owner) commented:

Thank you for the patch! I cherry picked the vocab into main: 5aec316

eordano added a commit to eordano/tokencount that referenced this pull request Feb 28, 2026
Source: rohangpta/ctoc#3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
