
feat: extend vocabulary from 36,495 to 38,360 tokens (+5.1%)#3

Open
claylo wants to merge 1 commit into rohangpta:main from claylo:extended-vocab

Conversation

claylo commented Feb 12, 2026

Summary

  • Extends the Claude 3+ token vocabulary from 36,495 → 38,360 (+5.1%) using three phases of sandwich-method API probing
  • Adds Python tooling for vocabulary probing, candidate generation, and accuracy measurement
  • Documents methodology, findings, and a key accuracy insight in REPORT-ADDENDUM.md

Methodology

Three phases of probing against the count_tokens API using the sandwich method (count("§" + text + "§") - count("§§")), with async HTTP/2 at ~2K RPM:

| Phase | Candidates | New tokens | Hit rate | Time |
| --- | --- | --- | --- | --- |
| Re-check checked list | 275,351 | 748 | 0.27% | 141 min |
| Case/space/tiktoken xref | 79,535 | 1,038 | 1.3% | 41 min |
| Keywords/Unicode/emoji | 130,234 | 79 | 0.06% | 67 min |
| **Total** | 485,120 | 1,865 | 0.38% | 249 min |
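The sandwich probe described above can be sketched as follows. This is a minimal illustration, not the PR's actual tooling: `count_tokens` stands in for a call to the count_tokens API, and the wrapper names are hypothetical.

```python
from typing import Callable

SENTINEL = "§"

def sandwich_token_count(text: str, count_tokens: Callable[[str], int]) -> int:
    """Count the tokens in `text` via the sandwich method.

    Wrapping the probe in sentinel characters isolates it from merges at
    the prompt boundary; subtracting count("§§") removes the sentinels'
    own contribution, leaving just the token count of `text`.
    """
    return count_tokens(SENTINEL + text + SENTINEL) - count_tokens(SENTINEL + SENTINEL)

def is_single_token(text: str, count_tokens: Callable[[str], int]) -> bool:
    """A candidate belongs in the vocabulary iff its sandwiched count is 1."""
    return sandwich_token_count(text, count_tokens) == 1
```

In the PR these probes are batched over async HTTP/2 at roughly 2K requests per minute; the sketch above shows only the per-candidate arithmetic.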

Accuracy

Greedy longest-match accuracy measured across 7 files (24.6K API tokens): 98.5%

The greedy tokenizer overcounts (the safe direction for budget enforcement) on all file types except markdown tables, where BPE merge-order divergence causes slight undercounting (101.1% on table-heavy content). This is an inherent property of greedy-vs-BPE matching, not a vocabulary issue; it is documented in REPORT-ADDENDUM.md along with a mitigation strategy for downstream consumers.
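Greedy longest-match counting can be sketched like this (a simplified illustration, not the PR's measurement tooling). Divergence from the API arises because greedy matching always takes the longest vocabulary hit, whereas the real BPE tokenizer follows merge order and may split the same span differently.

```python
def greedy_count(text: str, vocab: set[str], max_len: int = 16) -> int:
    """Approximate a BPE token count by repeated longest-prefix matching.

    At each position, take the longest vocabulary entry that matches;
    any character with no match is counted as one fallback token. This
    usually matches or slightly exceeds the true BPE count, but can
    diverge where merge order beats longest-match.
    """
    count = 0
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                i = j
                break
        else:
            i += 1  # no vocabulary match: fall back to a single character
        count += 1
    return count
```

Accuracy in the PR is then the ratio of this greedy count to the count reported by the API over the same files.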

Test plan

  • All 38,360 tokens individually verified as single tokens via sandwich method
  • Accuracy measured on C++ (99.0%), Markdown (98.5%), and Python (96.7-97.4%) files
  • Non-hit candidates added to the checked list to prevent redundant probing
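Phase 2's candidate generation (case and space-prefix variants of known tokens) can be sketched roughly as below. The function name is hypothetical, and the tiktoken cl100k/o200k cross-referencing the PR also performs is omitted here.

```python
def expand_candidates(token: str) -> set[str]:
    """Generate case and space-prefix variants of a known token.

    If "hello" is confirmed as a single token, then " hello", "Hello",
    "HELLO", etc. are plausible vocabulary members worth probing, since
    BPE vocabularies commonly contain space-prefixed and cased forms.
    """
    base = token.lstrip(" ")
    forms = {base, base.lower(), base.upper(), base.capitalize()}
    variants = set()
    for form in forms:
        variants.add(form)
        variants.add(" " + form)  # space-prefixed variant, common in BPE vocabs
    variants.discard(token)  # the original token needs no re-probe
    return variants
```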

Three-phase sandwich-method re-probing of the Claude count_tokens API
using async HTTP/2 at ~2K RPM (Tier 2):

- Phase 1: Re-checked 275,351 previously-tested candidates, recovering
  748 tokens misclassified by the pre-sandwich baseline
- Phase 2: Generated 79,535 new candidates from case variants,
  space-prefix variants, and tiktoken cl100k/o200k cross-reference;
  found 1,038 new tokens
- Phase 3: Probed 130,234 candidates from TextMate grammar keywords,
  Unicode block sweeps, and emoji sequences; found 79 new tokens

Greedy longest-match accuracy: 98.5% across 7 test files (24.6K API
tokens). Undercounting (>100%) occurs only on markdown table syntax
due to BPE merge-order divergence — not a vocabulary issue.

Includes Python tooling for probing, candidate generation, and
accuracy measurement. All 484,544 probed candidates tracked in
vocab.json checked list to prevent redundant API calls.
rohangpta added a commit that referenced this pull request Feb 28, 2026
rohangpta (Owner) commented:

Thank you for the patch! I cherry picked the vocab into main: 5aec316

eordano added a commit to eordano/tokencount that referenced this pull request Feb 28, 2026
Source: rohangpta/ctoc#3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
