feat: compile-time trie tokenizer, mmap file reads #4
Open
eordano wants to merge 1 commit into rohangpta:main from
Conversation
gen_vocab.py builds a double-array tree from the vocabulary at code-gen
time. At runtime there is nothing to initialize — just two static
uint32 arrays and a greedy longest-match loop. File reads use mmap
instead of ifstream.
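The runtime half of that design can be sketched as below. This is not the PR's actual code; it is a minimal illustration of a double-array trie plus greedy longest-match loop, with hand-built tables encoding a toy vocabulary `{"a": 0, "ab": 1, "b": 2}` standing in for what `gen_vocab.py` would emit for the real 38k-token vocabulary (the real arrays would be `uint32`; `int32` is used here so `-1` can serve as a sentinel).

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Double-array trie: from state s on byte c, the candidate next state is
// base[s] + c, and the transition is valid iff check[next] == s.
// token[s] holds a token id when state s is a terminal, else -1.
struct Tables {
    int32_t base[128];
    int32_t check[128];
    int32_t token[128];
};

// Hand-built tables for the toy vocab {"a": 0, "ab": 1, "b": 2}.
// In the real design these are static arrays generated at code-gen time.
static const Tables T = [] {
    Tables t{};
    for (int i = 0; i < 128; ++i) { t.check[i] = -1; t.token[i] = -1; }
    t.base[0] = 1;                        // root: 'a' -> 1+97=98, 'b' -> 1+98=99
    t.check[98]  = 0;  t.token[98]  = 0;  // state 98  = "a"
    t.check[99]  = 0;  t.token[99]  = 2;  // state 99  = "b"
    t.base[98] = 3;                       // from "a": 'b' -> 3+98=101
    t.check[101] = 98; t.token[101] = 1;  // state 101 = "ab"
    return t;
}();

// Greedy longest match: from each position, walk the trie as far as the
// input allows, remembering the last terminal state seen, then emit it.
std::vector<int32_t> tokenize(std::string_view text) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        int32_t state = 0, best_token = -1;
        size_t best_len = 0;
        for (size_t i = pos; i < text.size(); ++i) {
            int32_t next = T.base[state] + static_cast<unsigned char>(text[i]);
            if (next < 0 || next >= 128 || T.check[next] != state) break;
            state = next;
            if (T.token[state] >= 0) {
                best_token = T.token[state];
                best_len = i - pos + 1;
            }
        }
        if (best_token < 0) { ++pos; continue; }  // no match: skip byte
                                                  // (real code may use a byte fallback)
        out.push_back(best_token);
        pos += best_len;
    }
    return out;
}
```

Because the tables are plain static arrays, there is nothing to construct at startup; each transition is two array reads, which is where the cache-friendliness comes from.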
Benchmark (1 MB of War and Peace, hyperfine -N, 500 runs):

| Input | Optimized | Upstream | Speedup |
| ----- | --------- | -------- | ------- |
| 1 B   | 260 µs    | 91 ms    | 350×    |
| 1 MB  | 5.6 ms    | 258 ms   | 46×     |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**Owner**

Nice website and thank you for the patch! I will likely accept this.
Added a double-array trie to replace the runtime heap trie. It has far less initialization time than the vector-of-maps version, since the arrays are already in the memory-mapped binary, and the flat array lookups are more cache-friendly too.

I added a few other optimizations, trying to squeeze the most juice we can: mmap file reads save one memory copy, and the build uses `-c opt` for -O2.

Benchmarked with `hyperfine -N --warmup 10 --runs 500` on https://www.gutenberg.org/cache/epub/2600/pg2600.txt on an average work laptop:
The gap at small inputs is explained by startup cost: upstream spends ~121 ms building a heap trie from the 38k tokens, while this branch is ready to tokenize as soon as the binary loads. At 1 MB, where tokenization dominates, mmap and better cache locality still yield a ~39× speedup.
The mmap read alone provides a ~2.4× speedup over ifstream (measured with the new trie in both cases):
Note: I didn't find an easy way to run Bazel 9 (I'm on 8), so I didn't update MODULE.bazel.lock to avoid noise from downgrading.
Thanks for this project! It inspired me to build an online visualization for tokenizing things using it at https://github.com/eordano/tokencount (live at https://tokencount.eordano.com/)