
feat: compile-time trie tokenizer, mmap file reads#4

Open
eordano wants to merge 1 commit into rohangpta:main from eordano:main

Conversation

@eordano eordano commented Mar 1, 2026

Added a double-array trie to replace the runtime heap trie. It needs essentially no initialization compared with the map-of-vectors approach, since the arrays live directly in the memory-mapped binary, and the array lookups are more cache-friendly too. I added some other optimizations to squeeze out the most juice we can (mmap file reads save one memory copy, and -c opt builds with -O2). Benchmarked with hyperfine -N --warmup 10 --runs 500 on https://www.gutenberg.org/cache/epub/2600/pg2600.txt on an average work laptop:

                   optimized     main
  1 B              668 µs        121.1 ms   (181x faster)
  100 B            697 µs        121.5 ms   (174x faster)
  10 KB            778 µs        124.4 ms   (160x faster)
  1 MB             8.7 ms        336.8 ms   (38.7x faster)

The gap at small inputs is dominated by startup: main spends ~121 ms building a heap trie from the 38k tokens, while this branch is ready to go as soon as the binary loads. At 1 MB, where tokenization itself dominates, mmap and better cache locality still yield a ~39x speedup.
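Not from the patch itself, but a minimal sketch of the double-array idea, under the usual formulation: a transition from state `s` on byte `c` lands at slot `t = base[s] + c` and is valid iff `check[t] == s`, so a lookup is two array reads per byte with no pointer chasing. The builder here (plain trie plus brute-force base search) and all names (`DoubleArrayTrie`, `longest`) are hypothetical; the real arrays are emitted at code-gen time by gen_vocab.py.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a double-array trie.
// Transition: t = base[s] + c, valid iff check[t] == s.
struct DoubleArrayTrie {
    std::vector<int32_t> base, check;
    std::vector<int32_t> value;  // token id at accepting slots, -1 otherwise

    explicit DoubleArrayTrie(const std::vector<std::string>& vocab) {
        // Build an ordinary pointer trie first, then relocate it into arrays.
        struct Node { std::map<unsigned char, int> kids; int val = -1; };
        std::vector<Node> nodes(1);
        for (size_t id = 0; id < vocab.size(); ++id) {
            int s = 0;
            for (unsigned char c : vocab[id]) {
                auto it = nodes[s].kids.find(c);
                if (it == nodes[s].kids.end()) {
                    nodes[s].kids[c] = (int)nodes.size();
                    nodes.emplace_back();
                    s = (int)nodes.size() - 1;
                } else {
                    s = it->second;
                }
            }
            nodes[s].val = (int)id;
        }
        // Generously sized arrays; a real builder packs these tightly.
        size_t cap = nodes.size() * 257 + 257;
        base.assign(cap, 0); check.assign(cap, -1); value.assign(cap, -1);
        std::vector<std::pair<int, int>> queue{{0, 0}};  // (trie node, array slot)
        check[0] = 0; value[0] = nodes[0].val;           // slot 0 is the root
        for (size_t q = 0; q < queue.size(); ++q) {
            auto [n, s] = queue[q];
            if (nodes[n].kids.empty()) continue;
            // Brute-force: smallest base where every child slot is still free.
            int b = 1;
            for (;; ++b) {
                bool ok = true;
                for (auto& [c, kid] : nodes[n].kids)
                    if (check[b + c] != -1) { ok = false; break; }
                if (ok) break;
            }
            base[s] = b;
            for (auto& [c, kid] : nodes[n].kids) {
                int t = b + c;
                check[t] = s;                 // claim the slot for parent s
                value[t] = nodes[kid].val;
                queue.push_back({kid, t});
            }
        }
    }

    // Greedy longest match starting at text[pos]: (token id, matched length).
    std::pair<int, int> longest(const std::string& text, size_t pos) const {
        int s = 0, best_id = -1, best_len = 0;
        for (size_t i = pos; i < text.size(); ++i) {
            int t = base[s] + (unsigned char)text[i];
            if (check[t] != s) break;  // no transition: stop extending
            s = t;
            if (value[s] != -1) { best_id = value[s]; best_len = (int)(i - pos + 1); }
        }
        return {best_id, best_len};
    }
};
```

The inner loop is why this wins on cache behavior: each input byte costs one `base[]` read and one `check[]` read into flat, contiguous arrays, versus a hash or map probe per byte in a heap trie.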

The mmap alone gives a ~2.4x speedup over ifstream (measured with the new trie):

             mmap        ifstream
  1 MB       8.5 ms      20.5 ms    (2.4x faster)

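For reference, the shape of the mmap read path: the kernel maps the file pages directly into the address space, so the tokenizer scans the page cache in place instead of copying bytes through an `ifstream` buffer. This is a hedged POSIX sketch, not the patch's code; the `map_file` name and out-parameters are made up for the example, and a real caller would also `munmap`/`close` when done.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cstdio>
#include <string_view>

// Sketch (POSIX only): map a file read-only and expose it as a string_view.
// Returns an empty view on failure. Caller owns fd_out and the mapping.
std::string_view map_file(const char* path, int& fd_out, size_t& len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    struct stat st{};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return {}; }
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return {}; }
    fd_out = fd;
    len_out = (size_t)st.st_size;
    return std::string_view((const char*)p, (size_t)st.st_size);
}
```

Compared with reading into a `std::string`, this skips one full copy of the input, which is where the ~2.4x at 1 MB plausibly comes from.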
Note: I didn't find an easy way to run Bazel 9 (I'm on 8), so I didn't update MODULE.bazel.lock to avoid noise from downgrading.

Thanks for this project! It inspired me to build an online visualization for tokenizing text with it at https://github.com/eordano/tokencount (live at https://tokencount.eordano.com/)

gen_vocab.py builds a double-array tree from the vocabulary at code-gen
time. At runtime there is nothing to initialize — just two static
uint32 arrays and a greedy longest-match loop. File reads use mmap
instead of ifstream.
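The greedy longest-match loop mentioned above can be sketched as follows. To keep the example self-contained it probes a set of vocab strings instead of the generated trie arrays; `greedy_tokenize` and `max_token_len` are hypothetical names, and the fallback of emitting a single unmatched byte is an assumption about unknown-byte handling.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Sketch of greedy longest-match tokenization: at each position, take the
// longest vocabulary entry that matches; if nothing matches, emit one byte.
// The set probe stands in for the double-array trie's longest-match lookup.
std::vector<std::string> greedy_tokenize(
    std::string_view text,
    const std::unordered_set<std::string>& vocab,
    size_t max_token_len) {
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best = 0;
        size_t cap = std::min(max_token_len, text.size() - pos);
        for (size_t len = cap; len >= 1; --len) {  // try longest first
            if (vocab.count(std::string(text.substr(pos, len)))) {
                best = len;
                break;
            }
        }
        if (best == 0) best = 1;  // unknown byte: pass it through
        out.push_back(std::string(text.substr(pos, best)));
        pos += best;
    }
    return out;
}
```

With the trie, the "try longest first" scan collapses into a single forward walk that remembers the last accepting state, so each byte is visited once.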

Benchmark (1MB of War and Peace, hyperfine -N, 500 runs):

             optimized     upstream
  1 B        260 µs        91 ms      (350×)
  1 MB       5.6 ms        258 ms     (46×)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rohangpta
Owner

Nice website and thank you for the patch! I will likely accept this.
