feat: compile-time trie tokenizer, mmap file reads #4
Open
eordano wants to merge 1 commit into rohangpta:main from
Conversation
gen_vocab.py builds a double-array tree from the vocabulary at code-gen
time. At runtime there is nothing to initialize — just two static
uint32 arrays and a greedy longest-match loop. File reads use mmap
instead of ifstream.
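The runtime half of that design can be sketched as below. This is not the PR's actual code; it is a minimal illustration of a double-array trie plus greedy longest-match loop, with hand-built tables encoding a toy vocabulary `{"a": 0, "ab": 1, "b": 2}` standing in for what `gen_vocab.py` would emit for the real 38k-token vocabulary (the real arrays would be `uint32`; `int32` is used here so `-1` can serve as a sentinel).

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Double-array trie: from state s on byte c, the candidate next state is
// base[s] + c, and the transition is valid iff check[next] == s.
// token[s] holds a token id when state s is a terminal, else -1.
struct Tables {
    int32_t base[128];
    int32_t check[128];
    int32_t token[128];
};

// Hand-built tables for the toy vocab {"a": 0, "ab": 1, "b": 2}.
// In the real design these are static arrays generated at code-gen time.
static const Tables T = [] {
    Tables t{};
    for (int i = 0; i < 128; ++i) { t.check[i] = -1; t.token[i] = -1; }
    t.base[0] = 1;                        // root: 'a' -> 1+97=98, 'b' -> 1+98=99
    t.check[98]  = 0;  t.token[98]  = 0;  // state 98  = "a"
    t.check[99]  = 0;  t.token[99]  = 2;  // state 99  = "b"
    t.base[98] = 3;                       // from "a": 'b' -> 3+98=101
    t.check[101] = 98; t.token[101] = 1;  // state 101 = "ab"
    return t;
}();

// Greedy longest match: from each position, walk the trie as far as the
// input allows, remembering the last terminal state seen, then emit it.
std::vector<int32_t> tokenize(std::string_view text) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        int32_t state = 0, best_token = -1;
        size_t best_len = 0;
        for (size_t i = pos; i < text.size(); ++i) {
            int32_t next = T.base[state] + static_cast<unsigned char>(text[i]);
            if (next < 0 || next >= 128 || T.check[next] != state) break;
            state = next;
            if (T.token[state] >= 0) {
                best_token = T.token[state];
                best_len = i - pos + 1;
            }
        }
        if (best_token < 0) { ++pos; continue; }  // no match: skip byte
                                                  // (real code may use a byte fallback)
        out.push_back(best_token);
        pos += best_len;
    }
    return out;
}
```

Because the tables are plain static arrays, there is nothing to construct at startup; each transition is two array reads, which is where the cache-friendliness comes from.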
Benchmark (1 MB of War and Peace, hyperfine -N, 500 runs):

| Input | Optimized | Upstream | Speedup |
| ----- | --------- | -------- | ------- |
| 1 B   | 260 µs    | 91 ms    | 350×    |
| 1 MB  | 5.6 ms    | 258 ms   | 46×     |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**Owner**

Nice website and thank you for the patch! I will likely accept this.
Added a double-array trie to replace the runtime heap trie. It has far less initialization time than the vector-of-maps version, since the arrays are already in the memory-mapped binary, and the flat array lookups are more cache-friendly too.

I added a few other optimizations, trying to squeeze the most juice we can: mmap file reads save one memory copy, and the build uses `-c opt` for -O2.

Benchmarked with `hyperfine -N --warmup 10 --runs 500` on https://www.gutenberg.org/cache/epub/2600/pg2600.txt on an average work laptop:
The gap at small inputs is explained by startup cost: upstream spends ~121 ms building a heap trie from the 38k tokens, while this branch is ready to tokenize as soon as the binary loads. At 1 MB, where tokenization dominates, mmap and better cache locality still yield a ~39× speedup.
The mmap read alone provides a ~2.4× speedup over ifstream (measured with the new trie in both cases):
Note: I didn't find an easy way to run Bazel 9 (I'm on 8), so I didn't update MODULE.bazel.lock to avoid noise from downgrading.
Thanks for this project! It inspired me to build an online visualization for tokenizing things using it at https://github.com/eordano/tokencount (live at https://tokencount.eordano.com/)