A byte-level Byte Pair Encoding (BPE) tokenizer in Go, following GPT-4's
pre-tokenization regex and supporting reserved special tokens (e.g.
<|endoftext|>) that are never split or merged.

go mod download
make build

./bin/bpe-tokenizer train -file=<path>
./bin/bpe-tokenizer encode -text="hello world" # → [104 9349 1294]
./bin/bpe-tokenizer decode -ids="104 9349 1294" # → hello world

encode and decode load vocab.model from the current directory, so run
train first.

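
The first ID in the encode example, 104, is the UTF-8 byte value of "h". In a byte-level BPE, every byte starts as its own token (IDs 0 to 255) and learned merges replace byte sequences with higher IDs. The sketch below shows only that starting point, before any merges; `baseIDs` is a hypothetical helper for illustration, not a function from this repository.

```go
package main

import "fmt"

// baseIDs returns the token IDs a byte-level tokenizer starts from
// before any merges are applied: one ID per UTF-8 byte of the input.
// Hypothetical helper, not part of this repository.
func baseIDs(text string) []int {
	raw := []byte(text)
	ids := make([]int, len(raw))
	for i, b := range raw {
		ids[i] = int(b)
	}
	return ids
}

func main() {
	fmt.Println(baseIDs("hello world"))
	// prints: [104 101 108 108 111 32 119 111 114 108 100]
}
```

With a trained vocab.model, most of these bytes are merged into larger tokens, which is why the encoded output above is much shorter.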
make download-simple-wikipedia # Simple Wikipedia → data/simple-wikipedia.txt
make download-wikitext2 # WikiText-2 v1 → data/wikitext2-{train,validation,test}.txt
make download-tinystories # TinyStories, ~2 GB → data/TinyStoriesV2-GPT4-*.txt
make download-openwebtext # OpenWebText sample, ~12 GB raw → data/owt_*.txt

<|endoftext|> is reserved by default (main.go) and gets a fixed ID right
after the byte range (256). To register a different set:

bpe.NewBPETokenizer([]string{"<|endoftext|>", "<|other|>", ...})

IDs are assigned in order (256, 257, …) and persisted in vocab.model.

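
The ID assignment described above can be sketched as follows. This is a minimal illustration, assuming special tokens take consecutive IDs starting at 256 in registration order; `assignSpecialIDs` and `byteRange` are hypothetical names and may not match the code in bpe/bpe.go.

```go
package main

import "fmt"

// byteRange is the number of raw byte tokens (IDs 0..255).
const byteRange = 256

// assignSpecialIDs gives each special token the next free ID after the
// byte range, in the order the tokens were registered.
// Hypothetical helper for illustration only.
func assignSpecialIDs(specialTokens []string) map[string]int {
	ids := make(map[string]int, len(specialTokens))
	for i, tok := range specialTokens {
		ids[tok] = byteRange + i
	}
	return ids
}

func main() {
	ids := assignSpecialIDs([]string{"<|endoftext|>", "<|other|>"})
	fmt.Println(ids["<|endoftext|>"], ids["<|other|>"]) // prints: 256 257
}
```

Because the IDs depend on registration order, changing the order of the slice changes the IDs, so a vocab.model trained with one order should presumably not be reused with another.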
In bpe/bpe.go:
- VOCAB_SIZE: total vocabulary size (default 10_000). Merge count is VOCAB_SIZE − 256 − len(specialTokens).
- GPT4_SPLIT_PATTERN: pre-tokenization regex (mirrors tiktoken's cl100k_base).

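
As a worked example of the merge-count formula: with the default vocabulary size of 10_000 and the single default special token, training learns 10_000 − 256 − 1 = 9743 merges. The sketch below just restates that arithmetic; `mergeCount` is a hypothetical helper, not a function from bpe/bpe.go.

```go
package main

import "fmt"

// mergeCount returns how many merges training learns: the total
// vocabulary minus the 256 byte tokens and the reserved special tokens.
// Hypothetical helper for illustration only.
func mergeCount(vocabSize, numSpecialTokens int) int {
	return vocabSize - 256 - numSpecialTokens
}

func main() {
	fmt.Println(mergeCount(10_000, 1)) // prints: 9743
	fmt.Println(mergeCount(10_000, 2)) // prints: 9742
}
```

Each extra special token therefore costs one learned merge at a fixed VOCAB_SIZE.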
- Karpathy, minbpe — https://github.com/karpathy/minbpe
- Let's build the GPT Tokenizer — https://www.youtube.com/watch?v=zduSFxRajkE