Releases: sign/utf8-tokenizer
v0.8.2
Bug Fix
- fix(logits-processor): Handle orphan continuation bytes after ASCII in `_analyze_utf8_state`. Previously, sequences like `[0x41, 0x80]` (an ASCII byte followed by an orphan continuation byte) caused a `KeyError: 'first_byte'` in `_select_continuation_mask` during `generate()`. ASCII bytes are now always treated as complete, so trailing orphan continuation bytes are properly ignored.
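The fixed behavior can be sketched as follows. The function names mirror the release note, but the implementation here is illustrative, not the library's actual code:

```python
def expected_length(first_byte: int) -> int:
    """Expected total length of a UTF-8 sequence given its lead byte (illustrative)."""
    if first_byte < 0x80:   # ASCII: always a complete character
        return 1
    if first_byte >= 0xF0:
        return 4
    if first_byte >= 0xE0:
        return 3
    if first_byte >= 0xC0:
        return 2
    return 1  # orphan continuation byte (0x80-0xBF): treated as complete


def analyze_utf8_state(byte_seq: list[int]) -> dict:
    """Scan backwards for a trailing incomplete multi-byte sequence.

    Returns {} when the sequence ends on a complete character, so no
    continuation mask needs to be selected downstream.
    """
    # Walk back over trailing continuation bytes (0b10xxxxxx)
    n_cont = 0
    i = len(byte_seq) - 1
    while i >= 0 and 0x80 <= byte_seq[i] <= 0xBF:
        n_cont += 1
        i -= 1
    if i < 0:
        return {}  # only orphan continuation bytes: nothing to constrain
    lead = byte_seq[i]
    need = expected_length(lead)
    if need == 1 or 1 + n_cont >= need:
        # ASCII lead (or an already-complete sequence): the trailing orphan
        # continuation bytes are ignored instead of raising KeyError
        return {}
    return {"first_byte": lead, "bytes_seen": 1 + n_cont}


# The previously failing case: ASCII followed by an orphan continuation byte
assert analyze_utf8_state([0x41, 0x80]) == {}
# A genuinely incomplete sequence still reports its lead byte
assert analyze_utf8_state([0xE2, 0x82]) == {"first_byte": 0xE2, "bytes_seen": 2}
```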
v0.8.0
What's Changed
Bug Fixes
- byte-embeddings: Defer bits table creation so `torch.device('meta')` contexts (Transformers v5 `from_pretrained`) never produce a corrupted base table
- logits-processor: Only constrain continuation bytes in incomplete UTF-8 sequences; skip masking entirely when the current character is complete
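The lighter-touch masking rule can be sketched in plain Python (function names and the list-based logits are illustrative stand-ins, not the library's tensor code):

```python
def continuation_bytes_owed(byte_seq: list[int]) -> int:
    """How many continuation bytes the trailing sequence still needs (0 if complete)."""
    n = 0
    i = len(byte_seq) - 1
    while i >= 0 and 0x80 <= byte_seq[i] <= 0xBF:  # trailing continuation bytes
        n += 1
        i -= 1
    if i < 0:
        return 0  # only orphan continuation bytes: nothing owed
    lead = byte_seq[i]
    if lead < 0x80:
        need = 1  # ASCII is always complete
    elif lead >= 0xF0:
        need = 4
    elif lead >= 0xE0:
        need = 3
    elif lead >= 0xC0:
        need = 2
    else:
        need = 1
    return max(need - 1 - n, 0)


def mask_logits(logits: list[float], byte_seq: list[int]) -> list[float]:
    """Only constrain mid-character; otherwise leave the logits untouched."""
    if continuation_bytes_owed(byte_seq) == 0:
        return logits  # current character is complete: skip masking entirely
    # Mid-sequence: only continuation bytes (0x80-0xBF) are valid next
    return [x if 0x80 <= tok <= 0xBF else float("-inf")
            for tok, x in enumerate(logits)]


logits = [0.0] * 256
assert mask_logits(logits, [0x41]) is logits        # complete character: untouched
masked = mask_logits(logits, [0xE2, 0x82])          # 3-byte sequence, one byte owed
assert masked[0x80] == 0.0 and masked[0x41] == float("-inf")
```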
Tests
- Add meta-device init test for `PatchedBitEmbeddings`
- Add generation identity test verifying the processor doesn't alter output on valid UTF-8 models
- Update logits processor tests to reflect lighter-touch masking behavior
Chores
- Clean up imports and modernize type hints
Full Changelog: v0.7.1...v0.8.0
v0.7.1
v0.7.0
v0.6.4
v0.6.3
v0.6.2
v0.6.1
Changes
- Remove tokenizer dependency from CharacterCausalLMWrapper
- Add `inputs_embeds` support to forward pass
- Extract loss computation to compiled method for performance
- Add UTF-32 byte restrictions for valid codepoints
- Add `torch.compile` decorators for performance optimization
- Fix double-wrapping check in training script
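The "UTF-32 byte restrictions" reduce to constraining generated values to valid Unicode scalar values: at most `0x10FFFF`, excluding the surrogate range. A sketch of the underlying rule (illustrative, not the library's code):

```python
def is_valid_utf32_codepoint(cp: int) -> bool:
    """Valid Unicode scalar values: 0..0x10FFFF, excluding surrogates D800-DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# Per-byte consequences for a big-endian uint32 code point: byte 0 must be
# 0x00, byte 1 must be <= 0x10, and when byte 1 is 0x00 the remaining two
# bytes must not land in the surrogate range.
assert is_valid_utf32_codepoint(ord("A"))
assert is_valid_utf32_codepoint(0x10FFFF)
assert not is_valid_utf32_codepoint(0xD800)    # surrogate
assert not is_valid_utf32_codepoint(0x110000)  # above the Unicode range
```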
v0.6.0
What's New
UTF-16 and UTF-32 Support
- Added `UTF16Tokenizer` and `UTF32Tokenizer` classes for multi-byte Unicode encodings
- New `CharacterEmbedding` class for efficient byte-level embeddings of UTF-16/UTF-32 tokens
- New `CharacterCausalLMWrapper` for autoregressive generation with character-level models
Changes
- Replaced `groups` module with streamlined `CharacterEmbedding` implementation
- Updated `run_clm.py` to use `--encoding` parameter (utf8/utf16/utf32) instead of `--group_bytes`
- Added training scripts for UTF-16 and UTF-32 language models
Improvements
- Optimized byte splitting using vectorized broadcast bit shift operations
- Support for `inputs_embeds` in `CharacterCausalLMWrapper.generate()`
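The "vectorized broadcast bit shift" byte splitting can be illustrated with NumPy standing in for torch (the function name and shapes here are assumptions, not the library's API):

```python
import numpy as np


def split_bytes(codepoints: np.ndarray, width: int) -> np.ndarray:
    """Split uint16/uint32 tokens into big-endian bytes with a single
    broadcast shift instead of a Python loop over positions."""
    shifts = np.arange(width - 1, -1, -1, dtype=np.uint32) * 8  # e.g. [24, 16, 8, 0]
    # (n, 1) >> (width,) broadcasts to (n, width): every byte of every
    # token is extracted in one vectorized operation
    return ((codepoints[:, None].astype(np.uint64) >> shifts) & 0xFF).astype(np.uint8)


cps = np.array([0x1F600, 0x41], dtype=np.uint32)  # U+1F600 and "A" as UTF-32 code points
assert split_bytes(cps, 4).tolist() == [[0x00, 0x01, 0xF6, 0x00], [0x00, 0x00, 0x00, 0x41]]
```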
v0.5.0
What's New
UTF-16 and UTF-32 Tokenizer Support
- New `UTFTokenizer` base class - Generalized architecture supporting multiple Unicode encodings
- `UTF16Tokenizer` - Tokenize text as UTF-16 code units (uint16)
- `UTF32Tokenizer` - Tokenize text as UTF-32 code points (uint32)
- Performance improvements - `__slots__` and local variable caching in hot paths
- Renamed `embeddings.py` to `byte_embeddings.py` for clarity
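The two micro-optimizations mentioned, `__slots__` and local variable caching, sketched on a generic hot path (the class and method are illustrative, not taken from the library):

```python
class TokenizerState:
    # __slots__ removes the per-instance __dict__, shrinking memory and
    # speeding up attribute access in hot paths
    __slots__ = ("byte_map", "offset")

    def __init__(self):
        self.byte_map = {i: i for i in range(256)}
        self.offset = 0

    def encode(self, data: bytes) -> list[int]:
        # Local-variable caching: bind the attribute lookups once before
        # the loop instead of resolving self.byte_map per iteration
        byte_map = self.byte_map
        offset = self.offset
        return [byte_map[b] + offset for b in data]


state = TokenizerState()
assert state.encode(b"AB") == [0x41, 0x42]
```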
Breaking Changes
- `pad_bytearrays_to_tensor` now requires `dtype` as second argument
Example Usage
```python
from utf8_tokenizer import UTF8Tokenizer, UTF16Tokenizer, UTF32Tokenizer

# UTF-8 (1-4 bytes per character)
utf8 = UTF8Tokenizer()
utf8.torch(["hello"], padding=True)

# UTF-16 (2 bytes per code unit, surrogate pairs for emoji)
utf16 = UTF16Tokenizer()
utf16.torch(["hello"], padding=True)

# UTF-32 (4 bytes per character, one code point each)
utf32 = UTF32Tokenizer()
utf32.torch(["hello"], padding=True)
```