Releases: sign/utf8-tokenizer
v0.8.2
Bug Fix
- fix(logits-processor): Handle orphan continuation bytes after ASCII in `_analyze_utf8_state`. Previously, sequences like `[0x41, 0x80]` (an ASCII byte followed by an orphan continuation byte) caused a `KeyError: 'first_byte'` in `_select_continuation_mask` during `generate()`. ASCII bytes are now always treated as complete, so trailing orphan continuation bytes are properly ignored.
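The fixed behavior can be sketched as follows. The function names mirror the release note, but the implementation here is illustrative, not the library's actual code:

```python
def expected_length(first_byte: int) -> int:
    """Expected total length of a UTF-8 sequence given its lead byte (illustrative)."""
    if first_byte < 0x80:   # ASCII: always a complete character
        return 1
    if first_byte >= 0xF0:
        return 4
    if first_byte >= 0xE0:
        return 3
    if first_byte >= 0xC0:
        return 2
    return 1  # orphan continuation byte (0x80-0xBF): treated as complete


def analyze_utf8_state(byte_seq: list[int]) -> dict:
    """Scan backwards for a trailing incomplete multi-byte sequence.

    Returns {} when the sequence ends on a complete character, so no
    continuation mask needs to be selected downstream.
    """
    # Walk back over trailing continuation bytes (0b10xxxxxx)
    n_cont = 0
    i = len(byte_seq) - 1
    while i >= 0 and 0x80 <= byte_seq[i] <= 0xBF:
        n_cont += 1
        i -= 1
    if i < 0:
        return {}  # only orphan continuation bytes: nothing to constrain
    lead = byte_seq[i]
    need = expected_length(lead)
    if need == 1 or 1 + n_cont >= need:
        # ASCII lead (or an already-complete sequence): the trailing orphan
        # continuation bytes are ignored instead of raising KeyError
        return {}
    return {"first_byte": lead, "bytes_seen": 1 + n_cont}


# The previously failing case: ASCII followed by an orphan continuation byte
assert analyze_utf8_state([0x41, 0x80]) == {}
# A genuinely incomplete sequence still reports its lead byte
assert analyze_utf8_state([0xE2, 0x82]) == {"first_byte": 0xE2, "bytes_seen": 2}
```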
v0.8.0
What's Changed
Bug Fixes
- byte-embeddings: Defer bits table creation so `torch.device('meta')` contexts (Transformers v5 `from_pretrained`) never produce a corrupted base table
- logits-processor: Only constrain continuation bytes in incomplete UTF-8 sequences; skip masking entirely when the current character is complete
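The lighter-touch masking rule can be sketched in plain Python (function names and the list-based logits are illustrative stand-ins, not the library's tensor code):

```python
def continuation_bytes_owed(byte_seq: list[int]) -> int:
    """How many continuation bytes the trailing sequence still needs (0 if complete)."""
    n = 0
    i = len(byte_seq) - 1
    while i >= 0 and 0x80 <= byte_seq[i] <= 0xBF:  # trailing continuation bytes
        n += 1
        i -= 1
    if i < 0:
        return 0  # only orphan continuation bytes: nothing owed
    lead = byte_seq[i]
    if lead < 0x80:
        need = 1  # ASCII is always complete
    elif lead >= 0xF0:
        need = 4
    elif lead >= 0xE0:
        need = 3
    elif lead >= 0xC0:
        need = 2
    else:
        need = 1
    return max(need - 1 - n, 0)


def mask_logits(logits: list[float], byte_seq: list[int]) -> list[float]:
    """Only constrain mid-character; otherwise leave the logits untouched."""
    if continuation_bytes_owed(byte_seq) == 0:
        return logits  # current character is complete: skip masking entirely
    # Mid-sequence: only continuation bytes (0x80-0xBF) are valid next
    return [x if 0x80 <= tok <= 0xBF else float("-inf")
            for tok, x in enumerate(logits)]


logits = [0.0] * 256
assert mask_logits(logits, [0x41]) is logits        # complete character: untouched
masked = mask_logits(logits, [0xE2, 0x82])          # 3-byte sequence, one byte owed
assert masked[0x80] == 0.0 and masked[0x41] == float("-inf")
```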
Tests
- Add meta-device init test for `PatchedBitEmbeddings`
- Add generation identity test verifying the processor doesn't alter output on valid UTF-8 models
- Update logits processor tests to reflect lighter-touch masking behavior
Chores
- Clean up imports and modernize type hints
Full Changelog: v0.7.1...v0.8.0
v0.7.1
v0.7.0
v0.6.4
v0.6.3
v0.6.2
v0.6.1
Changes
- Remove tokenizer dependency from CharacterCausalLMWrapper
- Add `inputs_embeds` support to forward pass
- Extract loss computation to compiled method for performance
- Add UTF-32 byte restrictions for valid codepoints
- Add `torch.compile` decorators for performance optimization
- Fix double-wrapping check in training script
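The "UTF-32 byte restrictions" reduce to constraining generated values to valid Unicode scalar values: at most `0x10FFFF`, excluding the surrogate range. A sketch of the underlying rule (illustrative, not the library's code):

```python
def is_valid_utf32_codepoint(cp: int) -> bool:
    """Valid Unicode scalar values: 0..0x10FFFF, excluding surrogates D800-DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# Per-byte consequences for a big-endian uint32 code point: byte 0 must be
# 0x00, byte 1 must be <= 0x10, and when byte 1 is 0x00 the remaining two
# bytes must not land in the surrogate range.
assert is_valid_utf32_codepoint(ord("A"))
assert is_valid_utf32_codepoint(0x10FFFF)
assert not is_valid_utf32_codepoint(0xD800)    # surrogate
assert not is_valid_utf32_codepoint(0x110000)  # above the Unicode range
```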
v0.6.0
What's New
UTF-16 and UTF-32 Support
- Added `UTF16Tokenizer` and `UTF32Tokenizer` classes for multi-byte Unicode encodings
- New `CharacterEmbedding` class for efficient byte-level embeddings of UTF-16/UTF-32 tokens
- New `CharacterCausalLMWrapper` for autoregressive generation with character-level models
Changes
- Replaced `groups` module with streamlined `CharacterEmbedding` implementation
- Updated `run_clm.py` to use `--encoding` parameter (utf8/utf16/utf32) instead of `--group_bytes`
- Added training scripts for UTF-16 and UTF-32 language models
Improvements
- Optimized byte splitting using vectorized broadcast bit shift operations
- Support for `inputs_embeds` in `CharacterCausalLMWrapper.generate()`
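The "vectorized broadcast bit shift" byte splitting can be illustrated with NumPy standing in for torch (the function name and shapes here are assumptions, not the library's API):

```python
import numpy as np


def split_bytes(codepoints: np.ndarray, width: int) -> np.ndarray:
    """Split uint16/uint32 tokens into big-endian bytes with a single
    broadcast shift instead of a Python loop over positions."""
    shifts = np.arange(width - 1, -1, -1, dtype=np.uint32) * 8  # e.g. [24, 16, 8, 0]
    # (n, 1) >> (width,) broadcasts to (n, width): every byte of every
    # token is extracted in one vectorized operation
    return ((codepoints[:, None].astype(np.uint64) >> shifts) & 0xFF).astype(np.uint8)


cps = np.array([0x1F600, 0x41], dtype=np.uint32)  # U+1F600 and "A" as UTF-32 code points
assert split_bytes(cps, 4).tolist() == [[0x00, 0x01, 0xF6, 0x00], [0x00, 0x00, 0x00, 0x41]]
```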
v0.5.0
What's New
UTF-16 and UTF-32 Tokenizer Support
- New `UTFTokenizer` base class - Generalized architecture supporting multiple Unicode encodings
- `UTF16Tokenizer` - Tokenize text as UTF-16 code units (uint16)
- `UTF32Tokenizer` - Tokenize text as UTF-32 code points (uint32)
- Performance improvements - `__slots__` and local variable caching in hot paths
- Renamed `embeddings.py` to `byte_embeddings.py` for clarity
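The two micro-optimizations mentioned, `__slots__` and local variable caching, sketched on a generic hot path (the class and method are illustrative, not taken from the library):

```python
class TokenizerState:
    # __slots__ removes the per-instance __dict__, shrinking memory and
    # speeding up attribute access in hot paths
    __slots__ = ("byte_map", "offset")

    def __init__(self):
        self.byte_map = {i: i for i in range(256)}
        self.offset = 0

    def encode(self, data: bytes) -> list[int]:
        # Local-variable caching: bind the attribute lookups once before
        # the loop instead of resolving self.byte_map per iteration
        byte_map = self.byte_map
        offset = self.offset
        return [byte_map[b] + offset for b in data]


state = TokenizerState()
assert state.encode(b"AB") == [0x41, 0x42]
```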
Breaking Changes
- `pad_bytearrays_to_tensor` now requires `dtype` as second argument
Example Usage
```python
from utf8_tokenizer import UTF8Tokenizer, UTF16Tokenizer, UTF32Tokenizer

# UTF-8 (1-4 bytes per character)
utf8 = UTF8Tokenizer()
utf8.torch(["hello"], padding=True)

# UTF-16 (2 bytes per code unit, surrogate pairs for emoji)
utf16 = UTF16Tokenizer()
utf16.torch(["hello"], padding=True)

# UTF-32 (4 bytes per character, one code point each)
utf32 = UTF32Tokenizer()
utf32.torch(["hello"], padding=True)
```