Releases: sign/utf8-tokenizer

v0.8.2

17 Feb 21:12

Bug Fix

  • fix(logits-processor): Handle orphan continuation bytes after ASCII in _analyze_utf8_state. Previously, sequences like [0x41, 0x80] (ASCII followed by orphan continuation byte) caused a KeyError: 'first_byte' in _select_continuation_mask during generate(). ASCII bytes are now always treated as complete, so trailing orphan continuation bytes are properly ignored.
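
The fixed behavior can be illustrated with a minimal sketch of the trailing-state analysis. Function name, return shape, and dict keys here are illustrative, not the library's actual internals; only the rule matters: ASCII bytes are complete, so continuation bytes after them are orphans and produce no constraint.

```python
def analyze_utf8_state(byte_seq):
    """Return the trailing incomplete UTF-8 sequence, if any (sketch)."""
    # Walk back over trailing continuation bytes (10xxxxxx).
    i = len(byte_seq)
    while i > 0 and 0x80 <= byte_seq[i - 1] <= 0xBF:
        i -= 1
    if i == 0:
        return None  # nothing but orphan continuation bytes
    first = byte_seq[i - 1]
    if first < 0x80:
        # ASCII is always complete: trailing continuation bytes
        # are orphans and are ignored (the fixed case).
        return None
    # Expected sequence length from the lead byte.
    if first >= 0xF0:
        expected = 4
    elif first >= 0xE0:
        expected = 3
    elif first >= 0xC0:
        expected = 2
    else:
        return None
    seen = len(byte_seq) - i + 1
    if seen < expected:
        return {"first_byte": first, "bytes_seen": seen}
    return None

print(analyze_utf8_state([0x41, 0x80]))  # None: 'A' is complete, 0x80 is an orphan
print(analyze_utf8_state([0xE2, 0x82]))  # incomplete 3-byte sequence, masking applies
```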

v0.8.0

09 Feb 13:09

What's Changed

Bug Fixes

  • byte-embeddings: Defer bits table creation so torch.device('meta') contexts (Transformers v5 from_pretrained) never produce a corrupted base table
  • logits-processor: Only constrain continuation bytes in incomplete UTF-8 sequences; skip masking entirely when the current character is complete
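
The byte-embeddings fix follows a lazy-initialization pattern: build derived tables on first real use rather than in `__init__`, so construction under `torch.device('meta')` creates no placeholder data. A minimal sketch with hypothetical names (the actual module is more involved):

```python
import torch
import torch.nn as nn

class BitEmbeddings(nn.Module):
    """Sketch: defer the bits table so meta-device init stays safe."""

    def __init__(self, num_bytes: int = 256, bits: int = 8):
        super().__init__()
        self.num_bytes = num_bytes
        self.bits = bits
        self._bits_table = None  # deferred: nothing built under 'meta'

    def _get_bits_table(self, device: torch.device) -> torch.Tensor:
        if self._bits_table is None or self._bits_table.device != device:
            ids = torch.arange(self.num_bytes, device=device)
            shifts = torch.arange(self.bits, device=device)
            # (256, 8) bit table, created lazily on the *real* device
            self._bits_table = ((ids[:, None] >> shifts) & 1).float()
        return self._bits_table

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        return self._get_bits_table(byte_ids.device)[byte_ids]

# Constructing under 'meta' (as Transformers from_pretrained does)
# creates no table at all, so nothing can be corrupted:
with torch.device("meta"):
    emb = BitEmbeddings()
out = emb(torch.tensor([0, 65, 255]))  # table built here, on CPU
```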

Tests

  • Add meta-device init test for PatchedBitEmbeddings
  • Add generation identity test verifying the processor doesn't alter output on valid UTF-8 models
  • Update logits processor tests to reflect lighter-touch masking behavior

Chores

  • Clean up imports and modernize type hints

Full Changelog: v0.7.1...v0.8.0

v0.7.1

31 Jan 08:36

Bug Fixes

  • char-embeddings: Remove explicit @torch.compile() decorators that cause hangs on macOS. Users can apply torch.compile at the model level if needed.
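
Opting back in is a one-liner at the call site; the model stands in for the character-level wrapper here:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)          # stand-in for the character-level model
compiled = torch.compile(model)  # opt-in, applied at the model level
# On macOS (or wherever compilation hangs), skip this step and call
# `model` directly; the output is identical, only speed differs.
```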

v0.7.0

30 Jan 17:33

Bug Fixes

  • byte-embeddings: Use correct tie_weights method for HuggingFace Transformers v5 (the method was renamed from tie_embeddings in v5)
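
Code that must run against both Transformers versions can dispatch on whichever hook is present. A hypothetical compatibility shim (not the library's actual code), using the method names from the note above:

```python
def tie_model_weights(model):
    """Call whichever weight-tying hook the installed Transformers has.

    Per the release note, the hook is `tie_weights` in Transformers v5
    and was `tie_embeddings` before.
    """
    if hasattr(model, "tie_weights"):
        model.tie_weights()
    elif hasattr(model, "tie_embeddings"):
        model.tie_embeddings()

class DummyV5Model:
    """Stand-in exposing only the v5-style hook."""
    def __init__(self):
        self.tied = False
    def tie_weights(self):
        self.tied = True

m = DummyV5Model()
tie_model_weights(m)  # m.tied is now True
```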

v0.6.4

25 Jan 13:13

Bug Fixes

  • byte-embeddings: Improve torch.compile compatibility
    • Disable torch dynamo on _needs_refresh method
    • Use torch.nn.functional.embedding for proper gradient handling
    • Update tests to use torch.long dtype

v0.6.3

12 Jan 20:41

Changes

  • Add load_base_config parameter to CharacterCausalLMConfig to skip loading base model config in tests
  • Fix UTF-32 test to use valid Unicode codepoints only (≤ 0x10FFFF)
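
The codepoint restriction exists because UTF-32 can physically store any 32-bit value, but only code points up to 0x10FFFF (and outside the surrogate range) are valid Unicode. An illustrative check (helper name is ours, not the library's):

```python
def is_valid_codepoint(cp: int) -> bool:
    """True if cp is a Unicode scalar value usable in UTF-32 test data."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

print(is_valid_codepoint(0x1F600))   # True: an emoji code point
print(is_valid_codepoint(0x110000))  # False: beyond Unicode's range
print(is_valid_codepoint(0xD800))    # False: surrogate, not a scalar value
```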

v0.6.2

12 Jan 20:35

Changes

  • Inherit base model config attributes in CharacterCausalLMConfig

v0.6.1

12 Jan 13:59

Changes

  • Remove tokenizer dependency from CharacterCausalLMWrapper
  • Add inputs_embeds support to forward pass
  • Extract loss computation to compiled method for performance
  • Add UTF-32 byte restrictions for valid codepoints
  • Add torch.compile decorators for performance optimization
  • Fix double-wrapping check in training script

v0.6.0

11 Jan 08:47

What's New

UTF-16 and UTF-32 Support

  • Added UTF16Tokenizer and UTF32Tokenizer classes for multi-byte Unicode encodings
  • New CharacterEmbedding class for efficient byte-level embeddings of UTF-16/UTF-32 tokens
  • New CharacterCausalLMWrapper for autoregressive generation with character-level models

Changes

  • Replaced groups module with streamlined CharacterEmbedding implementation
  • Updated run_clm.py to use --encoding parameter (utf8/utf16/utf32) instead of --group_bytes
  • Added training scripts for UTF-16 and UTF-32 language models

Improvements

  • Optimized byte splitting using vectorized broadcast bit shift operations
  • Support for inputs_embeds in CharacterCausalLMWrapper.generate()
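
The byte-splitting optimization can be sketched as a single broadcast expression: a shift vector against the code-unit tensor, instead of a per-byte loop. Function name and big-endian byte order are illustrative choices, not necessarily the library's:

```python
import torch

def split_bytes(code_units: torch.Tensor, num_bytes: int) -> torch.Tensor:
    """Split integer code units into bytes with one broadcast shift.

    (..., ) int tensor -> (..., num_bytes) uint8 tensor, big-endian.
    """
    shifts = torch.arange(num_bytes - 1, -1, -1, device=code_units.device) * 8
    return ((code_units.unsqueeze(-1) >> shifts) & 0xFF).to(torch.uint8)

units = torch.tensor([0x20AC], dtype=torch.int64)  # '€' as a UTF-32 code point
print(split_bytes(units, 4))  # one row: 0x00 0x00 0x20 0xAC
```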

v0.5.0

09 Jan 15:08

What's New

UTF-16 and UTF-32 Tokenizer Support

  • New UTFTokenizer base class - Generalized architecture supporting multiple Unicode encodings
  • UTF16Tokenizer - Tokenize text as UTF-16 code units (uint16)
  • UTF32Tokenizer - Tokenize text as UTF-32 code points (uint32)
  • Performance improvements - __slots__ and local variable caching in hot paths
  • Renamed embeddings.py to byte_embeddings.py for clarity

Breaking Changes

  • pad_bytearrays_to_tensor now requires dtype as second argument
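
The dtype became explicit because the right element width depends on the encoding (uint8 for UTF-8 bytes, wider types for UTF-16/UTF-32 code units). A minimal re-implementation sketch of such a helper, NOT the library's code, to show what the second argument controls:

```python
import torch

def pad_bytearrays_to_tensor_sketch(arrays, dtype):
    """Right-pad variable-length byte sequences into one 2-D tensor."""
    max_len = max(len(a) for a in arrays)
    out = torch.zeros(len(arrays), max_len, dtype=dtype)
    for i, a in enumerate(arrays):
        out[i, : len(a)] = torch.tensor(list(a), dtype=dtype)
    return out

padded = pad_bytearrays_to_tensor_sketch([b"hi", b"hello"], torch.uint8)
# shape (2, 5); row 0 is [104, 105, 0, 0, 0]
```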

Example Usage

```python
from utf8_tokenizer import UTF8Tokenizer, UTF16Tokenizer, UTF32Tokenizer

# UTF-8 (1-4 bytes per character)
utf8 = UTF8Tokenizer()
utf8.torch(["hello"], padding=True)

# UTF-16 (2 bytes per code unit, surrogate pairs for emoji)
utf16 = UTF16Tokenizer()
utf16.torch(["hello"], padding=True)

# UTF-32 (4 bytes per character, one code point each)
utf32 = UTF32Tokenizer()
utf32.torch(["hello"], padding=True)
```