
Fix bytelevel decode of added tokens + 27x faster deserialization#1995

Open
ArthurZucker wants to merge 10 commits into `main` from `fix-byte-norm`

Conversation


@ArthurZucker (Collaborator) commented Mar 27, 2026

  1. When adding a token, normalize it if `normalized` is set.
  2. Replace the aho-corasick crate with daachorse.
  3. Simplify the matching algorithm.
  4. Add a Python test for ByteLevel decode with a normalizer.
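To see why item 1 matters for ByteLevel decode, here is a hedged, pure-Python sketch of the GPT-2 byte-level mapping that a ByteLevel step applies. The function names (`byte_level_normalize`, `byte_level_decode`) are ours, not the tokenizers API; the table itself is the well-known `bytes_to_unicode` mapping.

```python
def bytes_to_unicode() -> dict:
    """Map every byte 0..255 to a printable unicode character (GPT-2 style)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes into a spare range
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_normalize(s: str) -> str:
    """What a ByteLevel step does to token content."""
    return "".join(BYTE_TO_CHAR[b] for b in s.encode("utf-8"))

def byte_level_decode(s: str) -> str:
    """Inverse mapping, as a ByteLevel decoder would apply."""
    return bytes(CHAR_TO_BYTE[c] for c in s).decode("utf-8")

# "Ž" is two UTF-8 bytes, so its byte-level form is the two chars "Å½".
# If an added token is stored un-normalized, the matcher never sees "Å½"
# in the byte-level stream, and the decode round trip breaks.
assert byte_level_normalize("Ž") == "Å½"
assert byte_level_decode(byte_level_normalize("Za\rnimokućameđa")) == "Za\rnimokućameđa"
```

This is why the added token's content has to go through the normalizer before it is inserted into the matcher.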

Todo: check whether ByteLevel performs better as a normalizer or as a pre-tokenizer.

Benchmarks: main vs this branch

non special deserialize_added_vocab_100000_norm_none
                        time:   [7.5797 s 7.6353 s 7.6861 s]
                        change: [+2386.1% +2424.7% +2460.7%] (p = 0.00 < 0.10)
                        Performance has regressed.

daachorse is just insane here.

[image: benchmark results]

Gains are mostly on non-special tokens (probably because the default for non-special tokens is normalized == true).

Also, previously every `add_tokens` call re-normalized the entire added vocabulary to rebuild the matcher; now only the new tokens are normalized. This can produce stale results if the normalizer changes afterwards, though. We could guard against that by making the `tokenizer.normalizer` setter refresh the added tokens.
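The guard idea above can be sketched in a few lines. This is a hypothetical pure-Python model, not the actual tokenizers implementation: the class and attribute names are illustrative, and the normalizer is modeled as a plain callable.

```python
class AddedVocabulary:
    """Sketch: cache the normalized form of each added token when it is added
    (the fast path from this PR), and rebuild the whole cache only when the
    normalizer itself is replaced."""

    def __init__(self, normalizer=None):
        self._normalizer = normalizer or (lambda s: s)
        self._tokens = []    # raw token contents
        self._patterns = []  # cached normalized forms used by the matcher

    def add_tokens(self, tokens):
        # Only the new tokens are normalized, never the whole vocab.
        for t in tokens:
            self._tokens.append(t)
            self._patterns.append(self._normalizer(t))

    @property
    def normalizer(self):
        return self._normalizer

    @normalizer.setter
    def normalizer(self, fn):
        # The guard: a new normalizer invalidates every cached pattern,
        # so rebuild them all exactly once here.
        self._normalizer = fn
        self._patterns = [fn(t) for t in self._tokens]

vocab = AddedVocabulary(normalizer=str.lower)
vocab.add_tokens(["Hello", "WORLD"])
assert vocab._patterns == ["hello", "world"]
vocab.normalizer = str.upper  # setter refreshes the cache
assert vocab._patterns == ["HELLO", "WORLD"]
```

The cost of a normalizer swap is paid once in the setter instead of on every `add_tokens` call, which is the trade-off the paragraph above describes.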

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from McPatate March 27, 2026 21:28
@ArthurZucker changed the title from "Draft update of added token refreshing which is a bottleneck" to "Fix bytelevel decode of added tokens" on Mar 27, 2026
# `new_tokens` is the list of AddedToken instances added earlier in this test
print(tokenizer.get_added_tokens_decoder())
enc = tokenizer.encode(new_tokens[0].content + new_tokens[1].content + " " + new_tokens[2].content)
print(enc)
assert tokenizer.decode(enc.ids, skip_special_tokens=False) == 'Za\rnimokućameđa'

The 1st token has normalized=False, so it is broken, but the next 2 are properly reconstructed.

@ArthurZucker changed the title from "Fix bytelevel decode of added tokens" to "Fix bytelevel decode of added tokens + 27x faster deserialization" on Mar 27, 2026