
Fix bytelevel decode of added tokens + 27x faster deserialization#1995

Open
ArthurZucker wants to merge 10 commits into `main` from `fix-byte-norm`

Conversation


@ArthurZucker (Collaborator) commented Mar 27, 2026

  1. When adding a token, normalize it if `normalized` is set.
  2. Replace the aho-corasick crate with daachorse.
  3. Simplify the matching algorithm.
  4. Add a Python test for ByteLevel decode with a normalizer.
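To see why item 1 matters for ByteLevel decode, here is a hedged, pure-Python sketch of the GPT-2 byte-level mapping that a ByteLevel step applies. The function names (`byte_level_normalize`, `byte_level_decode`) are ours, not the tokenizers API; the table itself is the well-known `bytes_to_unicode` mapping.

```python
def bytes_to_unicode() -> dict:
    """Map every byte 0..255 to a printable unicode character (GPT-2 style)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes into a spare range
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_normalize(s: str) -> str:
    """What a ByteLevel step does to token content."""
    return "".join(BYTE_TO_CHAR[b] for b in s.encode("utf-8"))

def byte_level_decode(s: str) -> str:
    """Inverse mapping, as a ByteLevel decoder would apply."""
    return bytes(CHAR_TO_BYTE[c] for c in s).decode("utf-8")

# "Ž" is two UTF-8 bytes, so its byte-level form is the two chars "Å½".
# If an added token is stored un-normalized, the matcher never sees "Å½"
# in the byte-level stream, and the decode round trip breaks.
assert byte_level_normalize("Ž") == "Å½"
assert byte_level_decode(byte_level_normalize("Za\rnimokućameđa")) == "Za\rnimokućameđa"
```

This is why the added token's content has to go through the normalizer before it is inserted into the matcher.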

Todo: check whether ByteLevel performs better as a normalizer or as a pre-tokenizer.

Benchmarks: main vs this branch

non special deserialize_added_vocab_100000_norm_none
                        time:   [7.5797 s 7.6353 s 7.6861 s]
                        change: [+2386.1% +2424.7% +2460.7%] (p = 0.00 < 0.10)
                        Performance has regressed.

daachorse is just insane here.

[image: benchmark results]

Gains are mostly on non-special tokens (probably because the default for non-special tokens is normalized == true).

Also, previously every `add_tokens` call re-normalized the entire added vocabulary to rebuild the matcher; now only the new tokens are normalized. This can produce stale results if the normalizer changes afterwards, though. We could guard against that by making the `tokenizer.normalizer` setter refresh the added tokens.
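The guard idea above can be sketched in a few lines. This is a hypothetical pure-Python model, not the actual tokenizers implementation: the class and attribute names are illustrative, and the normalizer is modeled as a plain callable.

```python
class AddedVocabulary:
    """Sketch: cache the normalized form of each added token when it is added
    (the fast path from this PR), and rebuild the whole cache only when the
    normalizer itself is replaced."""

    def __init__(self, normalizer=None):
        self._normalizer = normalizer or (lambda s: s)
        self._tokens = []    # raw token contents
        self._patterns = []  # cached normalized forms used by the matcher

    def add_tokens(self, tokens):
        # Only the new tokens are normalized, never the whole vocab.
        for t in tokens:
            self._tokens.append(t)
            self._patterns.append(self._normalizer(t))

    @property
    def normalizer(self):
        return self._normalizer

    @normalizer.setter
    def normalizer(self, fn):
        # The guard: a new normalizer invalidates every cached pattern,
        # so rebuild them all exactly once here.
        self._normalizer = fn
        self._patterns = [fn(t) for t in self._tokens]

vocab = AddedVocabulary(normalizer=str.lower)
vocab.add_tokens(["Hello", "WORLD"])
assert vocab._patterns == ["hello", "world"]
vocab.normalizer = str.upper  # setter refreshes the cache
assert vocab._patterns == ["HELLO", "WORLD"]
```

The cost of a normalizer swap is paid once in the setter instead of on every `add_tokens` call, which is the trade-off the paragraph above describes.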

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from McPatate March 27, 2026 21:28
@ArthurZucker changed the title from "Draft update of added token refreshing which is a bottleneck" to "Fix bytelevel decode of added tokens" on Mar 27, 2026
# `new_tokens` is the list of AddedToken instances added earlier in this test
print(tokenizer.get_added_tokens_decoder())
enc = tokenizer.encode(new_tokens[0].content + new_tokens[1].content + " " + new_tokens[2].content)
print(enc)
assert tokenizer.decode(enc.ids, skip_special_tokens=False) == 'Za\rnimokućameđa'

The 1st token has normalized=False, so it is broken, but the next 2 are properly reconstructed.

@ArthurZucker changed the title from "Fix bytelevel decode of added tokens" to "Fix bytelevel decode of added tokens + 27x faster deserialization" on Mar 27, 2026