Handle multibyte characters correctly in process (fixes #31) by Saito-K03 · Pull Request #35 · ufal/SimulStreaming

Saito-K03 · 2026-02-27T06:14:31Z

Summary

This PR fixes issue #31 by checking the raw bytes of decoded tokens, which prevents failures caused by multibyte characters (3 bytes each in UTF-8, as in Japanese).

Background

In some cases, text containing 3-byte characters (e.g., Japanese) was processed incorrectly and raised a “string index out of range” error.
The failing cases tend to contain two consecutive invalid bytes, a situation that 3-byte characters trigger easily.

Commit 1

hide_incomplete_unicode now uses self.model.tokenizer.encoding.decode_bytes_batch instead of self.model.tokenizer.split_tokens_on_unicode. Token decoding still occurs once per iteration.
The replacement-character (�) check now runs per token, and every token in which it is detected is buffered.
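As a rough illustration of the per-token byte check (not the code in this PR), the sketch below takes a list of per-token byte strings, standing in for what decode_bytes_batch returns, and holds back every token after the last one that ends on a UTF-8 character boundary. The function name split_complete_prefix is hypothetical, and it uses Python's incremental UTF-8 decoder rather than the replacement-character check described above.

```python
import codecs


def split_complete_prefix(token_bytes):
    """Hypothetical helper: return (text, held).

    text is the decoded longest prefix of tokens that ends on a UTF-8
    character boundary; held is the list of remaining per-token bytes,
    to be retried once more tokens arrive.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = []
    complete_upto = 0  # number of leading tokens that end on a boundary
    for i, raw in enumerate(token_bytes):
        pieces.append(decoder.decode(raw))
        pending, _ = decoder.getstate()  # bytes of an unfinished character
        if not pending:
            complete_upto = i + 1
    return "".join(pieces[:complete_upto]), token_bytes[complete_upto:]


# "日本語" is 9 bytes in UTF-8; here it is split across tokens mid-character.
tokens = [b"\xe6\x97\xa5", b"\xe6\x9c", b"\xac", b"\xe8\xaa"]
text, held = split_complete_prefix(tokens)

# On the next iteration the held bytes are prepended to the new tokens.
text2, held2 = split_complete_prefix(held + [b"\x9e"])
```

Holding whole tokens rather than individual bytes keeps the held-back items aligned with token indices, which matters when each token is later paired with a frame.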

Commit 2

After Commit 1, I observed “pop from empty list” exceptions from frames.pop(0) more frequently.
I found that timestamped_text adds frames depending on the length of self.unicode_buffer. Previously, the untranslated part of generation[result][frames] often compensated for this, but the larger buffer made the behavior unstable. This commit adds a further fix:

Introduced a new variable, frame_buffer, that records the length of unicode_buffer so that frames can be padded accordingly.
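The description above does not show how the padding is done, so the following is only a sketch of the general idea under my own assumptions: if more tokens than frames are available, frames.pop(0) eventually runs on an empty list, and extending the frame list to the token count avoids that. The helper pad_frames and its fill strategy (repeat the last known frame) are hypothetical and may differ from what frame_buffer does in the commit.

```python
def pad_frames(frames, n_tokens, fill=None):
    """Hypothetical helper: return a copy of frames with at least n_tokens entries.

    Missing entries repeat the last known frame, or `fill` when frames is empty.
    """
    padded = list(frames)
    last = padded[-1] if padded else fill
    # A negative count yields an empty list, so longer inputs are unchanged.
    padded.extend([last] * (n_tokens - len(padded)))
    return padded


tokens = ["a", "b", "c"]   # e.g. held-back tokens plus newly decoded ones
frames = [10]              # fewer frames than tokens

padded = pad_frames(frames, len(tokens))
# pop(0) is now safe for every token.
pairs = [(tok, padded.pop(0)) for tok in tokens]
```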

Commit 3

iteration_output sometimes contains only is_final and emission_time, which caused a KeyError.
Example: {"is_final": false, "emission_time": 57.90499949455261}

Added a guard to ensure iteration_output has at least the start key before accessing it.
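A minimal sketch of such a guard, assuming iteration_output is a plain dict; the helper name has_timing and the end and text keys in the second record are assumptions for illustration, not taken from the PR:

```python
import json


def has_timing(iteration_output):
    """Hypothetical guard: True only when the record carries a 'start' key."""
    return isinstance(iteration_output, dict) and "start" in iteration_output


# The first record is the example above; the second is a made-up full record.
records = [
    json.loads('{"is_final": false, "emission_time": 57.90499949455261}'),
    {"start": 1.0, "end": 2.5, "text": "こんにちは",
     "is_final": False, "emission_time": 58.1},
]

# Only records that pass the guard are safe to index with ["start"].
timed = [r for r in records if has_timing(r)]
```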

Result

With these fixes applied, SimulStreaming worked without issues during a 10-hour field test in a Japanese environment.

References

Tokenizer may cause "string index out of range" on Japanese #31

P.S.

This is my first time using the “git” side of GitHub. If I made any mistakes in the PR, I’d appreciate any guidance.

Saito-K03 added 3 commits February 27, 2026 13:50

hide_incomplete_unicode now works with 3 (or more) bytes characters

8c35eb7

Fix problem where frame buffer was added on wrong iteration, causing "pop from empty list"

f7ed3f3

Add non-json output tolerance against empty output

6834b41

Saito-K03 wants to merge 3 commits into ufal:main from Saito-K03:Multibyte-char
