Handle multibyte characters correctly in process (fixes #31) #35

Open
Saito-K03 wants to merge 3 commits into ufal:main from Saito-K03:Multibyte-char

Summary

This PR addresses issue #31 by checking raw bytes to prevent failures caused by multibyte (3-byte) characters.

Background

In some cases, text containing 3-byte characters (e.g., Japanese) could be processed incorrectly and lead to an “index out of range” error.
These cases tend to include two consecutive illegal bytes, and 3-byte characters can easily trigger this situation.
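As a minimal illustration of the failure mode (not code from this PR), the snippet below splits the three UTF-8 bytes of one Japanese character at an arbitrary boundary. The split point is an assumption made for the example; where a real tokenizer cuts the bytes depends on its vocabulary.

```python
# Illustration only: "日" occupies three bytes in UTF-8.
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character

raw = "日".encode("utf-8")  # b'\xe6\x97\xa5'

# Assumed split: one token carries the lead byte, the next carries the
# two continuation bytes. Neither piece is valid UTF-8 on its own.
head, tail = raw[:1], raw[1:]

head_text = head.decode("utf-8", errors="replace")
tail_text = tail.decode("utf-8", errors="replace")

# Decoding the rejoined bytes recovers the original character.
whole_text = (head + tail).decode("utf-8")

print(repr(head_text), repr(tail_text), repr(whole_text))
```

Each piece decodes to replacement characters when handled alone, which is why the check has to look at raw bytes rather than at already-decoded text.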

Commit 1

  • hide_incomplete_unicode now uses self.model.tokenizer.encoding.decode_bytes_batch instead of self.model.tokenizer.split_tokens_on_unicode. Token decoding still occurs once per iteration.
  • The replacement-character (U+FFFD) check now runs per token, and all detected tokens are buffered.
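The per-token check and buffering described above can be sketched roughly as follows. This is a simplified stand-in, not the PR's implementation: `split_complete_and_pending` is a hypothetical helper, and the per-token byte strings are assumed to come from a call such as `decode_bytes_batch`.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def split_complete_and_pending(token_bytes, pending=b""):
    """Hypothetical helper: return (decoded_text, still_pending_bytes).

    token_bytes is a list of byte strings, one per token. Bytes that do
    not yet form complete UTF-8 are held back instead of being emitted.
    """
    out = []
    for chunk in token_bytes:
        data = pending + chunk
        decoded = data.decode("utf-8", errors="replace")
        if REPLACEMENT in decoded:
            # Incomplete so far: keep buffering until more bytes arrive.
            pending = data
        else:
            out.append(decoded)
            pending = b""
    return "".join(out), pending


# The character "日" (b'\xe6\x97\xa5') is split across two tokens here.
text, pending = split_complete_and_pending([b"abc", b"\xe6\x97"])
text2, pending2 = split_complete_and_pending([b"\xa5", b"!"], pending)
```

One limitation of this sketch: text that genuinely contains U+FFFD would stay buffered indefinitely, so a real implementation needs some way to flush it.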

Commit 2

After Commit 1, I observed an increased frequency of "pop from empty list" exceptions raised by frames.pop(0).
I found that timestamped_text adds frames depending on the length of self.unicode_buffer. Previously, the untranslated part of generation[result][frames] often compensated for this, but the increased buffer size made the behavior unstable. This commit adds an additional fix:

  • Introduced a new variable frame_buffer to record the length of unicode_buffer and pad/fill frames accordingly.
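A rough sketch of the padding idea, under stated assumptions: `pad_frames` and its fill policy (repeat the last known frame, or 0 if there is none) are hypothetical and chosen only for illustration. The PR's actual `frame_buffer` logic may differ.

```python
def pad_frames(frames, needed, fill=None):
    """Hypothetical helper: return a copy of frames with at least `needed` entries.

    Missing entries are filled with `fill`; when `fill` is None, the last
    existing frame is repeated (or 0 if frames is empty).
    """
    padded = list(frames)
    if fill is None:
        fill = padded[-1] if padded else 0
    # A non-positive multiplier yields an empty list, so nothing is added
    # when frames is already long enough.
    padded.extend([fill] * (needed - len(padded)))
    return padded


# Without padding, consuming more frames than exist raises
# IndexError: pop from empty list.
frames = pad_frames([3, 4], needed=4)
consumed = [frames.pop(0) for _ in range(4)]
```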

Commit 3

iteration_output sometimes contains only is_final and emission_time, which caused a KeyError.
Example: {"is_final": false, "emission_time": 57.90499949455261}

  • Added a guard to ensure iteration_output has at least the start key before accessing it.
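A minimal sketch of such a guard, assuming `iteration_output` is a plain dict. The helper name `has_timing` and the second sample dict are made up for this example; only the first dict mirrors the output quoted above.

```python
def has_timing(iteration_output):
    """Hypothetical helper: True only if the output carries a "start" key."""
    return isinstance(iteration_output, dict) and "start" in iteration_output


# Mirrors the example above: only is_final and emission_time are present.
partial = {"is_final": False, "emission_time": 57.90499949455261}

# Made-up output that does include timing information.
complete = {"start": 1.2, "end": 2.4, "text": "こんにちは", "is_final": False}

for output in (partial, complete):
    if not has_timing(output):
        continue  # skip instead of raising KeyError on output["start"]
    print(output["start"])
```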

Result

With these fixes applied, SimulStreaming worked without issues during a 10-hour field test in a Japanese environment.

References

P.S.

This is my first time using the “git” side of GitHub. If I made any mistakes in the PR, I’d appreciate any guidance.
