Fix bug in strip incomplete words of StreamAtt policy #33

Open
sarapapi wants to merge 4 commits into main from fix_strip_incomplete_words

Conversation

@sarapapi
Contributor

Closes #32

@sarapapi requested a review from mgaido91 on March 27, 2026, 10:37
@sarapapi self-assigned this on Mar 27, 2026
@sarapapi added the bug label (Something isn't working) on Mar 27, 2026

-    @staticmethod
-    def _strip_incomplete_words(tokens: List[str]) -> List[str]:
+    def _strip_incomplete_words(self, tokens: List[str]) -> List[str]:
Contributor

Suggested change
-    def _strip_incomplete_words(self, tokens: List[str]) -> List[str]:
+    @staticmethod
+    def _strip_incomplete_words(tokens: List[str]) -> List[str]:

Comment on lines +196 to +208
+        # Some tokenizers emit a trailing empty token after punctuation/EOS; drop it first so
+        # complete outputs like [" output", ".", ""] are not mistaken for incomplete words
+        while tokens and tokens[-1] == "":
+            tokens = tokens[:-1]
+
+        if not tokens:
+            return []
+
+        last_token = tokens[-1].strip()
+        # If the hypothesis already ends with punctuation, keep it as a complete segment
+        if last_token and last_token[-1] in STRONG_PUNCTUATION:
+            return tokens

Contributor

I think we should not assume this is the right thing to do. We can try and add this variant but at the very minimum I'd add a flag in the configuration to control this.
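
For illustration only, here is a minimal sketch of how such a flag could gate the new early-exit check. The flag name `keep_punctuated_tail`, the helper `precheck_tail`, and the contents of `STRONG_PUNCTUATION` are assumptions made for this example and are not taken from the repository. The rest of `_strip_incomplete_words` is not shown in the diff, so it is left out here.

```python
from typing import List, Tuple

# Assumption: a stand-in for the repository's STRONG_PUNCTUATION constant,
# whose real contents are not visible in this diff.
STRONG_PUNCTUATION = {".", "!", "?", ":", ";"}


def precheck_tail(tokens: List[str], keep_punctuated_tail: bool) -> Tuple[List[str], bool]:
    """Hypothetical helper: return (tokens, done).

    ``done`` is True when the caller can return ``tokens`` as-is and skip the
    existing incomplete-word stripping. With the flag off, the input is
    returned untouched so the previous behaviour is preserved.
    """
    if not keep_punctuated_tail:
        return tokens, False

    # Drop trailing empty tokens, as in the proposed change.
    while tokens and tokens[-1] == "":
        tokens = tokens[:-1]

    if not tokens:
        return [], True

    last_token = tokens[-1].strip()
    # A hypothesis ending in strong punctuation is treated as complete.
    if last_token and last_token[-1] in STRONG_PUNCTUATION:
        return tokens, True

    return tokens, False


print(precheck_tail([" output", ".", ""], keep_punctuated_tail=True))
print(precheck_tail([" output", ".", ""], keep_punctuated_tail=False))
```

With the flag enabled, the trailing empty token is dropped and the punctuated hypothesis is reported as complete. With it disabled, the tokens pass through unchanged and the caller falls back to the existing stripping logic.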

Development

Successfully merging this pull request may close these issues.

_strip_incomplete_words cuts complete words

2 participants