Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages
Description
In several supported languages (e.g., Italian, French, Spanish, English), certain small number words (such as “one”, “un”, “une”, “uno”, “una”) can act either as:
- numeric words (value = 1), or
- indefinite articles or determiners (meaning a/an).
At the moment, alpha2digit only converts these words to digits if they are part of a recognized numeric group, or if the global threshold is set low enough (e.g., threshold=0).
This behavior can lead to unintuitive results when such words appear inside mixed numeric sequences (containing both digits and number words).
Example
from text_to_num import alpha2digit
text = "The code is 7 one 8 0."
print(alpha2digit(text, "en"))
Current output:
The code is 7 one 8 0.
Expected / Desired output:
The code is 7 1 8 0.
Here, “one” is clearly part of a numeric sequence, but it remains unconverted because it’s treated as an isolated number word below the threshold.
Rationale
In many real-world scenarios (mainly ASR transcripts), numeric sequences mix digits and number words.
When a small number word (value ≤ 3) appears adjacent to digits or other number words, it can reasonably be interpreted as part of that sequence and should be converted, regardless of the threshold parameter.
This change would:
- Improve conversion accuracy across languages,
- Better reflect numeric context in mixed sequences,
- Maintain backward compatibility for truly isolated uses (e.g., “I have one apple”).
Proposed enhancement
Extend alpha2digit’s logic so that:
- If a numeric word is adjacent to digits or other number words, treat it as part of a numeric sequence and bypass the threshold check.
- Otherwise, keep the current behavior (use threshold to decide conversion).
Examples
| Input | Language | Current | Desired |
| --- | --- | --- | --- |
| 7 one 8 | en | 7 one 8 | 7 1 8 |
| 7 un 8 | fr | 7 un 8 | 7 1 8 |
| 7 uno 8 | it | 7 uno 8 | 7 1 8 |
| I have one apple | en | ✅ correct | ✅ same |
| J’ai une pomme | fr | ✅ correct | ✅ same |
Implementation idea
Add a lightweight heuristic to alpha2digit:
“If a small numeric word (value ≤ threshold) is adjacent to a digit or another number word, treat it as part of a numeric group and convert it.”
This would make conversions more robust across all supported languages without breaking existing semantics for isolated words.
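The heuristic above can be sketched as a standalone token pass. This is only an illustration of the adjacency rule, not how text_to_num is implemented: the `SMALL_WORDS` table and the `convert_adjacent_small_words` function are hypothetical names, the vocabulary is a small hand-picked sample, and tokenization is a plain whitespace split that only strips trailing or leading sentence punctuation.

```python
from typing import Mapping

# Hypothetical sample vocabulary for illustration; NOT the tables used by text_to_num.
SMALL_WORDS = {
    "en": {"one": 1, "two": 2, "three": 3},
    "fr": {"un": 1, "une": 1, "deux": 2, "trois": 3},
    "it": {"uno": 1, "una": 1, "due": 2, "tre": 3},
}

PUNCT = ".,;:!?"


def _is_numeric(token: str, words: Mapping[str, int]) -> bool:
    """True if the token is a digit string or a known small number word."""
    core = token.strip(PUNCT)
    return core.isdigit() or core.lower() in words


def convert_adjacent_small_words(text: str, lang: str) -> str:
    """Convert small number words only when a neighbouring token is numeric."""
    words = SMALL_WORDS[lang]
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.strip(PUNCT)
        if core.lower() in words:
            prev_num = i > 0 and _is_numeric(tokens[i - 1], words)
            next_num = i + 1 < len(tokens) and _is_numeric(tokens[i + 1], words)
            if prev_num or next_num:
                # Replace only the word itself, keeping attached punctuation.
                tok = tok.replace(core, str(words[core.lower()]), 1)
        out.append(tok)
    # Note: joining with single spaces collapses any original extra whitespace.
    return " ".join(out)


if __name__ == "__main__":
    print(convert_adjacent_small_words("The code is 7 one 8 0.", "en"))
    print(convert_adjacent_small_words("I have one apple", "en"))
```

A pure adjacency check like this can still misfire on phrases such as "I bought 3 one day", so a real implementation would probably need to respect sentence boundaries or expose the behavior behind an opt-in flag.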
Motivation / Use Case
This improvement would significantly benefit real-world applications such as:
- ASR (Automatic Speech Recognition) post-processing,
- Data cleaning pipelines where numeric tokens are mixed or inconsistent.
Potential test cases
from text_to_num import alpha2digit

def test_mixed_sequence_en():
    assert alpha2digit("7 one 8 0", "en") == "7 1 8 0"

def test_mixed_sequence_fr():
    assert alpha2digit("7 un 8 0", "fr") == "7 1 8 0"

def test_mixed_sequence_it():
    assert alpha2digit("7 uno 8 0", "it") == "7 1 8 0"

def test_isolated_article_en():
    assert alpha2digit("I have one apple", "en") == "I have one apple"
Optional: with a mode/flag or heuristic that bypasses threshold when adjacent to numeric tokens
Label: Enhancement