integrated expand date function into the preprocessing function #107
Open
AdiistheGoat wants to merge 1 commit into KittenML:main
What does all this mean? I am so clueless.
…On Mar 9, 2026 1:01 AM, "Aditya Goyal" ***@***.***> wrote:
test_date_raw.wav
<https://github.com/user-attachments/files/25832597/test_date_raw.wav>
test_date.wav
<https://github.com/user-attachments/files/25832598/test_date.wav>
test_date_raw.wav is the output with the current preprocessing script
test_date.wav is the output with the updated preprocessing script
Added Date Expansion Functionality
1. New Regex Patterns (around lines 215–220)
Added three regex patterns to match different date formats:
• _RE_DATE_slash: Matches slash-separated dates (D/M/YY or DD/MM/YYYY)
• _RE_DATE_hyphen: Matches hyphen-separated dates (D-M-YY or DD-MM-YYYY)
• _RE_DATE_ISO: Matches ISO format dates (YYYY-M-D or YYYY-MM-DD)
All patterns use lookbehind/lookahead assertions to ensure dates are
standalone (not part of larger numbers).
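As a rough illustration of the lookbehind/lookahead idea, the three patterns might look something like the sketch below. The constant names, the find_dates() helper, and the exact expressions are assumptions made for illustration; the patterns in kittentts/preprocess.py may differ.

```python
import re

# Hypothetical stand-ins for the PR's patterns; the real expressions may differ.
# (?<!\d) and (?!\d) stop a match from starting or ending inside a longer digit run.
RE_DATE_SLASH = re.compile(r"(?<!\d)(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})(?!\d)")
RE_DATE_HYPHEN = re.compile(r"(?<!\d)(\d{1,2})-(\d{1,2})-(\d{4}|\d{2})(?!\d)")
RE_DATE_ISO = re.compile(r"(?<!\d)(\d{4})-(\d{1,2})-(\d{1,2})(?!\d)")


def find_dates(text):
    """Return every substring matched by one of the three sketch patterns."""
    found = []
    for pattern in (RE_DATE_ISO, RE_DATE_SLASH, RE_DATE_HYPHEN):
        found.extend(m.group(0) for m in pattern.finditer(text))
    return found


if __name__ == "__main__":
    print(find_dates("Due 03/14/2022, logged 2022-03-14, ref 1203/14/20225"))
```

The year group tries four digits before two so that a full year is preferred, and the trailing lookahead rejects a date that is immediately followed by another digit.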
2. New expand_dates() Function (around lines 700–800)
Added a date expansion function that:
• Converts numeric dates to spoken words (e.g., "03/14/2022" → "14th March twenty twenty-two")
• Handles multiple formats: slash dates, hyphen dates, and ISO dates
• Validates dates: day (1–31), month (1–12), and year ranges
• Resolves ambiguity: when both parts are ≤12, defaults to DD/MM; when one part is >12, treats it as the day
• Normalizes 2-digit years: adds 2000 to years <100 (e.g., "22" → "2022")
• Preserves invalid dates by returning the original substring if validation fails
• Uses internal day_mappings (ordinal suffixes) and month_mappings (month names)
• Leverages existing number_to_words() for converting the year to words
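The ambiguity and two-digit-year rules listed above can be sketched as two small helpers. This is a minimal sketch, not the PR's code: resolve_day_month() and normalize_year() are hypothetical names, and rejecting the case where both components exceed 12 is an assumption, since the description does not say how that case is handled.

```python
def resolve_day_month(first, second):
    """Return (day, month) for the two leading components, or None if invalid."""
    if first == 0 or second == 0:
        return None  # zero is never a valid day or month
    if first > 12 and second > 12:
        return None  # assumption: neither component can be a month
    if first > 12:
        day, month = first, second  # e.g. 14/03 -> day 14, month 3
    elif second > 12:
        day, month = second, first  # e.g. 03/14 -> day 14, month 3
    else:
        day, month = first, second  # ambiguous: default to DD/MM
    return (day, month) if day <= 31 else None


def normalize_year(year):
    """Map a two-digit year into the 2000s; leave longer years untouched."""
    return 2000 + year if year < 100 else year


if __name__ == "__main__":
    print(resolve_day_month(3, 14), normalize_year(22))
```

Returning None for an invalid pair lets the caller leave the original substring in place, which matches the "preserves invalid dates" behavior described above.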
3. Integrated into Pipeline (around line 910)
Added an expand_dates(text) call at the beginning of
TextPreprocessor.process(), ensuring dates are expanded before other
transformations.
Summary
This update introduces date normalization that converts numeric date
formats into natural, spoken language—making the text more suitable for TTS
output.
------------------------------
You can view, comment on, or merge this pull request online at:
#107
Commit Summary
- cfc5653
<cfc5653>
integrated expand date function
File Changes
(1 file <https://github.com/KittenML/KittenTTS/pull/107/files>)
- *M* kittentts/preprocess.py
<https://github.com/KittenML/KittenTTS/pull/107/files#diff-2cde1eea64f167aa37f9d061f653f4ea01b0f6b644e2c7193141953e887b09d7>
(116)
Patch Links:
- https://github.com/KittenML/KittenTTS/pull/107.patch
- https://github.com/KittenML/KittenTTS/pull/107.diff
Author
Hey @Krystal5222, thanks for raising this. I may not have explained it clearly enough in the PR description, but I’m happy to update it later today to make it more readable and add more examples as well. In the meantime, the class comments may also provide some additional context. Let me know if there’s anything specific you’d like me to clarify.
Edit: I have improved the PR description. Lmk if something doesn't make sense.
Summary
This PR adds numeric date expansion to the preprocessing pipeline so common date formats are converted into a natural spoken form before inference. This prevents the model from reading dates character-by-character (e.g., “zero three slash…”) and improves pronunciation.
Audio comparison
• test_date_raw.wav: output with the current preprocessing script
• test_date.wav: output with the updated preprocessing script
Changes
1) New regex patterns (around lines 215–220)
Added three standalone date matchers (with lookbehind/lookahead to avoid matching parts of larger numbers):
• _RE_DATE_slash: slash-separated dates (D/M/YY, DD/MM/YYYY)
• _RE_DATE_hyphen: hyphen-separated dates (D-M-YY, DD-MM-YYYY)
• _RE_DATE_ISO: ISO-style dates (YYYY-M-D, YYYY-MM-DD)
2) New expand_dates(text) helper
Introduced expand_dates(text) to normalize numeric date formats into a spoken, human-readable form.
Examples
• 03/14/2022 → 14th March twenty twenty-two
• 2022-03-14 → 14th March twenty twenty-two
Supported formats
• D/M/YY, DD/MM/YYYY (also handles M/D/... where applicable)
• D-M-YY, DD-MM-YYYY
• YYYY-MM-DD
Day/month resolution (ambiguity handling)
Numeric dates like 03/04/2022 are ambiguous. Resolution rules:
• If either component is > 12, that component must be the day
• If both components are <= 12, default to DD/MM
3) Integrated into the pipeline (around line 910)
Added expand_dates(text) at the beginning of TextPreprocessor.process() so dates are expanded before other transformations.
Validation & safety behavior
To avoid accidental or unsafe rewrites, the function applies basic validation before converting:
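• If day == 0 or month == 0, leave the original substring unchanged
• Requires day <= 31
• Requires month <= 12
• Month lengths are not checked (30/02/2022 is not rejected at the calendar level)
Year handling
• Two-digit years are interpreted as 2000 + year (e.g., 22 → 2022)
• The year is converted to words with the existing number_to_words() helper
Output format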
All supported inputs normalize to the canonical spoken form:
{ordinal_day} {MonthName} {year_in_words}
Example: 14/03/2022 → 14th March twenty twenty-two
Supporting mappings introduced:
• day_mappings: day number → ordinal (e.g., 1 -> 1st, 2 -> 2nd, …)
• month_mappings: month number → month name (e.g., 1 -> January, 2 -> February, …)
Why this matters
This update makes text more TTS-friendly by converting numeric date formats into natural language, improving intelligibility and overall output quality.
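To make the {ordinal_day} {MonthName} {year_in_words} output format concrete, here is a minimal sketch of how the pieces could be assembled. It is not the PR's implementation: the PR uses explicit day_mappings and month_mappings tables, whereas this sketch computes the ordinal suffix, and ordinal(), spoken_date(), and MONTH_NAMES are hypothetical names. The year is passed in already spelled out because the signature of the repository's number_to_words() is not shown here.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def ordinal(day):
    """Return the day with its English ordinal suffix, e.g. 21 -> '21st'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def spoken_date(day, month, year_words):
    """Join the parts as '{ordinal_day} {MonthName} {year_in_words}'."""
    return f"{ordinal(day)} {MONTH_NAMES[month - 1]} {year_words}"


if __name__ == "__main__":
    print(spoken_date(14, 3, "two thousand twenty-two"))
```

A lookup table like the PR's day_mappings is just as valid for 31 entries; computing the suffix simply avoids maintaining the table by hand.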
Date Expansion Examples
The expand_dates(text) function converts supported numeric date formats into a spoken form.
Common cases
• 03/14/2022 → 14th March two thousand twenty-two
• 2022-03-14 → 14th March two thousand twenty-two
• 14/03/2022 → 14th March two thousand twenty-two
• 14-03-2022 → 14th March two thousand twenty-two
• 03/14/22 → 14th March two thousand twenty-two
• 2022-03-21 → 21st March two thousand twenty-two
Ambiguous date handling
When both the first and second numeric parts are <= 12, the function defaults to DD/MM.
• 03/04/2022 → 3rd April two thousand twenty-two
• 05/04/2022 → 5th April two thousand twenty-two
• 04/05/2022 → 4th May two thousand twenty-two
• 01/02/03 → 1st February two thousand three
Mixed format support
• 7/8/22 → 7th August two thousand twenty-two
• 08/07/2022 → 8th July two thousand twenty-two
• 2022-08-07 → 7th August two thousand twenty-two
• 03-17-2022 → 17th March two thousand twenty-two
Invalid or protected cases
These cases are intentionally left unchanged when they should not be treated as dates or do not pass validation.
• 192.168.03.14 → 192.168.03.14
• 03/14 → 03/14
• 00/12/2022 → 00/12/2022
• 12/00/2022 → 12/00/2022
• 2022-00-10 → 2022-00-10
• 03 / 14 / 2022 → 03 / 14 / 2022
• 2022 - 03 - 14 → 2022 - 03 - 14
Edge cases
• 02/30/2022 → 30th February two thousand twenty-two
• 02/29/2024 → 29th February two thousand twenty-four
• 02/29/2023 → 29th February two thousand twenty-three
• 03/14/99 → 14th March two thousand ninety-nine
• 03/14/00 → 14th March two thousand
• 03/14/68 → 14th March two thousand sixty-eight
• run-2022-03-14-0007 → run-14th March two thousand twenty-two-0007
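The 02/30/2022 and 02/29/2023 cases expand because validation only bounds the day and month. If stricter behavior were ever wanted, one possible approach (not part of this PR) is to let the standard library decide whether the date exists. The helper name is_calendar_date() is hypothetical.

```python
from datetime import date


def is_calendar_date(day, month, year):
    """Return True only if the day/month/year combination exists on the calendar."""
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for parts in [(30, 2, 2022), (29, 2, 2024), (29, 2, 2023)]:
        print(parts, is_calendar_date(*parts))
```

A caller could leave the original substring untouched whenever this check fails, in the same way the zero-day and zero-month cases are already left unchanged.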