integrated expand date function into the preprocessing function #107

Open
AdiistheGoat wants to merge 1 commit into KittenML:main from AdiistheGoat:main

AdiistheGoat commented Mar 9, 2026

Summary

This PR adds numeric date expansion to the preprocessing pipeline so common date formats are converted into a natural spoken form before inference. This prevents the model from reading dates character-by-character (e.g., “zero three slash…”) and improves pronunciation.

Audio comparison

  • test_date_raw.wav: output with the current preprocessing script
  • test_date.wav: output with the updated preprocessing script

Attachments:


Changes

1) New regex patterns (around lines 215–220)

Added three standalone date matchers (with lookbehind/lookahead to avoid matching parts of larger numbers):

  • _RE_DATE_slash: slash-separated dates (D/M/YY, DD/MM/YYYY)
  • _RE_DATE_hyphen: hyphen-separated dates (D-M-YY, DD-MM-YYYY)
  • _RE_DATE_ISO: ISO-style dates (YYYY-M-D, YYYY-MM-DD)
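
For readers who want to see what such matchers could look like, here is a minimal sketch. The PR's actual regexes are not quoted in this description, so the pattern bodies and the upper-case names below are assumptions made for illustration. The digit lookbehind and lookahead are what keep a match from starting or ending inside a longer run of digits.

```python
import re

# Hypothetical patterns (not copied from the PR).
# (?<!\d) : the match may not be preceded by a digit
# (?!\d)  : the match may not be followed by a digit
RE_DATE_SLASH = re.compile(r"(?<!\d)(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})(?!\d)")
RE_DATE_HYPHEN = re.compile(r"(?<!\d)(\d{1,2})-(\d{1,2})-(\d{4}|\d{2})(?!\d)")
RE_DATE_ISO = re.compile(r"(?<!\d)(\d{4})-(\d{1,2})-(\d{1,2})(?!\d)")

if __name__ == "__main__":
    for sample in ["03/14/2022", "14-03-2022", "2022-03-14", "1203/14/2022"]:
        hits = [p.pattern for p in (RE_DATE_SLASH, RE_DATE_HYPHEN, RE_DATE_ISO)
                if p.search(sample)]
        print(sample, "matched" if hits else "not matched")
```

Because the lookarounds only exclude neighbouring digits, separators such as spaces or dots inside the date (for example `03 / 14 / 2022` or `192.168.03.14`) simply fail to match in this sketch.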

2) New expand_dates(text) helper

Introduced expand_dates(text) to normalize numeric date formats into a spoken, human-readable form.

Examples

  • 03/14/2022 → 14th March twenty twenty-two
  • 2022-03-14 → 14th March twenty twenty-two

Supported formats

  • Slash style: D/M/YY, DD/MM/YYYY (also handles M/D/... where applicable)
  • Hyphen style: D-M-YY, DD-MM-YYYY
  • ISO style: YYYY-MM-DD

Day/month resolution (ambiguity handling)
Numeric dates like 03/04/2022 are ambiguous. Resolution rules:

  • If one of the first two components is > 12, that component must be the day
  • If both are <= 12, default to DD/MM
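
The two rules above can be sketched as a small helper. The function name `resolve_day_month` is hypothetical and the PR may structure this logic differently; the sketch only illustrates the stated rules for slash and hyphen dates.

```python
from typing import Tuple


def resolve_day_month(first: int, second: int) -> Tuple[int, int]:
    """Return (day, month) for the first two numeric components.

    Hypothetical helper illustrating the rules in the PR description.
    """
    if first > 12 and second <= 12:
        return first, second   # e.g. 14/03 -> day 14, month 3
    if second > 12 and first <= 12:
        return second, first   # e.g. 03/14 -> day 14, month 3
    # Both <= 12 is ambiguous, so default to DD/MM.
    # Both > 12 also lands here and is left for later validation to reject.
    return first, second


if __name__ == "__main__":
    print(resolve_day_month(3, 14))   # (14, 3)
    print(resolve_day_month(3, 4))    # (3, 4)
```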

3) Integrated into the pipeline (around line 910)

Added expand_dates(text) at the beginning of TextPreprocessor.process() so dates are expanded before other transformations.


Validation & safety behavior

To avoid accidental or unsafe rewrites, the function applies basic validation before converting:

  • If day == 0 or month == 0, leave the original substring unchanged
  • Range checks:
    • day <= 31
    • month <= 12
  • No full calendar validation (e.g., 30/02/2022 is not rejected at the calendar level)
  • If a match fails validation, the original text is preserved unchanged
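
As an illustration of these checks, a range-only validator might look like the sketch below. The name `is_plausible_date` is an assumption, not an identifier from the PR. It deliberately ignores month lengths and leap years, matching the "no full calendar validation" behavior described above.

```python
def is_plausible_date(day: int, month: int) -> bool:
    """Range check only: 1..31 for the day, 1..12 for the month.

    Hypothetical helper; it does not know that February has fewer than
    30 days, so (30, 2) is accepted.
    """
    return 1 <= day <= 31 and 1 <= month <= 12


if __name__ == "__main__":
    print(is_plausible_date(30, 2))   # True: calendar validity is not checked
    print(is_plausible_date(0, 12))   # False: a zero day is rejected
```

In a `re.sub` callback, a failed check would typically return `match.group(0)` so the original substring passes through untouched.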

Year handling

  • Two-digit years are normalized as 2000 + year
    • Example: 22 → 2022
  • The normalized year is converted to words via the existing number_to_words() helper
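
The two-digit rule can be sketched as follows. `normalize_year` is a hypothetical name, and the sketch stops before the conversion to words because the signature of the existing number_to_words() helper is not shown in this description.

```python
def normalize_year(year_text: str) -> int:
    """Map a matched year string to a four-digit year.

    Hypothetical helper: two-digit years are treated as 2000 + year,
    as stated in the PR description; longer years are used as-is.
    """
    year = int(year_text)
    if len(year_text) == 2:
        return 2000 + year
    return year


if __name__ == "__main__":
    print(normalize_year("22"))     # 2022
    print(normalize_year("99"))     # 2099
    print(normalize_year("2022"))   # 2022
```

One consequence of this rule is that years such as 99 or 68 always land in the 2000s, with no pivot to the 1900s.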

Output format

All supported inputs normalize to the canonical spoken form:

{ordinal_day} {MonthName} {year_in_words}

Example:

  • 14/03/2022 → 14th March twenty twenty-two

Supporting mappings introduced:

  • day_mappings: day number → ordinal (e.g., 1 -> 1st, 2 -> 2nd, …)
  • month_mappings: month number → month name (e.g., 1 -> January, 2 -> February, …)
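
The description names day_mappings and month_mappings but does not show their contents. One way such tables could be built and combined into the output format is sketched below; the helpers `ordinal` and `format_spoken_date` are hypothetical, and the year is passed in as already-converted words.

```python
def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix, e.g. 21 -> '21st'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


# Assumed shape of the mappings: plain dicts keyed by number.
day_mappings = {d: ordinal(d) for d in range(1, 32)}
month_mappings = dict(enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"],
    start=1,
))


def format_spoken_date(day: int, month: int, year_words: str) -> str:
    """Assemble '{ordinal_day} {MonthName} {year_in_words}'."""
    return f"{day_mappings[day]} {month_mappings[month]} {year_words}"


if __name__ == "__main__":
    print(format_spoken_date(14, 3, "two thousand twenty-two"))
```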

Why this matters

This update makes text more TTS-friendly by converting numeric date formats into natural language, improving intelligibility and overall output quality.


Date Expansion Examples

The expand_dates(text) function converts supported numeric date formats into a spoken form.

Common cases

| Input | Output |
| --- | --- |
| 03/14/2022 | 14th March two thousand twenty-two |
| 2022-03-14 | 14th March two thousand twenty-two |
| 14/03/2022 | 14th March two thousand twenty-two |
| 14-03-2022 | 14th March two thousand twenty-two |
| 03/14/22 | 14th March two thousand twenty-two |
| 2022-03-21 | 21st March two thousand twenty-two |

Ambiguous date handling

When both the first and second numeric parts are <= 12, the function defaults to DD/MM.

| Input | Output |
| --- | --- |
| 03/04/2022 | 3rd April two thousand twenty-two |
| 05/04/2022 | 5th April two thousand twenty-two |
| 04/05/2022 | 4th May two thousand twenty-two |
| 01/02/03 | 1st February two thousand three |

Mixed format support

| Input | Output |
| --- | --- |
| 7/8/22 | 7th August two thousand twenty-two |
| 08/07/2022 | 8th July two thousand twenty-two |
| 2022-08-07 | 7th August two thousand twenty-two |
| 03-17-2022 | 17th March two thousand twenty-two |

Invalid or protected cases

These cases are intentionally left unchanged when they should not be treated as dates or do not pass validation.

| Input | Output |
| --- | --- |
| 192.168.03.14 | 192.168.03.14 |
| 03/14 | 03/14 |
| 00/12/2022 | 00/12/2022 |
| 12/00/2022 | 12/00/2022 |
| 2022-00-10 | 2022-00-10 |
| 03 / 14 / 2022 | 03 / 14 / 2022 |
| 2022 - 03 - 14 | 2022 - 03 - 14 |

Edge cases

| Input | Output |
| --- | --- |
| 02/30/2022 | 30th February two thousand twenty-two |
| 02/29/2024 | 29th February two thousand twenty-four |
| 02/29/2023 | 29th February two thousand twenty-three |
| 03/14/99 | 14th March two thousand ninety-nine |
| 03/14/00 | 14th March two thousand |
| 03/14/68 | 14th March two thousand sixty-eight |
| run-2022-03-14-0007 | run-14th March two thousand twenty-two-0007 |

Krystal5222 commented Mar 9, 2026 via email

AdiistheGoat commented Mar 9, 2026

Hey @Krystal5222, thanks for raising this. I may not have explained it clearly enough in the PR description, but I’m happy to update it later today to make it more readable and add more examples as well. In the meantime, the class comments may also provide some additional context. Let me know if there’s anything specific you’d like me to clarify.

Edit: I have improved the PR description. Lmk if something doesn't make sense.
