[Inquiry] Inquiry regarding Text Normalization (TN) workflow for training consistency #2

@qianwan

Thank you for sharing TaDiCodec! It's impressive work, and I've been experimenting with training the speech language model on my private dataset.

Observation

While studying the project structure to make sure my training data follows your best practices, I noticed that requirements.txt includes several text-processing libraries, such as cn2an, pypinyin, g2p_en, and jieba. This suggests that the training data underwent a specific Text Normalization (TN) process before being fed into the tokenizer.

However, the current inference script (inference_tadicodec.py) appears to pass raw text directly to the AutoTokenizer without any such normalization step:

https://github.com/AmphionTeam/TaDiCodec/blob/main/models/tts/tadicodec/inference_tadicodec.py#L223

The Issue

If so, this would create a distribution mismatch between training and inference. For example, if the training data has "2024" converted to "two thousand and twenty-four" (or the Chinese equivalent) but the inference script receives the raw digits "2024", output quality is likely to degrade because the model would rarely, if ever, have seen raw digits during training.
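To make the example concrete, here is a minimal sketch of the kind of number-verbalization rule I have in mind. It is not taken from the TaDiCodec code base: `int_to_words` and `normalize_text` are hypothetical names, the sketch covers English only, and it uses the standard library rather than cn2an or g2p_en. The actual rules used for training may well differ.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out an integer 0 <= n < 1_000_000 in English (hypothetical rule)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = _ONES[hundreds] + " hundred"
        return head + (" and " + int_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    head = int_to_words(thousands) + " thousand"
    if rest == 0:
        return head
    # "two thousand and twenty-four", but "one thousand five hundred"
    joiner = " and " if rest < 100 else " "
    return head + joiner + int_to_words(rest)


def normalize_text(text: str) -> str:
    """Replace runs of ASCII digits with their spelled-out form.

    Assumption: numbers with more than six digits are left untouched,
    and leading zeros are dropped ("007" becomes "seven").
    """
    def _replace(match: re.Match) -> str:
        digits = match.group()
        if len(digits) > 6:
            return digits
        return int_to_words(int(digits))

    return re.sub(r"[0-9]+", _replace, text)


print(normalize_text("Released in 2024."))
```

With a rule like this applied at both training and inference time, the tokenizer would see the same surface form for "2024" in both settings.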

Request & Proposal

I understand that releasing the full data processing pipeline might take time. However, could you kindly share the logic or rules used for Text Normalization during your training phase? (e.g., How do you handle numbers/dates with cn2an? Is there a specific G2P step?)

I would be more than happy to add this normalization logic to the inference pipeline and submit a Pull Request to contribute back to the repository. This would help the community get better inference results out of the box.
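As a rough idea of how such a contribution could be structured, the sketch below chains normalization rules and applies them before tokenization. Everything in it is an assumption on my side: `build_preprocessor`, `tokenize_with_tn`, and the two placeholder rules are hypothetical, and `str.split` merely stands in for the real tokenizer so the snippet runs on its own.

```python
from typing import Callable, Iterable, List

Normalizer = Callable[[str], str]


def build_preprocessor(steps: Iterable[Normalizer]) -> Normalizer:
    """Compose normalization rules into one function, applied in order."""
    ordered = list(steps)

    def preprocess(text: str) -> str:
        for step in ordered:
            text = step(text)
        return text

    return preprocess


def tokenize_with_tn(
    text: str,
    tokenizer: Callable[[str], List[str]],
    preprocess: Normalizer,
) -> List[str]:
    """Normalize the text first, then hand it to the tokenizer."""
    return tokenizer(preprocess(text))


# Placeholder rules; the real ones would mirror the training-time TN.
def expand_percent(text: str) -> str:
    return text.replace("%", " percent")


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


preprocess = build_preprocessor([expand_percent, collapse_whitespace])
# str.split is a stand-in for the actual tokenizer call.
tokens = tokenize_with_tn("Accuracy  rose 5%", str.split, preprocess)
print(tokens)
```

Keeping the rules as a plain ordered list would make it easy to swap in the exact training-time rules once they are known.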

Thank you for your time and guidance!
