Thank you for sharing TaDiCodec! It's impressive work, and I've been experimenting with training the speech language model on my private dataset.
Observation
While studying the project structure to make sure my training data aligns with your best practices, I noticed that `requirements.txt` includes several text-processing libraries such as `cn2an`, `pypinyin`, `g2p_en`, and `jieba`. This suggests that the training text underwent a specific Text Normalization (TN) step before being fed into the tokenizer.
However, the current inference script (`inference_tadicodec.py`) appears to pass raw text directly to the `AutoTokenizer` without this normalization step:
https://github.com/AmphionTeam/TaDiCodec/blob/main/models/tts/tadicodec/inference_tadicodec.py#L223
The Issue
This creates a distribution mismatch between training and inference. For example, if the training data had "2024" converted to "two thousand and twenty-four" (or its Chinese equivalent) but the inference path receives the raw digits "2024", model performance drops significantly, since the model never saw raw digits during training.
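To make the mismatch concrete, here is a minimal, self-contained sketch of the kind of number spell-out a TN step might perform on English text. This is purely illustrative and is not TaDiCodec's actual pipeline (which presumably uses `cn2an` and friends for Chinese); the function names are my own.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer in [0, 9999] in English words (illustrative only)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    if n < 1000:
        head, rest = ONES[n // 100] + " hundred", n % 100
        return head + (" and " + number_to_words(rest) if rest else "")
    head, rest = ONES[n // 1000] + " thousand", n % 1000
    return head + (" and " + number_to_words(rest) if rest else "")

def normalize_text(text: str) -> str:
    """Replace each digit run (up to 4 digits) with its spelled-out form."""
    return re.sub(r"\d{1,4}", lambda m: number_to_words(int(m.group())), text)

print(normalize_text("The model was released in 2024."))
# → The model was released in two thousand and twenty-four.
```

If training text went through something like `normalize_text` but inference text does not, the tokenizer sees token sequences at inference time that never occurred in training, which is exactly the mismatch described above.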
Request & Proposal
I understand that releasing the full data processing pipeline might take time. However, could you kindly share the logic or rules used for Text Normalization during your training phase? (e.g., how do you handle numbers and dates with `cn2an`? Is there a separate G2P step?)
I would be more than happy to implement this normalization logic in the inference pipeline and submit a Pull Request to contribute back to the repository. This would greatly help the community achieve better inference results out of the box.
Thank you for your time and guidance!