Thank you for sharing TaDiCodec! It's impressive work, and I've been experimenting with training the speech language model on my private dataset.
Observation
While studying the project structure to make sure my training data aligns with your best practices, I noticed that `requirements.txt` includes several text-processing libraries such as `cn2an`, `pypinyin`, `g2p_en`, and `jieba`. This suggests that the training text underwent a specific Text Normalization (TN) step before being fed into the tokenizer.
However, the current inference script (`inference_tadicodec.py`) appears to pass raw text directly to the `AutoTokenizer` without this normalization step:
https://github.com/AmphionTeam/TaDiCodec/blob/main/models/tts/tadicodec/inference_tadicodec.py#L223
The Issue
This creates a distribution mismatch between training and inference. For example, if the training data had "2024" converted to "two thousand and twenty-four" (or its Chinese equivalent) but the inference path receives the raw digits "2024", model performance drops significantly, since the model never saw raw digits during training.
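To make the mismatch concrete, here is a minimal, self-contained sketch of the kind of number spell-out a TN step might perform on English text. This is purely illustrative and is not TaDiCodec's actual pipeline (which presumably uses `cn2an` and friends for Chinese); the function names are my own.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer in [0, 9999] in English words (illustrative only)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    if n < 1000:
        head, rest = ONES[n // 100] + " hundred", n % 100
        return head + (" and " + number_to_words(rest) if rest else "")
    head, rest = ONES[n // 1000] + " thousand", n % 1000
    return head + (" and " + number_to_words(rest) if rest else "")

def normalize_text(text: str) -> str:
    """Replace each digit run (up to 4 digits) with its spelled-out form."""
    return re.sub(r"\d{1,4}", lambda m: number_to_words(int(m.group())), text)

print(normalize_text("The model was released in 2024."))
# → The model was released in two thousand and twenty-four.
```

If training text went through something like `normalize_text` but inference text does not, the tokenizer sees token sequences at inference time that never occurred in training, which is exactly the mismatch described above.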
Request & Proposal
I understand that releasing the full data processing pipeline might take time. However, could you kindly share the logic or rules used for Text Normalization during your training phase? (e.g., how do you handle numbers and dates with `cn2an`? Is there a separate G2P step?)
I would be more than happy to implement this normalization logic in the inference pipeline and submit a Pull Request to contribute back to the repository. This would greatly help the community achieve better inference results out of the box.
Thank you for your time and guidance!