Implementation of FlashSpeech. For full details, see our paper accepted to ACM MM 2024: FlashSpeech: Efficient Zero-Shot Speech Synthesis.
- This project is a modified version of Amphion's NaturalSpeech2, since the original code relies on some internal Microsoft tools.
- Environment Setup:

  ```bash
  bash env.sh
  ```
- I have replaced Amphion's `accelerate` with `lightning` because I encountered similar issues (related issue). Training with `lightning` is faster.
- Modify `ns2dataset.py` based on your data.
- This version has been tested on the LibriTTS dataset. Ensure you have the following data prepared in advance:
- Pitch
- Code
- Phoneme
- Duration
- Run the Training Script:

  ```bash
  bash egs/tts/NaturalSpeech2/run_train.sh
  ```
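Before launching training, it can help to sanity-check that the four precomputed feature types line up per utterance. The check below is a hypothetical sketch (the function name and list-based feature format are assumptions, not the repository's actual data loader); it verifies that per-phoneme durations cover exactly the number of frames in the pitch contour, which frame-level features such as pitch and codec codes must share.

```python
# Hedged sketch: a pre-training sanity check for one utterance's features.
# Feature representations (plain lists, frame-level pitch, frames-per-phoneme
# durations) are assumptions about the prepared data, not the repo's format.
def check_alignment(pitch, phoneme, duration):
    """Verify the precomputed features of one utterance are mutually consistent."""
    # One duration entry per phoneme.
    if len(phoneme) != len(duration):
        raise ValueError("expected one duration per phoneme")
    # Durations are frame counts, so they must sum to the pitch frame count.
    if sum(duration) != len(pitch):
        raise ValueError("durations must sum to the number of frames")
    return True
```

Running a check like this over the whole dataset before training catches alignment bugs early, instead of failing mid-epoch inside the dataloader.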
Important Notes:
- Choose Configuration:
  - You can select either `***_s1` or `***_s2` configuration files based on the training stage.
- Modify Model Codec:
  - In `models/tts/naturalspeech2/flashspeech.py`, update the codec to your own.
  - Adjust `self.latent_norm` to normalize the codec latent to unit standard deviation. (This step is crucial for training the consistency model.)
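The normalization step above can be sketched as follows. This is a heavily hedged illustration of what `self.latent_norm` amounts to, not the repository's actual implementation: a global scale is measured from a sample of your codec's latents, and latents are divided by it so the consistency model sees inputs at roughly unit standard deviation (and multiplied back before decoding).

```python
# Hedged sketch of latent normalization for a custom codec. The function
# names and the "single global std" choice are assumptions; the real code
# may store the constant differently.
import numpy as np

def fit_latent_norm(latents):
    """Estimate a global std over a sample of codec latents."""
    return float(np.std(np.concatenate([l.ravel() for l in latents])))

def normalize_latent(latent, latent_norm):
    """Scale latents to ~unit std before consistency training."""
    return latent / latent_norm

def denormalize_latent(latent, latent_norm):
    """Undo the scaling before feeding latents back to the codec decoder."""
    return latent * latent_norm
```

If you swap in a different codec, re-measure this constant on its latents; reusing the old value leaves the consistency model's noise schedule mismatched to the data scale.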
- Stage 2 Setup:
  - In `models/tts/naturalspeech2/flashspeech_trainer_stage2.py`, set the initial weights obtained from Stage 1 training.
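A common pattern when initializing Stage 2 from a Stage 1 checkpoint is to copy only the parameters whose name and shape both match, then load the filtered dict non-strictly (in PyTorch, `model.load_state_dict(filtered, strict=False)`). The framework-agnostic sketch below illustrates the filtering step; the key names in the test are illustrative, not the repository's actual parameter names.

```python
# Hedged, framework-agnostic sketch of selecting Stage 1 weights that are
# safe to load into the Stage 2 model: keep name+shape matches, report the
# rest so nothing is silently dropped.
def filter_stage1_weights(stage1_state, stage2_state):
    """Return (weights safe to load into stage 2, names skipped)."""
    loadable, skipped = {}, []
    for name, tensor in stage1_state.items():
        target = stage2_state.get(name)
        if target is not None and getattr(target, "shape", None) == getattr(tensor, "shape", None):
            loadable[name] = tensor
        else:
            # Missing in stage 2, or present with a different shape.
            skipped.append(name)
    return loadable, skipped
```

Logging the `skipped` list is worth the extra line: it makes it obvious when a renamed module caused Stage 2 to start from random weights instead of the Stage 1 checkpoint.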
- Stage 3 Development:
  - The code for Stage 3 is not yet released. However, you can refer to Stage 1's consistency training to implement it.
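For orientation, the core of consistency training (in the sense of Song et al.'s consistency models, which Stage 1 follows and a Stage 3 implementation could mirror) is: the student network at noise level t_{n+1} should produce the same output as an EMA teacher at the adjacent level t_n on the same noised sample, and the teacher tracks the student by exponential moving average. The toy below uses a scalar linear "model" purely to show the loss and EMA structure; it is not the FlashSpeech network.

```python
# Toy sketch of the consistency-training objective. The linear stand-in for
# f_theta(x, t) and the additive noise schedule are illustrative assumptions.
import numpy as np

def consistency_loss(theta_student, theta_ema, x0, noise, t_n, t_np1):
    """Squared gap between student and EMA teacher on adjacent noise levels."""
    x_tn = x0 + t_n * noise      # same trajectory at noise level t_n
    x_tnp1 = x0 + t_np1 * noise  # same trajectory at the next level t_{n+1}
    f_student = theta_student * x_tnp1  # stand-in for f_theta(x, t_{n+1})
    f_teacher = theta_ema * x_tn        # stand-in for f_ema(x, t_n)
    return float(np.mean((f_student - f_teacher) ** 2))

def ema_update(theta_ema, theta_student, mu=0.99):
    """Teacher parameters track the student by exponential moving average."""
    return mu * theta_ema + (1.0 - mu) * theta_student
```

In a real Stage 3, `theta_student * x` would be the FlashSpeech network evaluated with time conditioning, and the gradient is taken only through the student branch, never the EMA teacher.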
TODO: further organize the project structure and complete the remaining code.
Special thanks to Amphion, as our codebase is primarily borrowed from Amphion.
Thank you for using FlashSpeech!