[Not for merge] Add Emilia Training Recipe for Llasa (cosyvoice2 token) #1887
yuekaizhang wants to merge 8 commits into k2-fsa:master from
Conversation
> See https://arxiv.org/pdf/2407.05361.
>
> > [!CAUTION]
> > The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
I think the terms & conditions may have been taken from another framework and the name changed?
It may be safest to just delete this. (Assuming we decide it makes sense to merge the PR overall, which we can discuss separately.)
Yeah, I copied it from the libritts recipe here: https://github.com/k2-fsa/icefall/tree/master/egs/libritts/TTS#readme. Deleted now.
This seems like good work, and it's nice that you want to include it in our collection of recipes.
Thank you for your feedback! Indeed, the structure of this PR differs from the other recipes in icefall. Initially, I planned to implement it using Lhotse and the icefall training loops. However, since this is a language-model token-prediction task, I found it simpler to use the Hugging Face dataset and Trainer. I have added a [Not for merge] tag so that people can still reference the results in this PR.
Inspired by Llasa, this PR enables continued pretraining of the Qwen2 LLM on the CosyVoice2 semantic-token prediction task.
The predicted semantic tokens can then be used to generate audio with either the pretrained CosyVoice2 U-Net model or the DiT model PR.
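To make the setup concrete, here is a minimal sketch of how a training example for such a token-prediction task might be constructed: text tokens and CosyVoice2 semantic tokens share one vocabulary (semantic ids offset past the text vocabulary), and the loss is masked so only the semantic tokens are predicted. All names, the vocabulary size, and the exact preprocessing are illustrative assumptions, not taken from this PR's code.

```python
# Hypothetical sketch of example construction for semantic-token prediction.
# TEXT_VOCAB_SIZE is an illustrative placeholder, not the PR's actual value.
TEXT_VOCAB_SIZE = 151_936
IGNORE_INDEX = -100  # Hugging Face convention: label positions ignored by the loss


def build_example(text_ids, semantic_ids, text_vocab_size=TEXT_VOCAB_SIZE):
    """Concatenate text tokens with offset semantic tokens, masking the
    text portion so the loss is computed only on semantic-token targets."""
    # Shift semantic token ids past the text vocabulary so both alphabets
    # can share a single embedding table in the LLM.
    offset_semantic = [t + text_vocab_size for t in semantic_ids]
    input_ids = text_ids + offset_semantic
    # Labels: ignore the text prompt, predict the semantic tokens.
    labels = [IGNORE_INDEX] * len(text_ids) + offset_semantic
    return {"input_ids": input_ids, "labels": labels}


example = build_example([1, 2, 3], [10, 20])
# input_ids: [1, 2, 3, 151946, 151956]
# labels:    [-100, -100, -100, 151946, 151956]
```

In this scheme the Hugging Face Trainer can be used unmodified, since the masking via `-100` labels is the standard causal-LM convention.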