[Not for merge] Add Emilia Training Recipe for Llasa (cosyvoice2 token) #1887
yuekaizhang wants to merge 8 commits into k2-fsa:master from
Conversation
> See https://arxiv.org/pdf/2407.05361.
>
> > [!CAUTION]
> > The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
I think the terms & conditions may have been taken from another framework and the name changed?
It may be safest to just delete this. (Assuming we decide it makes sense to merge the PR overall, which we can discuss separately.)
Yeah, I copied it from the libritts recipe here: https://github.com/k2-fsa/icefall/tree/master/egs/libritts/TTS#readme. Deleted now.
This seems like good work, and it's nice that you want to include it in our collection of recipes.
Thank you for your feedback! Indeed, the structure of this PR differs from the other recipes in icefall. Initially, I planned to implement it using Lhotse and the icefall training loops. However, since this is a language-model token-prediction task, I found it simpler to use the Hugging Face dataset and Trainer. I have added a [Not for merge] tag so that people can still reference the results in this PR.
Inspired by Llasa, this PR enables continued pretraining of the Qwen2 LLM on the CosyVoice2 semantic-token prediction task.
The predicted semantic tokens can then be used to generate audio with either the pretrained CosyVoice2 U-Net model or the DiT model PR.
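To make the setup concrete, here is a minimal sketch of how a training example for such a token-prediction task might be constructed: text tokens and CosyVoice2 semantic tokens share one vocabulary (semantic ids offset past the text vocabulary), and the loss is masked so only the semantic tokens are predicted. All names, the vocabulary size, and the exact preprocessing are illustrative assumptions, not taken from this PR's code.

```python
# Hypothetical sketch of example construction for semantic-token prediction.
# TEXT_VOCAB_SIZE is an illustrative placeholder, not the PR's actual value.
TEXT_VOCAB_SIZE = 151_936
IGNORE_INDEX = -100  # Hugging Face convention: label positions ignored by the loss


def build_example(text_ids, semantic_ids, text_vocab_size=TEXT_VOCAB_SIZE):
    """Concatenate text tokens with offset semantic tokens, masking the
    text portion so the loss is computed only on semantic-token targets."""
    # Shift semantic token ids past the text vocabulary so both alphabets
    # can share a single embedding table in the LLM.
    offset_semantic = [t + text_vocab_size for t in semantic_ids]
    input_ids = text_ids + offset_semantic
    # Labels: ignore the text prompt, predict the semantic tokens.
    labels = [IGNORE_INDEX] * len(text_ids) + offset_semantic
    return {"input_ids": input_ids, "labels": labels}


example = build_example([1, 2, 3], [10, 20])
# input_ids: [1, 2, 3, 151946, 151956]
# labels:    [-100, -100, -100, 151946, 151956]
```

In this scheme the Hugging Face Trainer can be used unmodified, since the masking via `-100` labels is the standard causal-LM convention.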