Bilingual Japanese / English zipformer recipe (multi_ja_en) and MLS English recipe (mls_english) #2015

Open
kinanmartin wants to merge 145 commits into k2-fsa:master from reazon-research:multi_ja_en_mls_english_clean

kinanmartin commented Aug 28, 2025

This PR adds a new recipe, mls_english, and overhauls an existing recipe, multi_ja_en.

In a prior PR, we added the multi_ja_en recipe for bilingual Japanese / English models. However, we observed that the resulting models performed significantly worse on English speech than on Japanese speech. We traced this to a large imbalance between the hours of English and Japanese training data, so we decided to replace LibriSpeech with a larger English dataset.

Ultimately, we chose the English portion of Multilingual LibriSpeech (MLS English). Since this dataset did not yet have a recipe in icefall, we created a new mls_english recipe and then updated the multi_ja_en recipe to rely on both the reazonspeech and mls_english recipes.

This PR includes both full recipes, so I will close my prior PR, which implements the mls_english recipe alone.

Please see multi_ja_en/ASR/README.md and mls_english/ASR/README.md for more details about each recipe.

Please let me and @baileyeet know if there are any comments or concerns.

Summary by CodeRabbit

  • New Features

    • Added MLS English ASR recipe with end-to-end prepare, train, and decode (streaming and non-streaming).
    • Introduced multi-dataset data module, Hugging Face dataset support, on-the-fly features, and MUSAN augmentation.
    • Added utilities: dataset downloader, subset creation by hours, transcript generation, BPE training, and manifest path updater.
    • Provided tokenizer support (BPE/char).
  • Documentation

    • New MLS English README and RESULTS with WERs and commands.
    • Updated multi_ja_en README/RESULTS to use MLS English + ReazonSpeech, with revised commands and metrics.
  • Refactor

    • Reorganized data/manifests layout and language assets; unified SentencePiece pipeline.
  • Chores

    • Improved device selection for Whisper Fbank; added shared Zipformer links.
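The summary above mentions a utility for creating subsets by hours, which relates to the English / Japanese data imbalance described earlier. The actual script is not shown in this description, so the following is only a minimal sketch of the general idea, under stated assumptions: the function name `take_subset_by_hours` and the `(utterance_id, duration_seconds)` tuple format are hypothetical and are not taken from the recipe.

```python
from typing import Iterable, List, Tuple

# Hypothetical record type: (utterance_id, duration_in_seconds).
Utterance = Tuple[str, float]


def take_subset_by_hours(
    utterances: Iterable[Utterance], target_hours: float
) -> Tuple[List[Utterance], float]:
    """Keep utterances in order until the next one would exceed the budget.

    Returns the kept utterances and their total duration in hours.
    This is an illustrative sketch, not the recipe's implementation.
    """
    budget_seconds = target_hours * 3600.0
    kept: List[Utterance] = []
    total_seconds = 0.0
    for utt_id, duration in utterances:
        if total_seconds + duration > budget_seconds:
            break  # stop at the first utterance that would overshoot
        kept.append((utt_id, duration))
        total_seconds += duration
    return kept, total_seconds / 3600.0


if __name__ == "__main__":
    demo = [("utt-a", 1800.0), ("utt-b", 1800.0), ("utt-c", 1800.0)]
    subset, hours = take_subset_by_hours(demo, target_hours=1.0)
    print([utt_id for utt_id, _ in subset], hours)
```

A real pipeline would more likely operate on manifest objects and might shuffle before selecting; the in-order cutoff here is just the simplest way to hit a target number of hours.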


6 participants