Easy Workflow for creating Training Data for TTS.
git clone https://github.com/ItsJamin/another-tts
cd another-tts
docker compose up
- Create .venv (Python 3.13.7):
  python -m venv .venv
- Activate the virtual environment:
  Linux: source .venv/bin/activate
  Windows: .venv\Scripts\activate
- Install libraries:
  pip install -r requirements.txt
- Install ffmpeg if it is not already on the system (it must be callable from the command line as ffmpeg).
- Create a .env file in the env/ directory with the following content:
  CURRENT_DATASET = "<your_dataset_name>"
  CURRENT_LANGUAGE = "de" # currently only German sentences
  Replace <your_dataset_name> with the name your dataset should have.
- Start the server with python app.py and visit localhost:5067.
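Before starting the server, it can help to confirm that ffmpeg is reachable and that the .env file contains both keys. The following is a minimal sketch, not part of the project: the helper names parse_env_text and check_setup are hypothetical, and the parser only assumes the simple KEY = "value" lines shown above (the application itself may load the file differently).

```python
import shutil
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY = "value" lines; skips blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep only the text between the quotes.
            quote = value[0]
            end = value.find(quote, 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop any trailing inline comment.
            value = value.split("#", 1)[0].strip()
        values[key.strip()] = value
    return values


def check_setup(env_path: str = "env/.env") -> list[str]:
    """Return a list of human-readable setup problems (empty if none found)."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    path = Path(env_path)
    if not path.is_file():
        problems.append(f"{env_path} does not exist")
        return problems
    values = parse_env_text(path.read_text(encoding="utf-8"))
    for key in ("CURRENT_DATASET", "CURRENT_LANGUAGE"):
        if not values.get(key):
            problems.append(f"{key} is missing or empty in {env_path}")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("WARNING:", problem)
```

Run it from the repository root; no output means both checks passed.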
- data/sentences/CURRENT_LANGUAGE/ - Text files of what to say.
- data/datasets/CURRENT_DATASET/ - Place where recordings are saved.
- data/datasets/CURRENT_DATASET/metadata.csv - Which audio files contain what text (see exampleDataset).
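To make the layout concrete, here is a small sketch that derives the three locations from the two .env values. The function name dataset_paths and the example dataset name my_voice are illustrative assumptions; only the directory layout itself comes from the list above.

```python
from pathlib import Path


def dataset_paths(dataset: str, language: str, root: str = "data") -> dict[str, Path]:
    """Build the sentence, dataset, and metadata paths for one configuration."""
    base = Path(root)
    dataset_dir = base / "datasets" / dataset
    return {
        "sentences": base / "sentences" / language,   # text files of what to say
        "dataset": dataset_dir,                       # where recordings are saved
        "metadata": dataset_dir / "metadata.csv",     # audio file to text mapping
    }


if __name__ == "__main__":
    # "my_voice" is a placeholder dataset name, not one shipped with the project.
    for name, path in dataset_paths("my_voice", "de").items():
        status = "found" if path.exists() else "missing"
        print(f"{name}: {path} ({status})")
```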
- GUI for easily recording data
- GUI for an overview of recorded data
- Docker implementation
- Filter out or show already recorded lines
- Metadata for voice (mood, whisper, etc.)?
- Transcript text must match the spoken audio verbatim
- No paraphrasing, omissions, or additions
- All text must be fully normalized (numbers, dates, abbreviations, special symbols written out)
- Do use periods, question marks, commas, and apostrophes.
- No inconsistent or ambiguous punctuation
- One sentence per line
- Neutral, non-acted language (does not mean boring)
- Lines should vary in length (short, medium, long) to ensure phonetic coverage
- Questions, statements, numbers, dates, and abbreviations must be represented
- The same text normalization rules must be applied consistently across the entire dataset
Examples:
- I have 42 apples. -> I have forty two apples.
- The contract was signed in 2025. -> The contract was signed in twenty twenty five.
- Dr. Smith will arrive at 9 a.m. -> Doctor Smith will arrive at nine a m.
- The battery is at 80%. -> The battery is at eighty percent.
- "Wait, how did you 2 meet again?" -> Wait, how did you two meet again?
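The examples above can be approximated with a small normalizer. This is a rough sketch under stated assumptions, not the project's method: it covers English only (the tool currently ships German sentences), handles integers from 0 to 99, reads four-digit numbers as years in two pairs, expands "Dr." and "%", and raises an error for anything else rather than guessing. It does not handle "a.m." or other abbreviations. All function names are hypothetical.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digit_words(n: int) -> str:
    """Spell out an integer in the range 0..99 without hyphens."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def number_words(digits: str) -> str:
    """Spell out a digit string; unsupported shapes raise ValueError."""
    value = int(digits)
    if len(digits) <= 2:
        return two_digit_words(value)
    # Assumption: every four-digit number is a year read as two pairs.
    if len(digits) == 4 and digits[0] != "0" and value % 100 >= 10:
        return f"{two_digit_words(value // 100)} {two_digit_words(value % 100)}"
    raise ValueError(f"no normalization rule for number {digits!r}")


def normalize_line(text: str) -> str:
    """Apply the abbreviation, percent, and number rules to one sentence."""
    text = text.strip().strip('"')
    text = re.sub(r"\bDr\.", "Doctor", text)
    text = re.sub(r"(\d+)\s*%", lambda m: f"{m.group(1)} percent", text)
    text = re.sub(r"\d+", lambda m: number_words(m.group(0)), text)
    return text


if __name__ == "__main__":
    print(normalize_line("I have 42 apples."))
    print(normalize_line("The battery is at 80%."))
```

Failing loudly on unsupported numbers keeps the rules consistent across the dataset: a line is either normalized by a known rule or flagged for manual editing.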
- WAV format, PCM (uncompressed), mono
- Consistent sample rate across all files
- Recommended: 44.1 kHz, 16-bit
- One sentence per file
- File duration ideally between 1.5 and 8 seconds
- No leading or trailing silence greater than ~200 ms
- No clipping, distortion, or background noise
- Stable loudness across all recordings
- Target: −20 to −16 LUFS, peak below −1 dBFS
- Recorded in the same environment with the same microphone and setup
- Fixed microphone distance and speaking posture
- Consistent speaking style, pace, and energy level
- No whispering, shouting, or expressive acting
- Speaker must be healthy and vocally consistent across sessions
- File names must be deterministic, sortable, and space-free (e.g. speaker_0001.wav)
- Stereo files, variable sample rates, and inconsistent bit depth are not allowed
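Several of these requirements can be checked automatically with Python's standard wave module. The sketch below is illustrative only: check_wav is a hypothetical helper, the file name pattern is an assumption modelled on speaker_0001.wav, and the 44.1 kHz default follows the recommendation above. It verifies format, duration, and peak level; loudness in LUFS, silence trimming, and noise would need additional tooling.

```python
import math
import re
import struct
import wave
from pathlib import Path

# Assumed naming scheme modelled on "speaker_0001.wav"; adjust to your own.
FILENAME_PATTERN = re.compile(r"^[A-Za-z0-9]+_\d{4}\.wav$")


def check_wav(path, expected_rate: int = 44100) -> list[str]:
    """Return a list of requirement violations for one recording."""
    problems = []
    if not FILENAME_PATTERN.match(Path(path).name):
        problems.append("file name does not match the expected pattern")
    try:
        with wave.open(str(path), "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()
            rate = wav.getframerate()
            frame_count = wav.getnframes()
            frames = wav.readframes(frame_count)
    except wave.Error as exc:
        # The wave module only reads uncompressed PCM WAV files.
        problems.append(f"not a readable PCM WAV file: {exc}")
        return problems
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    duration = frame_count / rate
    if not 1.5 <= duration <= 8.0:
        problems.append(f"duration {duration:.2f} s is outside 1.5 to 8 s")
    if channels == 1 and sample_width == 2 and frames:
        count = len(frames) // 2
        samples = struct.unpack(f"<{count}h", frames[: count * 2])
        peak = max(abs(sample) for sample in samples)
        # Peak level relative to 16-bit full scale (32768).
        peak_dbfs = 20 * math.log10(peak / 32768) if peak else -math.inf
        if peak_dbfs >= -1.0:
            problems.append(f"peak {peak_dbfs:.1f} dBFS is not below -1 dBFS")
    return problems


if __name__ == "__main__":
    import sys
    for wav_path in sys.argv[1:]:
        for problem in check_wav(wav_path):
            print(f"{wav_path}: {problem}")
```

Invoked with one or more WAV paths as arguments, it prints one line per violation and nothing for files that pass.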