Check out my blog, Real-Time Voice Cloning: A Deep Dive into Revolutionary TTS Technologies!
Real-time-voice-cloner is an open-source project contributed by Corentin Jemine. Find the link to the repo here.
We attempted to implement the project from scratch, with Corentin Jemine's repo as a reference. Working on this has been a great learning experience in audio signal processing and deep learning.
We use Jupyter notebooks for preprocessing, model training, and inference, as they make the flow of things easier to follow.
The voice cloner is better described as a multi-speaker Text-To-Speech (TTS) system. Its main capability is to generate speech from text in the voice of a target speaker, given only about 5 seconds of sample audio. Traditional TTS systems have been built by training models on huge volumes of transcribed speech data, which is usually very costly.
The project is an implementation of the research paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS). The paper proposes an encoder based on d-vectors, which are used for speaker recognition, to generate embeddings from audio samples. These embeddings capture the voice characteristics of the speaker and help tell different speakers apart.
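To make the idea of "telling speakers apart" concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity. The 4-dimensional vectors are made-up toy values, not real encoder output (real d-vectors have far more dimensions), and the helper names are hypothetical rather than taken from the repo.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors: dot product of their unit vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy embeddings (assumed values for illustration only).
speaker_a_utt1 = [0.9, 0.1, 0.3, 0.2]
speaker_a_utt2 = [0.8, 0.2, 0.35, 0.15]
speaker_b_utt1 = [0.1, 0.9, 0.2, 0.7]

same_speaker = cosine_similarity(speaker_a_utt1, speaker_a_utt2)
different_speaker = cosine_similarity(speaker_a_utt1, speaker_b_utt1)

print(f"same speaker:      {same_speaker:.3f}")
print(f"different speaker: {different_speaker:.3f}")
```

With a well-trained encoder, utterances from the same speaker should score noticeably higher than utterances from different speakers, which is the property this toy example imitates.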
The system consists of three main components:
encoder - Sub-system that generates the embeddings of audio samples.
synthesizer - The heart of the system. It generates a mel spectrogram in the target speaker's voice for the provided text.
vocoder - Converts the spectrogram into natural, human-like speech audio.
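The way the three components hand data to one another can be sketched as below. In the SV2TTS paper the synthesizer outputs a mel spectrogram, which the vocoder then turns into a waveform. Everything here is a stand-in: the `stub_*` functions, `clone_voice`, `N_MELS`, and `HOP` are hypothetical names invented for illustration, and the stubs only mimic the shapes of the data, not the actual neural networks or the repo's API.

```python
from typing import Callable, List

Embedding = List[float]
MelSpectrogram = List[List[float]]  # frames x mel channels
Waveform = List[float]

N_MELS = 4  # assumed toy number of mel channels
HOP = 3     # assumed toy number of audio samples per spectrogram frame


def stub_encoder(reference_audio: Waveform) -> Embedding:
    """Stand-in for the speaker encoder: audio in, fixed-size vector out."""
    mean = sum(reference_audio) / len(reference_audio)
    peak = max(abs(s) for s in reference_audio)
    return [mean, peak]


def stub_synthesizer(text: str, embedding: Embedding) -> MelSpectrogram:
    """Stand-in for the synthesizer: one fake mel frame per character,
    shifted by the speaker embedding to mimic conditioning."""
    bias = sum(embedding)
    return [[bias + ord(ch) / 255.0] * N_MELS for ch in text]


def stub_vocoder(mel: MelSpectrogram) -> Waveform:
    """Stand-in for the vocoder: expands each frame into HOP samples."""
    return [frame[0] for frame in mel for _ in range(HOP)]


def clone_voice(
    reference_audio: Waveform,
    text: str,
    encoder: Callable[[Waveform], Embedding],
    synthesizer: Callable[[str, Embedding], MelSpectrogram],
    vocoder: Callable[[MelSpectrogram], Waveform],
) -> Waveform:
    """Chain the three stages: encoder -> synthesizer -> vocoder."""
    embedding = encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


reference = [0.0, 0.5, -0.5, 0.25]  # toy "5 second" reference clip
audio = clone_voice(reference, "hello", stub_encoder, stub_synthesizer, stub_vocoder)
print(len(audio))
```

Passing the stages in as callables keeps the flow explicit: each stub could be swapped for a trained model without changing `clone_voice`.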