A simple dark-themed interface for testing and basic context management of the CSM base model by Sesame AI Labs.
As they have mentioned, this is a base model and will require fine-tuning to produce consistent results. Hopefully this interface can get you started.
- Generating high-quality speech with customizable parameters
- Managing context for better voice consistency
Please read their README, FAQ, usage policy, and misuse and abuse notice at SesameAILabs/CSM. A copy is also included in this project's root as SesameAILabs_CSM-README.md.
- Handy start script that sets up the SesameAILabs/CSM Python backend and the csm-ui front end for you
- Test and generate high-quality speech from text using CSM 1B
- Support for multiple speakers
- Save and apply contexts for improved coherence. Any generated audio is automatically transcribed and added to the currently active session. See the usage instructions below
- Text or audio inputs. The Whisper model is used to automatically transcribe audio and add it as context
- Audio visualization with animated waveforms
- Add existing audio files to context for better quality
- Adjustable generation parameters
- Dark OLED-friendly theme with silver/white accents
- Save and load sessions with all settings and contexts
- Real-time progress indicators during generation
- Python 3.10 (higher versions may work but are untested)
- Node.js 18 or higher
- NPM 8 or higher
- Accept the agreements for the gated Hugging Face models with your Hugging Face account:
- Mac ✅ (works on M3 and M4 and should work on other M-series hardware, but for best performance use a CUDA-enabled system)
- Windows (with CUDA enabled) ✅
- Linux (with CUDA enabled) ✅
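The version minimums listed above can be checked before running anything. The following is a minimal sketch, not part of this project: the helper names (`parse_version`, `meets_minimum`, `tool_version`, `check_prerequisites`) are hypothetical, and it assumes `node` and `npm` print their version when called with `--version`.

```python
from __future__ import annotations

import re
import shutil
import subprocess
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number from a string such as 'v18.19.0'."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version_text: str, minimum: tuple[int, ...]) -> bool:
    """Return True if the parsed version is at least `minimum`."""
    return parse_version(version_text) >= minimum


def tool_version(command: str) -> str | None:
    """Return the output of `<command> --version`, or None if the tool is missing."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


def check_prerequisites() -> dict[str, bool]:
    """Check the Python, Node.js and NPM minimums from the list above."""
    node = tool_version("node")
    npm = tool_version("npm")
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "node>=18": node is not None and meets_minimum(node, (18,)),
        "npm>=8": npm is not None and meets_minimum(npm, (8,)),
    }
```

Calling `check_prerequisites()` returns a dictionary of pass/fail flags. It does not check the Hugging Face agreements or CUDA availability.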
- Accept the gated Hugging Face model agreements with your Hugging Face account.
- Clone this repository:
git clone https://github.com/yourusername/csm-ui.git
cd csm-ui
- Run the start script:
./start.sh
For Windows:
./start.bat
(It will set up everything and run the front end)
Make sure you have already accepted the Hugging Face agreements. The script uses uv and tries to install it if it is not already present.
IMPORTANT: Be patient! The first run will install the required models along with the whisper model for transcriptions. This process can take a while depending on your internet speed. Go get a coffee.
- If everything went well, the app should be running on port 1885. Open http://localhost:1885 and you should see CSM UI.
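If the page does not load, it can help to confirm that something is actually listening on the port. The snippet below is a small sketch using only the Python standard library. The function name `is_port_open` is hypothetical, and the default of 1885 is taken from the text above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 1885, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A result of `True` only means a process accepted the connection. It does not prove that the process is CSM UI.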
(The Python backend is essentially the original code, so please refer to the original CSM authors' README if you run into issues.)
- Clone this repository:
git clone https://github.com/yourusername/csm-ui.git
cd csm-ui
- Set up the Python environment:
uv venv --python 3.10
source .venv/bin/activate # On Windows, use .venv\Scripts\activate
uv pip install -r requirements.txt
- Install the Node.js dependencies:
cd csm-ui
npm install
cd ..
You can start both the web interface and the Python backend together using the provided script:
./start.sh
The script provides several options to customize how the application runs:
- Run in development mode:
./start.sh --dev
- Run on a custom port:
./start.sh --port=8080
- Combine options:
./start.sh --dev --port=8080
By default, the script runs the application in production mode on port 1885.
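The start script itself is a shell script, and its internals are not shown here. Purely as an illustration of how the two options and their defaults combine, here is a Python sketch with `argparse`. The function name `parse_start_options` is hypothetical and is not part of the project.

```python
import argparse


def parse_start_options(argv: list[str]) -> argparse.Namespace:
    """Mirror the documented options: --dev and --port (default 1885)."""
    parser = argparse.ArgumentParser(prog="start.sh")
    # Without --dev, the documented behaviour is production mode.
    parser.add_argument("--dev", action="store_true", help="run in development mode")
    parser.add_argument("--port", type=int, default=1885, help="port for the web interface")
    return parser.parse_args(argv)
```

For example, `parse_start_options(["--dev", "--port=8080"])` yields `dev=True` and `port=8080`, while an empty argument list yields the defaults described above.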
Or run them separately:
- Start the Next.js development server:
cd csm-ui
npm run dev
- In a separate terminal, make sure the Python environment is activated and the current directory is the root of the project.
The web interface will be available at http://localhost:1885.
- Select a speaker (0 or 1)
- Enter the text you want to convert to speech or:
  - Record audio directly using the microphone button
  - Upload an audio file using the upload button (it will be automatically transcribed)
- Adjust parameters as needed:
  - Max Audio Length: Maximum duration of the generated audio in seconds
  - Temperature: Controls the randomness of the generation (higher = more random)
  - Top-K: Number of most likely tokens to consider at each step
- Click "Generate Speech" to create the audio
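To make the Temperature and Top-K parameters concrete, here is a minimal sketch of generic top-k sampling with temperature in plain Python. It illustrates the general technique only and is not taken from the CSM code. The function name `sample_top_k` and the example logits are assumptions.

```python
import math
import random


def sample_top_k(logits: list[float], temperature: float, top_k: int, rng: random.Random) -> int:
    """Sample one token index from `logits` using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Keep only the indices of the top_k largest logits.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k]
    # Divide by the temperature: higher values flatten the distribution (more random).
    scaled = [logits[i] / temperature for i in kept]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(kept, weights=weights, k=1)[0]
```

With `top_k=1` the most likely token is always chosen regardless of temperature. Larger `top_k` and higher temperature both widen the set of outcomes that can be sampled.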
The base CSM model doesn't maintain a consistent voice by default. For more consistent voices you need to fine-tune the model. Adding context can also improve the audio; follow these steps:
- Always use the same speaker ID (0 or 1) for a character/voice
- Use context effectively - this is crucial for voice consistency
- Create a dedicated context for each voice you want to maintain
- Build up the context by generating several utterances with the same speaker ID
- The more samples you add to the context with the same speaker ID, the more consistent the voice will become
- Save your session to preserve your context for future use
Note that for production-quality consistent voices, the model would need to be fine-tuned on specific voice data.
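The advice above (one dedicated context per voice, always the same speaker ID) can be sketched as a small data structure. This is a hypothetical illustration and not the structure the UI or the CSM backend actually uses. The names `Utterance` and `VoiceContext` and their fields are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Utterance:
    """One context entry: who spoke, what was said, and where the audio lives."""
    speaker: int
    text: str
    audio_path: str


@dataclass
class VoiceContext:
    """A named context dedicated to a single voice (hypothetical structure)."""
    name: str
    speaker: int
    utterances: list[Utterance] = field(default_factory=list)

    def add(self, utterance: Utterance) -> None:
        # Reject mixed speaker IDs so the context stays dedicated to one voice.
        if utterance.speaker != self.speaker:
            raise ValueError(
                f"context {self.name!r} is for speaker {self.speaker}, "
                f"got speaker {utterance.speaker}"
            )
        self.utterances.append(utterance)
```

Rejecting a mismatched speaker ID at insertion time is one way to enforce the "same speaker ID per voice" rule. A real implementation could equally warn instead of raising.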
Sessions allow you to save and restore your generation settings and context for future use:
- Set up your desired parameters, enter text, and select an active context
- Click "Save Session" to store your current configuration
- Give the session a descriptive name and save it
- Later, click "Load Session" to view and restore any saved session
- Your previous text, speaker, parameters, and active context will be restored
All sessions are stored locally in your browser and persist between visits.
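The UI keeps sessions in the browser, and its storage format is not documented here. As a rough Python illustration of what a saved session might carry (text, speaker, parameters, and active context, per the steps above), the sketch below round-trips a record through JSON. The `Session` class and all of its field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical session record; field names are illustrative only."""
    name: str
    text: str
    speaker: int
    max_audio_length_s: float
    temperature: float
    top_k: int
    active_context: Optional[str] = None


def dump_session(session: Session) -> str:
    """Serialize a session to a JSON string."""
    return json.dumps(asdict(session))


def load_session(payload: str) -> Session:
    """Rebuild a session from a JSON string produced by dump_session."""
    return Session(**json.loads(payload))
```

Restoring a session this way returns every saved field unchanged, which matches the described behaviour of "Load Session".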
Context helps the model maintain a consistent style across multiple utterances:
- Go to the "Manage Context" tab
- Create a new context by entering a name and clicking "Create"
- To add generated speech to a context:
  - Switch back to the "Generate Speech" tab and make sure your context is active
  - Generate speech, and the result will be added to the active context
- To add existing audio files to a context:
  - Select a context to make it active
  - Click "Add Audio to Context"
  - Either:
    - Upload an audio file using the upload button
    - Record audio directly using the microphone button
  - The audio will be automatically transcribed using Whisper
  - You can edit the transcription if needed
  - Select the speaker ID (use it consistently for better results)
  - Click "Add to Context"
- For subsequent generations, the model will use all previous utterances as context
This feature allows you to add high-quality audio recordings from external sources to use as context, which can significantly improve the quality and coherence of generated speech.
Once audio is generated, you can download it by clicking the "Download" button.
The interface uses a sleek OLED-friendly dark theme with silver/white accents, designed for both aesthetics and readability. The waveform visualization appears in a silver/white color scheme that complements the overall design.
Original Authors' Important Notes (CSM)
Does this model come with any voices?
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
Can I converse with the model?
CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
Does it support other languages?
The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.
This project contains components under different licenses:
- Python code and scripts in the root directory are licensed under the Apache License 2.0.
- The csm-ui directory containing the Next.js application is licensed under the MIT License.
Please see the LICENSE file in each directory for the full license text.
- Massive thanks and credit to the CSM team at Sesame AI Labs.
