This repository showcases the integration between Agent Voice Response and Deepgram's Speech-to-Speech API. The application leverages Deepgram's powerful speech processing capabilities to provide intelligent, context-aware responses in real-time audio format.
To set up and run this project, you will need:
- Node.js and npm installed
- A Deepgram API key with access to the Speech-to-Speech API
- WebSocket support in your environment
```bash
git clone https://github.com/agentvoiceresponse/avr-sts-deepgram.git
cd avr-sts-deepgram
npm install
```

Create a `.env` file in the root of the project (see `.env.example`). The following variables are available:
Required:

| Variable | Description |
|---|---|
| `DEEPGRAM_API_KEY` | Your Deepgram API key |
| `AGENT_PROMPT` | System prompt that defines the AI agent's behavior and personality |
Optional -- Server:

| Variable | Description | Default |
|---|---|---|
| `PORT` | WebSocket server port | `6033` |
Optional -- Audio Input:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_SAMPLE_RATE` | Fallback sample rate used when input/output-specific rates are not set | `8000` |
| `DEEPGRAM_INPUT_ENCODING` | Audio encoding for the input stream (`linear16`, `mulaw`, `alaw`) | `linear16` |
| `DEEPGRAM_INPUT_SAMPLE_RATE` | Sample rate in Hz for the input stream | Value of `DEEPGRAM_SAMPLE_RATE` |
Optional -- Audio Output:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_OUTPUT_ENCODING` | Audio encoding for the output stream (`linear16`, `mulaw`, `alaw`) | `linear16` |
| `DEEPGRAM_OUTPUT_SAMPLE_RATE` | Sample rate in Hz for the output stream | Value of `DEEPGRAM_SAMPLE_RATE` |
| `DEEPGRAM_OUTPUT_CONTAINER` | Output audio container format (`none`, `wav`) | `none` |
Optional -- Agent:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_LANGUAGE` | Agent language code (e.g. `en`, `it`, `es`, `fr`, `de`) | `en` |
| `DEEPGRAM_GREETING` | Initial greeting message spoken by the agent | Hi there, I'm your virtual assistant—how can I help today? |
Optional -- Listen (STT) Provider:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_LISTEN_PROVIDER` | Speech-to-text provider (`deepgram`) | `deepgram` |
| `DEEPGRAM_ASR_MODEL` | STT model name | `nova-3` |
Optional -- Think (LLM) Provider:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_THINK_PROVIDER` | LLM provider (`open_ai`, `anthropic`, `groq`, `google`) | `open_ai` |
| `DEEPGRAM_THINK_MODEL` | LLM model name | `gpt-4o-mini` |
Optional -- Speak (TTS) Provider:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_SPEAK_PROVIDER` | Text-to-speech provider (`deepgram`, `eleven_labs`) | `deepgram` |
| `DEEPGRAM_TTS_MODEL` | TTS model name | `aura-2-thalia-en` |
Note: The TTS model name encodes the language (e.g. `aura-2-thalia-en` for English, `aura-2-melia-it` for Italian). Make sure the model matches the language set in `DEEPGRAM_LANGUAGE`, otherwise the connection will fail.

Available Deepgram Aura-2 Italian models (`it`): `aura-2-melia-it`, `aura-2-elio-it`, `aura-2-flavio-it`, `aura-2-maia-it`, `aura-2-cinzia-it`, `aura-2-cesare-it`, `aura-2-livia-it`, `aura-2-perseo-it`, `aura-2-dionisio-it`, `aura-2-demetra-it`

Full list of models: https://developers.deepgram.com/docs/tts-models
Optional -- Advanced:

| Variable | Description | Default |
|---|---|---|
| `DEEPGRAM_KEEPALIVE_INTERVAL` | Keep-alive ping interval in milliseconds | `5000` |
| `AMI_URL` | URL of the AMI service used by call-control tools (`avr_transfer`, `avr_hangup`) | `http://127.0.0.1:6006` |
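Putting the variables above together, a minimal `.env` might look like the following. The optional values shown are simply the documented defaults, and the `AGENT_PROMPT` text is an illustrative placeholder, not a value shipped with the project:

```env
# Required
DEEPGRAM_API_KEY=your-deepgram-api-key
AGENT_PROMPT="You are a helpful voice assistant."

# Optional (documented defaults shown)
PORT=6033
DEEPGRAM_INPUT_ENCODING=linear16
DEEPGRAM_INPUT_SAMPLE_RATE=8000
DEEPGRAM_OUTPUT_ENCODING=linear16
DEEPGRAM_OUTPUT_SAMPLE_RATE=8000
DEEPGRAM_OUTPUT_CONTAINER=none
DEEPGRAM_LANGUAGE=en
DEEPGRAM_ASR_MODEL=nova-3
DEEPGRAM_THINK_PROVIDER=open_ai
DEEPGRAM_THINK_MODEL=gpt-4o-mini
DEEPGRAM_TTS_MODEL=aura-2-thalia-en
```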
Start the application by running the following command:

```bash
node index.js
```

The server will start on the port defined in the `PORT` environment variable (default: `6033`).
The Agent Voice Response system integrates with Deepgram's Speech-to-Speech API to provide intelligent audio-based responses to user queries. The server receives audio input from users, forwards it to Deepgram's API, and then returns the model's response as audio in real-time using WebSocket communication.
- Express.js Server: Handles incoming audio streams from clients
- WebSocket Communication: Manages real-time communication with Deepgram's API
- Audio Processing: Handles audio format conversion and streaming
- Real-time Streaming: Processes and streams audio data in real-time
The application is configured to work with:
- Input Audio: 16-bit PCM at 8kHz
- Output Audio: 16-bit PCM at 8kHz
- Encoding: Linear16 format
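These audio settings, together with the agent variables documented earlier, are what the server communicates to Deepgram when a session opens. As a rough sketch, the environment variables might map into a settings payload like the one below. The exact field names are an assumption modeled on Deepgram's Voice Agent settings message; check the project's `index.js` and Deepgram's documentation for the authoritative shape:

```javascript
// Illustrative only: build an agent-settings object from environment
// variables, falling back to the defaults documented in the tables above.
function buildSettings(env) {
  const fallbackRate = parseInt(env.DEEPGRAM_SAMPLE_RATE || '8000', 10);
  return {
    type: 'Settings',
    audio: {
      input: {
        encoding: env.DEEPGRAM_INPUT_ENCODING || 'linear16',
        // parseInt('') is NaN, so an unset rate falls back to fallbackRate
        sample_rate: parseInt(env.DEEPGRAM_INPUT_SAMPLE_RATE || '', 10) || fallbackRate,
      },
      output: {
        encoding: env.DEEPGRAM_OUTPUT_ENCODING || 'linear16',
        sample_rate: parseInt(env.DEEPGRAM_OUTPUT_SAMPLE_RATE || '', 10) || fallbackRate,
        container: env.DEEPGRAM_OUTPUT_CONTAINER || 'none',
      },
    },
    agent: {
      language: env.DEEPGRAM_LANGUAGE || 'en',
      listen: {
        provider: { type: env.DEEPGRAM_LISTEN_PROVIDER || 'deepgram', model: env.DEEPGRAM_ASR_MODEL || 'nova-3' },
      },
      think: {
        provider: { type: env.DEEPGRAM_THINK_PROVIDER || 'open_ai', model: env.DEEPGRAM_THINK_MODEL || 'gpt-4o-mini' },
        prompt: env.AGENT_PROMPT,
      },
      speak: {
        provider: { type: env.DEEPGRAM_SPEAK_PROVIDER || 'deepgram', model: env.DEEPGRAM_TTS_MODEL || 'aura-2-thalia-en' },
      },
      greeting: env.DEEPGRAM_GREETING,
    },
  };
}

const settings = buildSettings(process.env);
```

Note how `DEEPGRAM_SAMPLE_RATE` acts purely as a fallback: it is only used when the input- or output-specific rate is unset.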
This endpoint accepts an audio stream and returns a streamed audio response generated by Deepgram.
Request:
- Content-Type: audio/x-raw
- Format: 16-bit PCM at 8kHz
- Method: POST
Response:
- Content-Type: text/event-stream
- Format: 16-bit PCM at 8kHz
- Streamed audio data in real-time
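Clients must send raw 16-bit little-endian PCM at 8 kHz. As an illustration, here is a small helper (the function name is ours, not part of the project) that converts floating-point samples in `[-1, 1]` to the expected linear16 byte layout:

```javascript
// Illustrative helper: convert float samples in [-1, 1] to 16-bit
// little-endian PCM (linear16), the request format this endpoint expects.
function floatTo16BitPCM(samples) {
  const buf = Buffer.alloc(samples.length * 2); // 2 bytes per sample
  samples.forEach((sample, i) => {
    const s = Math.max(-1, Math.min(1, sample)); // clamp out-of-range values
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  });
  return buf;
}

// 20 ms of silence at 8 kHz = 160 samples = 320 bytes of linear16 audio
const frame = floatTo16BitPCM(new Array(160).fill(0));
```

At 8 kHz with 2 bytes per sample, one second of audio is 16,000 bytes, which is useful for sizing chunks when streaming the request body.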
See the Environment Variables section above for the full list of configurable options.
The application includes comprehensive error handling for:
- WebSocket connection issues
- Audio processing errors
- Deepgram API errors
- Stream processing errors
All errors are logged to the console and appropriate error messages are returned to the client.
We would like to express our gratitude to all the contributors who have helped make this project possible:
- Mirko Bertone - For their valuable contributions and support
- GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
- Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
- Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
- NPM: https://www.npmjs.com/~agentvoiceresponse - Browse our packages.
- Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.
AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.
MIT License - see the LICENSE file for details.