
Agent Voice Response - Deepgram Speech-to-Speech Integration


This repository showcases the integration between Agent Voice Response and Deepgram's Speech-to-Speech API. The application leverages Deepgram's speech processing capabilities to provide intelligent, context-aware responses as real-time audio.

Prerequisites

To set up and run this project, you will need:

  1. Node.js and npm installed
  2. A Deepgram API key with access to the Speech-to-Speech API
  3. WebSocket support in your environment

Setup

1. Clone the Repository

```bash
git clone https://github.com/agentvoiceresponse/avr-sts-deepgram.git
cd avr-sts-deepgram
```

2. Install Dependencies

```bash
npm install
```

3. Configure Environment Variables

Create a .env file in the root of the project (see .env.example). The following variables are available:

Required:

| Variable | Description |
| --- | --- |
| `DEEPGRAM_API_KEY` | Your Deepgram API key |
| `AGENT_PROMPT` | System prompt that defines the AI agent's behavior and personality |

Optional -- Server:

| Variable | Description | Default |
| --- | --- | --- |
| `PORT` | WebSocket server port | `6033` |

Optional -- Audio Input:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_SAMPLE_RATE` | Fallback sample rate (Hz) used when input/output-specific rates are not set | `8000` |
| `DEEPGRAM_INPUT_ENCODING` | Audio encoding for the input stream (`linear16`, `mulaw`, `alaw`) | `linear16` |
| `DEEPGRAM_INPUT_SAMPLE_RATE` | Sample rate in Hz for the input stream | value of `DEEPGRAM_SAMPLE_RATE` |

Optional -- Audio Output:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_OUTPUT_ENCODING` | Audio encoding for the output stream (`linear16`, `mulaw`, `alaw`) | `linear16` |
| `DEEPGRAM_OUTPUT_SAMPLE_RATE` | Sample rate in Hz for the output stream | value of `DEEPGRAM_SAMPLE_RATE` |
| `DEEPGRAM_OUTPUT_CONTAINER` | Output audio container format (`none`, `wav`) | `none` |

Optional -- Agent:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_LANGUAGE` | Agent language code (e.g. `en`, `it`, `es`, `fr`, `de`) | `en` |
| `DEEPGRAM_GREETING` | Initial greeting message spoken by the agent | Hi there, I'm your virtual assistant—how can I help today? |

Optional -- Listen (STT) Provider:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_LISTEN_PROVIDER` | Speech-to-text provider (`deepgram`) | `deepgram` |
| `DEEPGRAM_ASR_MODEL` | STT model name | `nova-3` |

Optional -- Think (LLM) Provider:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_THINK_PROVIDER` | LLM provider (`open_ai`, `anthropic`, `groq`, `google`) | `open_ai` |
| `DEEPGRAM_THINK_MODEL` | LLM model name | `gpt-4o-mini` |

Optional -- Speak (TTS) Provider:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_SPEAK_PROVIDER` | Text-to-speech provider (`deepgram`, `eleven_labs`) | `deepgram` |
| `DEEPGRAM_TTS_MODEL` | TTS model name | `aura-2-thalia-en` |

Note: The TTS model name encodes the language (e.g. `aura-2-thalia-en` for English, `aura-2-melia-it` for Italian). Make sure the model matches the language set in `DEEPGRAM_LANGUAGE`, otherwise the connection will fail.

Available Deepgram Aura-2 Italian models (`it`): `aura-2-melia-it`, `aura-2-elio-it`, `aura-2-flavio-it`, `aura-2-maia-it`, `aura-2-cinzia-it`, `aura-2-cesare-it`, `aura-2-livia-it`, `aura-2-perseo-it`, `aura-2-dionisio-it`, `aura-2-demetra-it`

Full list of models: https://developers.deepgram.com/docs/tts-models

Optional -- Advanced:

| Variable | Description | Default |
| --- | --- | --- |
| `DEEPGRAM_KEEPALIVE_INTERVAL` | Keep-alive ping interval in milliseconds | `5000` |
| `AMI_URL` | URL of the AMI service used by call-control tools (`avr_transfer`, `avr_hangup`) | `http://127.0.0.1:6006` |
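Putting the variables together, a minimal `.env` for 8 kHz telephony audio might look like the following (all values are illustrative, not defaults you must use):

```ini
DEEPGRAM_API_KEY=your-deepgram-api-key
AGENT_PROMPT="You are a concise, friendly phone assistant."
PORT=6033
DEEPGRAM_INPUT_ENCODING=linear16
DEEPGRAM_INPUT_SAMPLE_RATE=8000
DEEPGRAM_OUTPUT_ENCODING=linear16
DEEPGRAM_OUTPUT_SAMPLE_RATE=8000
DEEPGRAM_LANGUAGE=en
DEEPGRAM_ASR_MODEL=nova-3
DEEPGRAM_THINK_PROVIDER=open_ai
DEEPGRAM_THINK_MODEL=gpt-4o-mini
DEEPGRAM_SPEAK_PROVIDER=deepgram
DEEPGRAM_TTS_MODEL=aura-2-thalia-en
```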

4. Running the Application

Start the application by running the following command:

```bash
node index.js
```

The server will start on the port defined in the environment variable (default: 6033).

How It Works

The Agent Voice Response system integrates with Deepgram's Speech-to-Speech API to provide intelligent audio-based responses to user queries. The server receives audio input from users, forwards it to Deepgram's API, and then returns the model's response as audio in real-time using WebSocket communication.

Key Components

  • Express.js Server: Handles incoming audio streams from clients
  • WebSocket Communication: Manages real-time communication with Deepgram's API
  • Audio Processing: Handles audio format conversion and streaming
  • Real-time Streaming: Processes and streams audio data in real-time
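The wire protocol is defined by Deepgram's Voice Agent API. As a rough illustration of how the environment variables map onto an agent configuration, here is a hypothetical helper (the message shape below is an assumption based on Deepgram's public Voice Agent docs, not code copied from this repository):

```javascript
// Hypothetical: assemble the configuration payload sent to Deepgram once the
// WebSocket opens. Field names follow Deepgram's Voice Agent "Settings"
// message as an assumption; check the official docs before relying on them.
function buildAgentSettings(env) {
  const sampleRate = Number(env.DEEPGRAM_SAMPLE_RATE || 8000);
  return {
    type: "Settings",
    audio: {
      input: {
        encoding: env.DEEPGRAM_INPUT_ENCODING || "linear16",
        sample_rate: Number(env.DEEPGRAM_INPUT_SAMPLE_RATE || sampleRate),
      },
      output: {
        encoding: env.DEEPGRAM_OUTPUT_ENCODING || "linear16",
        sample_rate: Number(env.DEEPGRAM_OUTPUT_SAMPLE_RATE || sampleRate),
        container: env.DEEPGRAM_OUTPUT_CONTAINER || "none",
      },
    },
    agent: {
      language: env.DEEPGRAM_LANGUAGE || "en",
      listen: {
        provider: {
          type: env.DEEPGRAM_LISTEN_PROVIDER || "deepgram",
          model: env.DEEPGRAM_ASR_MODEL || "nova-3",
        },
      },
      think: {
        provider: {
          type: env.DEEPGRAM_THINK_PROVIDER || "open_ai",
          model: env.DEEPGRAM_THINK_MODEL || "gpt-4o-mini",
        },
        prompt: env.AGENT_PROMPT,
      },
      speak: {
        provider: {
          type: env.DEEPGRAM_SPEAK_PROVIDER || "deepgram",
          model: env.DEEPGRAM_TTS_MODEL || "aura-2-thalia-en",
        },
      },
      greeting: env.DEEPGRAM_GREETING,
    },
  };
}

// Example: only the required variables set, everything else falls back.
const settings = buildAgentSettings({
  AGENT_PROMPT: "You are a concise phone assistant.",
});
console.log(settings.audio.input.sample_rate); // 8000
```

Note how `DEEPGRAM_SAMPLE_RATE` acts only as a fallback: the input/output-specific variables win when set.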

Audio Processing

The application is configured to work with:

  • Input Audio: 16-bit PCM at 8kHz
  • Output Audio: 16-bit PCM at 8kHz
  • Encoding: Linear16 format
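At these settings the data rate is easy to reason about: 8,000 samples/s × 2 bytes/sample = 16,000 bytes per second, so a typical 20 ms telephony frame is 320 bytes. A quick sanity check (plain arithmetic, not code from this repository):

```javascript
// Bytes per second for mono 16-bit PCM at a given sample rate.
const sampleRate = 8000;     // Hz
const bytesPerSample = 2;    // 16-bit PCM
const bytesPerSecond = sampleRate * bytesPerSample;       // 16000
const frameMs = 20;          // a common telephony frame duration
const bytesPerFrame = (bytesPerSecond * frameMs) / 1000;  // 320
console.log(bytesPerSecond, bytesPerFrame); // 16000 320
```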

API Endpoints

POST `/speech-to-speech-stream`

This endpoint accepts an audio stream and returns a streamed audio response generated by Deepgram.

Request:

  • Content-Type: audio/x-raw
  • Format: 16-bit PCM at 8kHz
  • Method: POST

Response:

  • Content-Type: text/event-stream
  • Format: 16-bit PCM at 8kHz
  • Streamed audio data in real-time

Customizing the Application

See the Environment Variables section above for the full list of configurable options.

Error Handling

The application includes comprehensive error handling for:

  • WebSocket connection issues
  • Audio processing errors
  • Deepgram API errors
  • Stream processing errors

All errors are logged to the console and appropriate error messages are returned to the client.

Contributors

We would like to express our gratitude to all the contributors who have helped make this project possible.

Support & Community

Support AVR

AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.

Support us on Ko-fi

License

MIT License - see the LICENSE file for details.
