This repository showcases the integration between Agent Voice Response (AVR) and the Gemini Live Speech-to-Speech API. The application streams the caller's audio to Gemini and returns the model's context-aware spoken responses in real time, converting between audio sample rates automatically.
- Real-time Streaming: WebSocket-based audio streaming with Gemini Live API
- Audio Format Conversion: Automatic conversion between 8kHz, 16kHz, and 24kHz sample rates
- Gemini Integration: Direct integration with Google's Gemini Live Speech-to-Speech API
- Configurable Audio Settings: Customizable sample rates and buffer sizes
- Session Management: Efficient WebSocket session handling for continuous audio streaming
Copy .env.example to .env and configure the following variables:
Required:
GEMINI_API_KEY: Google API key with access to Gemini Live
Optional:
PORT (default: 6037)
GEMINI_MODEL (default: gemini-2.5-flash-preview-native-audio-dialog)
GEMINI_INSTRUCTIONS (system prompt)
# Choose one of the following instruction loading methods:
GEMINI_INSTRUCTIONS="You are a helpful assistant that can answer questions and help with tasks." # Method 1: Direct variable
#GEMINI_URL_INSTRUCTIONS="https://your-api.com/instructions" # Method 2: Web service
#GEMINI_FILE_INSTRUCTIONS="./instructions.txt" # Method 3: Local file
We’ve added support for the following Gemini settings:
GEMINI_THINKING_LEVEL=MINIMAL
GEMINI_THINKING_BUDGET=0
More details here 👉 https://ai.google.dev/gemini-api/docs/thinking?hl=en
Supported values for GEMINI_THINKING_LEVEL:
- THINKING_LEVEL_UNSPECIFIED
- LOW
- MEDIUM
- HIGH
- MINIMAL
GEMINI_THINKING_BUDGET:
- 0 → turn off thinking
- -1 → enable dynamic thinking
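The variables above could be read and validated at startup along these lines. This is a minimal sketch, not the project's actual code: the `loadConfig` name and the shape of the returned object are assumptions made for illustration, and how the thinking settings are passed on to Gemini is not shown.

```javascript
// Hypothetical startup helper; names and return shape are illustrative only.
const THINKING_LEVELS = ['THINKING_LEVEL_UNSPECIFIED', 'LOW', 'MEDIUM', 'HIGH', 'MINIMAL'];

function loadConfig(env = process.env) {
  // GEMINI_API_KEY is the only required variable.
  if (!env.GEMINI_API_KEY) {
    throw new Error('GEMINI_API_KEY is required');
  }

  const config = {
    apiKey: env.GEMINI_API_KEY,
    port: Number(env.PORT) || 6037, // documented default
    model: env.GEMINI_MODEL || 'gemini-2.5-flash-preview-native-audio-dialog',
    thinking: {},
  };

  // Accept only the documented thinking levels.
  if (env.GEMINI_THINKING_LEVEL) {
    const level = env.GEMINI_THINKING_LEVEL.toUpperCase();
    if (!THINKING_LEVELS.includes(level)) {
      throw new Error(`Unsupported GEMINI_THINKING_LEVEL: ${env.GEMINI_THINKING_LEVEL}`);
    }
    config.thinking.level = level;
  }

  // 0 turns thinking off, -1 enables dynamic thinking.
  if (env.GEMINI_THINKING_BUDGET !== undefined && env.GEMINI_THINKING_BUDGET !== '') {
    const budget = Number.parseInt(env.GEMINI_THINKING_BUDGET, 10);
    if (Number.isNaN(budget)) {
      throw new Error('GEMINI_THINKING_BUDGET must be an integer');
    }
    config.thinking.budget = budget;
  }

  return config;
}
```

Failing fast on a missing key or an unknown level surfaces configuration mistakes before any call is accepted.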
npm install
npm start
Send a POST request to /speech-to-speech-stream with:
- Headers:
  - x-uuid: Unique identifier for the call
  - Content-Type: audio/pcm or appropriate audio format
- Body: Raw audio data stream (8kHz PCM recommended)
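A client request matching the description above might be assembled as follows. This is a sketch under assumptions: the `buildStreamRequest` helper is hypothetical, the service is assumed to run locally on the default port 6037, and the audio is passed as a single buffer rather than a live stream.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical client helper; the base URL and function name are assumptions.
function buildStreamRequest(audio, { baseUrl = 'http://localhost:6037', uuid = randomUUID() } = {}) {
  return {
    url: new URL('/speech-to-speech-stream', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        'x-uuid': uuid,              // unique identifier for the call
        'Content-Type': 'audio/pcm', // raw PCM, 8kHz recommended
      },
      body: audio,
    },
  };
}

// Usage (requires the service to be running, Node 18+ for global fetch):
//   const { url, options } = buildStreamRequest(pcmBuffer);
//   const res = await fetch(url, options);
//   for await (const chunk of res.body) { /* play or forward each audio chunk */ }
```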
The service returns a stream of audio data from Gemini. The response includes:
- Real-time audio chunks from the AI
- Audio format conversion (8kHz ↔ 16kHz ↔ 24kHz)
- Session management and WebSocket communication
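To illustrate what converting between 8kHz, 16kHz, and 24kHz involves, here is a minimal linear-interpolation resampler for 16-bit PCM samples. It is an illustrative sketch only: this README does not state which resampling method the service uses, and `resampleLinear` is a hypothetical name.

```javascript
// Illustrative resampler; not necessarily the method used by this service.
// input: Int16Array of mono 16-bit PCM samples.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return Int16Array.from(input);

  const outLength = Math.floor((input.length * toRate) / fromRate);
  const output = new Int16Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1); // clamp at the last sample
    const frac = pos - left;
    // Weighted average of the two neighbouring input samples.
    output[i] = Math.round(input[left] * (1 - frac) + input[right] * frac);
  }
  return output;
}

// Upsampling 8kHz to 16kHz doubles the number of samples.
const up = resampleLinear(Int16Array.from([0, 100, 200, 300]), 8000, 16000);
console.log(Array.from(up)); // [0, 50, 100, 150, 200, 250, 300, 300]
```

Linear interpolation is cheap but applies no low-pass filtering, so downsampling (for example 24kHz to 8kHz) can introduce aliasing; a production converter would typically filter first.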
The application supports three different methods for loading AI instructions, with a specific priority order:
Set the GEMINI_INSTRUCTIONS environment variable with your custom instructions:
GEMINI_INSTRUCTIONS="You are a specialized customer service agent for a tech company. Always be polite and helpful."
If no environment variable is set, the application can fetch instructions from a web service using the GEMINI_URL_INSTRUCTIONS environment variable:
GEMINI_URL_INSTRUCTIONS="https://your-api.com/instructions"
The web service should return a JSON response with a system field containing the instructions:
{
"system": "You are a helpful assistant that provides technical support."
}
The application will include the session UUID in the request headers as X-AVR-UUID for personalized instructions.
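A web service satisfying this contract could look roughly like the sketch below. Everything here is hypothetical: the per-call lookup table, the function names, and the port are illustrative, and the handler answers any HTTP method because this README does not say which one the application uses.

```javascript
const http = require('node:http');

const DEFAULT_SYSTEM = 'You are a helpful assistant that provides technical support.';

// Hypothetical per-call overrides keyed by session UUID.
const OVERRIDES = new Map([
  ['demo-call', 'You are a billing assistant. Keep answers short.'],
]);

// Builds the JSON body: an object with a "system" field.
function instructionsFor(uuid) {
  return { system: OVERRIDES.get(uuid) ?? DEFAULT_SYSTEM };
}

function createInstructionsServer() {
  return http.createServer((req, res) => {
    // Node lowercases incoming header names, so X-AVR-UUID arrives as x-avr-uuid.
    const uuid = req.headers['x-avr-uuid'];
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(instructionsFor(uuid)));
  });
}

// Usage: createInstructionsServer().listen(3000);
```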
If neither environment variable nor web service is configured, the application can load instructions from a local file using the GEMINI_FILE_INSTRUCTIONS environment variable:
GEMINI_FILE_INSTRUCTIONS="./instructions.txt"
The file should contain plain text instructions that will be used as the system prompt.
The application checks for instructions in the following order:
- Environment Variable (GEMINI_INSTRUCTIONS) - Used if set
- Web Service (GEMINI_URL_INSTRUCTIONS) - Used if environment variable is not set
- File (GEMINI_FILE_INSTRUCTIONS) - Used if neither environment variable nor web service is configured
- Default Instructions - Fallback if none of the above are available
This priority system allows for flexible configuration where you can override instructions at different levels depending on your deployment needs.
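The priority order can be sketched as a small resolver. This is an illustration under assumptions, not the project's actual code: `pickInstructionSource`, `resolveInstructions`, and the default text are hypothetical, and the sketch simply throws if the web service or file cannot be read, which may differ from the real behaviour.

```javascript
const { readFile } = require('node:fs/promises');

// Placeholder; the actual default instructions are not stated in this README.
const DEFAULT_INSTRUCTIONS = 'You are a helpful assistant.';

// Decides which source wins, following the documented priority.
function pickInstructionSource(env) {
  if (env.GEMINI_INSTRUCTIONS) return 'env';
  if (env.GEMINI_URL_INSTRUCTIONS) return 'url';
  if (env.GEMINI_FILE_INSTRUCTIONS) return 'file';
  return 'default';
}

async function resolveInstructions(uuid, env = process.env) {
  switch (pickInstructionSource(env)) {
    case 'env':
      return env.GEMINI_INSTRUCTIONS;
    case 'url': {
      // The session UUID is sent as X-AVR-UUID; the response carries a "system" field.
      const res = await fetch(env.GEMINI_URL_INSTRUCTIONS, {
        headers: { 'X-AVR-UUID': uuid },
      });
      if (!res.ok) throw new Error(`Instruction service returned ${res.status}`);
      const data = await res.json();
      return data.system;
    }
    case 'file':
      return (await readFile(env.GEMINI_FILE_INSTRUCTIONS, 'utf8')).trim();
    default:
      return DEFAULT_INSTRUCTIONS;
  }
}
```

Keeping the source selection in its own synchronous function makes the precedence easy to verify without touching the network or the file system.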
The service handles various error scenarios:
- Missing required environment variables
- Invalid API responses
- WebSocket connection failures
- Audio processing errors
- Audio format conversion issues
# Build the image
docker build -t avr-sts-gemini .
# Run with environment file
docker run --env-file .env -p 6037:6037 avr-sts-gemini
- GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
- Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
- Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
- Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.
AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.
MIT License - see the LICENSE file for details.