Agent Voice Response - Gemini Speech-to-Speech Integration

This repository integrates Agent Voice Response (AVR) with the Gemini Live Speech-to-Speech API. The application streams caller audio to a Gemini model and returns context-aware spoken responses in real time, converting between audio sample rates automatically.

Features

Real-time Streaming: WebSocket-based audio streaming with Gemini Live API
Audio Format Conversion: Automatic conversion between 8kHz, 16kHz, and 24kHz sample rates
Gemini Integration: Direct integration with Google's Gemini Live Speech-to-Speech API
Configurable Audio Settings: Customizable sample rates and buffer sizes
Session Management: Efficient WebSocket session handling for continuous audio streaming

Configuration

Environment Variables

Copy .env.example to .env and configure the following variables:

Required:

GEMINI_API_KEY: Google API key with access to Gemini Live

Optional:

PORT (default: 6037)
GEMINI_MODEL (default: gemini-2.5-flash-preview-native-audio-dialog)
GEMINI_INSTRUCTIONS (system prompt)

# Choose one of the following instruction loading methods:
GEMINI_INSTRUCTIONS="You are a helpful assistant that can answer questions and help with tasks."  # Method 1: Direct variable
#GEMINI_URL_INSTRUCTIONS="https://your-api.com/instructions"  # Method 2: Web service
#GEMINI_FILE_INSTRUCTIONS="./instructions.txt"  # Method 3: Local file
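
As an illustration of how these variables and their documented defaults might be read at startup, here is a minimal sketch. The function name loadConfig and the shape of the returned object are hypothetical and are not taken from index.js.

```javascript
// Hypothetical helper: reads the documented variables from an env-like object.
function loadConfig(env = process.env) {
  // GEMINI_API_KEY is the only required variable.
  if (!env.GEMINI_API_KEY) {
    throw new Error("GEMINI_API_KEY is required");
  }
  return {
    apiKey: env.GEMINI_API_KEY,
    // Defaults as documented above.
    port: Number(env.PORT || 6037),
    model: env.GEMINI_MODEL || "gemini-2.5-flash-preview-native-audio-dialog",
  };
}
```

Failing fast on a missing key keeps the error close to its cause instead of surfacing later as a rejected connection.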

Gemini thinking settings

The following Gemini thinking settings are supported:

GEMINI_THINKING_LEVEL=MINIMAL
GEMINI_THINKING_BUDGET=0

See the Gemini documentation for details: https://ai.google.dev/gemini-api/docs/thinking?hl=en

Supported values for GEMINI_THINKING_LEVEL:

THINKING_LEVEL_UNSPECIFIED
LOW
MEDIUM
HIGH
MINIMAL

GEMINI_THINKING_BUDGET:

0 → turn off thinking
-1 → enable dynamic thinking
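
The values above could be validated before a session starts, roughly as sketched below. The function name and the keys of the returned object are hypothetical. Accepting integers other than 0 and -1 as a budget is an assumption, and how the result is passed to the Gemini API is not shown.

```javascript
// Levels listed in this README.
const THINKING_LEVELS = [
  "THINKING_LEVEL_UNSPECIFIED",
  "LOW",
  "MEDIUM",
  "HIGH",
  "MINIMAL",
];

// Hypothetical helper: returns only the settings that are actually set.
function parseThinkingSettings(env = process.env) {
  const settings = {};
  const level = env.GEMINI_THINKING_LEVEL;
  if (level !== undefined && level !== "") {
    if (!THINKING_LEVELS.includes(level)) {
      throw new Error(`Unsupported GEMINI_THINKING_LEVEL: ${level}`);
    }
    settings.thinkingLevel = level;
  }
  const rawBudget = env.GEMINI_THINKING_BUDGET;
  if (rawBudget !== undefined && rawBudget !== "") {
    const budget = Number(rawBudget);
    // 0 turns thinking off, -1 enables dynamic thinking.
    // Assumption: any other non-negative integer is also accepted.
    if (!Number.isInteger(budget) || budget < -1) {
      throw new Error(`Invalid GEMINI_THINKING_BUDGET: ${rawBudget}`);
    }
    settings.thinkingBudget = budget;
  }
  return settings;
}
```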

Usage

Starting the Server

npm install
npm start

Making Requests

Send a POST request to /speech-to-speech-stream with:

Headers:
- x-uuid: Unique identifier for the call
- Content-Type: audio/pcm, or the MIME type that matches the audio you send
Body: Raw audio data stream (8kHz PCM recommended)
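
A client for this endpoint might look like the sketch below, which uses Node's built-in http module. The host localhost and port 6037 assume a local server on the default port. The helper names buildRequestOptions and streamCall are hypothetical.

```javascript
const http = require("http");
const { randomUUID } = require("crypto");

// Builds the request described above. Host and port are assumptions
// (a local server listening on the default PORT).
function buildRequestOptions(uuid, host = "localhost", port = 6037) {
  return {
    host,
    port,
    method: "POST",
    path: "/speech-to-speech-stream",
    headers: { "x-uuid": uuid, "Content-Type": "audio/pcm" },
  };
}

// Pipes a readable stream of raw PCM audio to the service and passes
// each response chunk to onAudio as it arrives.
function streamCall(audioStream, onAudio, uuid = randomUUID()) {
  const req = http.request(buildRequestOptions(uuid), (res) => {
    res.on("data", onAudio);
  });
  req.on("error", (err) => console.error("request failed:", err.message));
  audioStream.pipe(req);
  return req;
}
```

For example, streamCall(fs.createReadStream("call.raw"), (chunk) => speaker.write(chunk)) would send a recorded file; both call.raw and speaker are placeholders.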

API Response

The service returns a stream of audio data from Gemini. While handling a request, it:

Streams real-time audio chunks from the AI
Converts audio between sample rates (8kHz ↔ 16kHz ↔ 24kHz)
Manages the session and the WebSocket connection to Gemini
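
This README does not describe how the sample rate conversion is implemented. As a rough illustration of the idea only, the sketch below resamples 16-bit PCM by linear interpolation. The function name is hypothetical, and a production converter would normally add a low-pass filter before downsampling to avoid aliasing.

```javascript
// Illustrative only: resamples 16-bit PCM samples by linear interpolation.
// No anti-aliasing filter is applied when downsampling.
function resamplePcm16(input, fromRate, toRate) {
  if (fromRate === toRate) return Int16Array.from(input);
  const outLength = Math.round((input.length * toRate) / fromRate);
  const output = new Int16Array(outLength);
  const last = input.length - 1;
  for (let i = 0; i < outLength; i++) {
    // Position of output sample i on the input time axis.
    const srcPos = (i * fromRate) / toRate;
    const i0 = Math.min(Math.floor(srcPos), last);
    const i1 = Math.min(i0 + 1, last);
    const frac = srcPos - i0;
    output[i] = Math.round(input[i0] + (input[i1] - input[i0]) * frac);
  }
  return output;
}

// 8kHz to 16kHz doubles the sample count.
console.log(Array.from(resamplePcm16(Int16Array.from([0, 100]), 8000, 16000)));
// prints [ 0, 50, 100, 100 ]
```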

Instruction Loading Methods

The application supports three different methods for loading AI instructions, with a specific priority order:

1. Environment Variable (Highest Priority)

Set the GEMINI_INSTRUCTIONS environment variable with your custom instructions:

GEMINI_INSTRUCTIONS="You are a specialized customer service agent for a tech company. Always be polite and helpful."

2. Web Service (Medium Priority)

If GEMINI_INSTRUCTIONS is not set, the application can fetch instructions from a web service whose URL is given in the GEMINI_URL_INSTRUCTIONS environment variable:

GEMINI_URL_INSTRUCTIONS="https://your-api.com/instructions"

The web service should return a JSON response with a system field containing the instructions:

{
  "system": "You are a helpful assistant that provides technical support."
}

The application sends the session UUID in the X-AVR-UUID request header, so the web service can return instructions tailored to each call.
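
The exchange with the web service could be sketched as follows. The function name fetchInstructions is hypothetical, the use of a GET request is an assumption, and the fetch implementation is injectable so the logic can be exercised without a network (the global fetch requires Node 18 or later).

```javascript
// Hypothetical helper: requests instructions for one call.
// Assumption: the service is queried with a plain GET request.
async function fetchInstructions(url, uuid, fetchImpl = fetch) {
  const res = await fetchImpl(url, { headers: { "X-AVR-UUID": uuid } });
  if (!res.ok) {
    throw new Error(`instructions request failed with status ${res.status}`);
  }
  const body = await res.json();
  // The documented response shape is { "system": "..." }.
  if (typeof body.system !== "string") {
    throw new Error('instructions response has no "system" field');
  }
  return body.system;
}
```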

3. File (Lowest Priority)

If neither GEMINI_INSTRUCTIONS nor GEMINI_URL_INSTRUCTIONS is set, the application can load instructions from a local file whose path is given in the GEMINI_FILE_INSTRUCTIONS environment variable:

GEMINI_FILE_INSTRUCTIONS="./instructions.txt"

The file should contain plain text instructions that will be used as the system prompt.

Priority Order

The application checks for instructions in the following order:

1. Environment Variable (GEMINI_INSTRUCTIONS) - Used if set
2. Web Service (GEMINI_URL_INSTRUCTIONS) - Used if environment variable is not set
3. File (GEMINI_FILE_INSTRUCTIONS) - Used if neither environment variable nor web service is configured
4. Default Instructions - Fallback if none of the above are available

This priority system allows for flexible configuration where you can override instructions at different levels depending on your deployment needs.
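
The priority order can be sketched as a single resolver function. This is a sketch under stated assumptions, not the code in index.js: the names resolveInstructions, loaders.fromUrl and loaders.fromFile are hypothetical, the default text is a placeholder, and falling back to the default when a configured source fails is an assumption.

```javascript
// Placeholder text; the project's real default instructions are not shown here.
const DEFAULT_INSTRUCTIONS = "You are a helpful assistant.";

// loaders.fromUrl and loaders.fromFile are hypothetical async functions
// that return the instruction text for a URL or a file path.
async function resolveInstructions(env, loaders) {
  // 1. Direct variable wins when set.
  if (env.GEMINI_INSTRUCTIONS) return env.GEMINI_INSTRUCTIONS;
  try {
    // 2. Web service, only consulted when the direct variable is unset.
    if (env.GEMINI_URL_INSTRUCTIONS) {
      return await loaders.fromUrl(env.GEMINI_URL_INSTRUCTIONS);
    }
    // 3. Local file, only consulted when neither of the above is configured.
    if (env.GEMINI_FILE_INSTRUCTIONS) {
      return await loaders.fromFile(env.GEMINI_FILE_INSTRUCTIONS);
    }
  } catch (err) {
    // Assumption: a failing source falls through to the default.
    console.error("could not load instructions:", err.message);
  }
  // 4. Fallback.
  return DEFAULT_INSTRUCTIONS;
}
```

Passing the loaders in as arguments keeps the ordering logic separate from network and file access, which makes it easy to test.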

Error Handling

The service handles various error scenarios:

Missing required environment variables
Invalid API responses
WebSocket connection failures
Audio processing errors
Audio format conversion issues

Docker Support

# Build the image
docker build -t avr-sts-gemini .

# Run with environment file
docker run --env-file .env -p 6037:6037 avr-sts-gemini

Support & Community

GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.

Support AVR

AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.

License

MIT License - see the LICENSE.md file for details.
