This repository provides a speech-to-text transcription service using Google's Gemini API, integrated with the Agent Voice Response (AVR) system. The code sets up an Express.js server that accepts audio input from the Agent Voice Response Core, transcribes it using the Gemini API, and returns the transcription to the Core.
Before setting up the project, ensure you have the following:
- Node.js and npm installed.
- A Google Cloud account with the Vertex AI API enabled (which includes Gemini models). You can enable it in the Google Cloud Console.
- A Google Cloud API Key with the necessary permissions to access the Vertex AI API. You can create an API key by following the instructions in the Google Cloud documentation. Ensure the key is restricted appropriately for security. Alternatively, you can obtain an API key directly from Google AI Studio.
```bash
git clone https://github.com/agentvoiceresponse/avr-stt-gemini.git
cd avr-stt-gemini
npm install
```

Create a `.env` file in the root directory of the project by copying the `.env.example` file:

```bash
cp .env.example .env
```

Then, edit the `.env` file and replace `xxxx` with the Google Cloud API key you obtained in the prerequisites step.
Ensure that you have the following environment variables set in your .env file:
```
PORT=6021
GEMINI_API_KEY=your_google_api_key_here # Replace with your actual API key
```
You can adjust the port number as needed.
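As a sketch of how the port fallback might work (the helper name `resolvePort` is illustrative, not from the repository):

```javascript
// Fall back to 6021 (the default from .env.example) when PORT is unset
// or not a valid number. `resolvePort` is an illustrative helper name.
function resolvePort(env) {
  const parsed = Number.parseInt(env.PORT ?? "", 10);
  return Number.isNaN(parsed) ? 6021 : parsed;
}

console.log(resolvePort({ PORT: "8080" })); // → 8080
console.log(resolvePort({}));               // → 6021
```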
This application sets up an Express.js server that accepts audio input from clients and uses the Gemini API to transcribe the audio. The transcribed text is then returned to the Agent Voice Response Core. Below is an overview of the core components:
The server listens for audio input on the `/transcribe` route and passes the audio to the Gemini API for transcription.
The application processes the incoming audio data and prepares it for the Gemini API, ensuring it meets the required format and specifications.
The API processes the audio data received from the client and converts it into text using Gemini's advanced speech recognition capabilities.
This route accepts audio input from the client and transmits it for transcription. The transcription is sent back to the client once processing is complete.
Here's a high-level breakdown of the key parts of the code:
- **Server Setup**: Configures the Express.js server and the Gemini API integration.
- **Audio Processing**: A function that handles the incoming audio from clients. It:
  - Processes the audio data to meet Gemini API requirements
  - Sends the audio to the Gemini API for transcription
  - Returns the transcribed text to the client
- **Express.js Route**: The `/transcribe` route processes the audio input and returns the transcription.
To start the application:
```bash
npm run start
```

or

```bash
npm run start:dev
```

The server will start and listen on the port specified in the `.env` file, defaulting to `6021`.
You can send audio data to the `/transcribe` endpoint using a client that can send audio files or streams. Ensure that the audio format is compatible with the Gemini API requirements.