rameshagowda/multimodal
Multimodal Demo/PoC

A streamlined multimodal AI application showcasing OpenAI's latest API capabilities, including text chat, image analysis, audio transcription, and realtime voice conversation.

Features

  • Text Response: Interactive chat interface powered by GPT-4o-mini
  • Image Analysis: Upload and analyze images using GPT-4o-mini Vision
  • Audio Transcription: Upload audio files for transcription using Whisper
  • Voice Conversation (Realtime): Full-duplex audio conversation with GPT-4o-realtime (integrated directly in Streamlit)

Tech Stack

  • Frontend: Streamlit with embedded WebRTC component for realtime audio
  • Backend: FastAPI with OpenAI API integration
  • APIs: OpenAI (GPT-4o-mini, GPT-4o-realtime, Whisper)
  • Audio Handling: WebRTC for realtime voice conversation
  • Image Processing: Streamlit file uploader and camera input

Project Structure

multimodal/
├── backend/                    # FastAPI backend services
│   ├── main.py                # Main FastAPI application with all endpoints
│   ├── requirements.txt       # Backend dependencies
│   └── assets/                # Sample assets for testing
│       ├── assistant-audio.wav
│       └── user-image.png
├── frontend/                   # Streamlit frontend application
│   ├── app.py                # Main Streamlit app (5 tabs)
│   ├── components/           # Reusable UI components
│   │   └── realtime_conversation.html  # Embedded WebRTC component
│   └── requirements.txt      # Frontend dependencies
└── README.md                 # This documentation

Prerequisites

  • Python 3.8+
  • OpenAI API key with access to:
    • GPT-4o-mini
    • GPT-4o-realtime-preview
    • Whisper (gpt-4o-transcribe)
  • Modern web browser with WebRTC support
  • Microphone and speakers for voice features

Setup

  1. Navigate to the multimodal directory:

     cd multimodal

  2. Set up environment variables:

     export OPENAI_API_KEY="your_api_key_here"

  3. Install backend dependencies:

     cd backend
     pip install -r requirements.txt

  4. Install frontend dependencies:

     cd ../frontend
     pip install -r requirements.txt

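Because the backend reads OPENAI_API_KEY from the environment, it is worth failing fast if the variable is missing. A minimal sketch of such a startup check (the helper name `require_api_key` is illustrative, not taken from backend/main.py):

```python
import os

def require_api_key() -> str:
    """Return the OpenAI API key, failing fast with a clear message if unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; run "
            'export OPENAI_API_KEY="your_api_key_here" before starting the backend.'
        )
    return key
```

Calling this once at module import time surfaces a misconfigured environment immediately, instead of as a cryptic 401 on the first API call.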
Running the Application

  1. Start the backend server:

     cd backend
     uvicorn main:app --reload --host 0.0.0.0 --port 8000

  2. In a new terminal, start the frontend:

     cd frontend
     streamlit run app.py --server.port 8501

  3. Open your browser and navigate to http://localhost:8501

Usage

Text Response

  • Type your message in the text area
  • Click "Submit" to get AI-powered responses from GPT-4o-mini
  • View the conversation history

Image Analysis

  • Upload an image (JPG, JPEG, or PNG) and enter a prompt
  • Get detailed visual analysis using GPT-4o-mini Vision
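Vision-capable chat models such as GPT-4o-mini accept images as base64-encoded data URLs, which is how an uploaded file is typically prepared before being sent. A sketch of that step (the helper name is illustrative, not from this codebase):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL suitable for a vision prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string goes into the image part of a chat message alongside the user's text prompt.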

Voice Conversation (Realtime)

  • Navigate to the "Voice Conversation (Realtime)" tab
  • Click "Start Session" to begin
  • Allow microphone access when prompted
  • Speak naturally or type messages
  • Click "Stop Session" when done
  • Features full-duplex audio conversation with GPT-4o-realtime

Audio Transcription

  • Upload audio files (WAV, MP3, M4A) for transcription
  • Get accurate transcriptions using Whisper
  • View transcription results as soon as processing completes

API Endpoints

Backend (FastAPI)

  • POST /generate-response: Text chat with GPT-4o-mini
  • POST /analyze-image: Image analysis with GPT-4o-mini Vision
  • POST /transcribe-audio: Audio transcription with Whisper
  • GET /token: Generate session tokens for realtime API
  • GET /docs: API documentation (auto-generated by FastAPI)
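The endpoints can also be exercised outside the Streamlit UI. Below is a hedged sketch of building a request to POST /generate-response with the standard library; the JSON field name `prompt` is an assumption about the backend's request schema, so check /docs for the actual model:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # backend from the Running section

def build_chat_request(prompt: str) -> request.Request:
    """Build (but do not send) a POST to the /generate-response endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return request.Request(
        f"{BASE_URL}/generate-response",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (requires the backend to be running):
# with request.urlopen(build_chat_request("Hello!")) as resp:
#     print(resp.read().decode())
```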

Sample Assets

The backend includes sample assets for testing:

  • backend/assets/user-image.png: Sample image for testing image analysis
  • backend/assets/assistant-audio.wav: Sample audio file for testing transcription

Architecture

  • Frontend: Streamlit with embedded HTML components
  • Backend: FastAPI serving as pure API server
  • Realtime: WebRTC component embedded in Streamlit
  • Communication: Direct API calls from frontend to backend
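The realtime flow hinges on GET /token: the backend mints a short-lived session token from OpenAI's realtime sessions endpoint so the browser's WebRTC component never handles the long-lived API key. A sketch of building that upstream request (the exact body fields accepted by the sessions endpoint may differ; treat this as an assumption, not the backend's actual implementation):

```python
import json
from urllib import request

SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"

def build_session_request(
    api_key: str, model: str = "gpt-4o-realtime-preview"
) -> request.Request:
    """Build the request that asks OpenAI for an ephemeral realtime session."""
    body = json.dumps({"model": model}).encode("utf-8")
    return request.Request(
        SESSIONS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The ephemeral token in the response is what /token returns to the embedded WebRTC component, which then negotiates audio directly with OpenAI.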

Development

Adding New Features

  1. Backend: Add new endpoints in backend/main.py
  2. Frontend: Create new tabs in frontend/app.py

Testing

  • Manual testing through the Streamlit interface
  • API testing through FastAPI's auto-generated docs at /docs

Related Projects

  • globomantics-eval: Separate evaluation framework for automated testing
  • globomantics-lite: Simplified version with basic functionality

License

MIT License

Acknowledgments

  • OpenAI for providing the multimodal APIs
  • Streamlit for the frontend framework
  • FastAPI for the backend framework
  • WebRTC for realtime audio capabilities
