A streamlined multimodal AI application showcasing OpenAI's latest API capabilities, including text chat, image analysis, audio transcription, and realtime voice conversation.
- Text Response: Interactive chat interface powered by GPT-4o-mini
- Image Analysis: Upload and analyze images using GPT-4o-mini Vision
- Audio Transcription: Upload audio files for transcription using Whisper
- Voice Conversation (Realtime): Full-duplex audio conversation with GPT-4o-realtime (integrated directly in Streamlit)
- Frontend: Streamlit with embedded WebRTC component for realtime audio
- Backend: FastAPI with OpenAI API integration
- APIs: OpenAI (GPT-4o-mini, GPT-4o-realtime, Whisper)
- Audio Handling: WebRTC for realtime voice conversation
- Image Processing: Streamlit file uploader and camera input
multimodal/
├── backend/ # FastAPI backend services
│ ├── main.py # Main FastAPI application with all endpoints
│ ├── requirements.txt # Backend dependencies
│ └── assets/ # Sample assets for testing
│ ├── assistant-audio.wav
│ └── user-image.png
├── frontend/ # Streamlit frontend application
│ ├── app.py # Main Streamlit app (5 tabs)
│ ├── components/ # Reusable UI components
│ │ └── realtime_conversation.html # Embedded WebRTC component
│ └── requirements.txt # Frontend dependencies
└── README.md # This documentation
- Python 3.8+
- OpenAI API key with access to:
  - GPT-4o-mini
  - GPT-4o-realtime-preview
  - Whisper (gpt-4o-transcribe)
- Modern web browser with WebRTC support
- Microphone and speakers for voice features
- Navigate to the multimodal directory:

  ```bash
  cd multimodal
  ```

- Set up environment variables:

  ```bash
  export OPENAI_API_KEY="your_api_key_here"
  ```

- Install backend dependencies:

  ```bash
  cd backend
  pip install -r requirements.txt
  ```

- Install frontend dependencies:

  ```bash
  cd ../frontend
  pip install -r requirements.txt
  ```

- Start the backend server:

  ```bash
  cd backend
  uvicorn main:app --reload --host 0.0.0.0 --port 8000
  ```

- In a new terminal, start the frontend:

  ```bash
  cd frontend
  streamlit run app.py --server.port 8501
  ```

- Open your browser and navigate to http://localhost:8501
- Type your message in the text area
- Click "Submit" to get AI-powered responses from GPT-4o-mini
- View the conversation history
- Upload an image (JPG, JPEG, or PNG) and enter a prompt
- Get detailed visual analysis using GPT-4o-mini Vision
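Under the hood, an uploaded image is typically inlined into the chat request as a base64 data URL. The sketch below shows that packaging step for the OpenAI chat-completions vision message format; whether the backend builds the message exactly this way is an assumption, not taken from `backend/main.py`.

```python
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL, the form the
    chat-completions API accepts for inline images."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def build_vision_messages(prompt: str, image_bytes: bytes) -> list:
    # Message shape for a chat-completions request with an image part.
    # The backend may construct this differently; this is illustrative.
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url(image_bytes)}},
        ],
    }]
```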
- Navigate to the "Voice Conversation (Realtime)" tab
- Click "Start Session" to begin
- Allow microphone access when prompted
- Speak naturally or type messages
- Click "Stop Session" when done
- Features full-duplex audio conversation with GPT-4o-realtime
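Behind "Start Session", the backend's `GET /token` endpoint typically mints an ephemeral session key by POSTing to OpenAI's realtime sessions endpoint. A minimal sketch of that request body follows; the model name comes from this README, while the voice and field names are illustrative assumptions.

```python
OPENAI_SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"

def build_realtime_session_request(voice: str = "alloy") -> dict:
    """JSON body for minting an ephemeral realtime session token.
    The voice name "alloy" is an illustrative default."""
    return {"model": "gpt-4o-realtime-preview", "voice": voice}

if __name__ == "__main__":
    # The backend would POST this body to OPENAI_SESSIONS_URL with an
    # "Authorization: Bearer $OPENAI_API_KEY" header, then hand the returned
    # ephemeral key to the browser for the WebRTC handshake.
    print(build_realtime_session_request())
```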
- Upload audio files (WAV, MP3, M4A) for transcription
- Get accurate transcriptions using Whisper
- View results in real-time
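A client can validate the file type before uploading, mirroring the formats listed above. The extension check below is a sketch; the server-side transcription call in the comment assumes the backend uses the official `openai` client, which this README does not show.

```python
from pathlib import Path

# Formats listed in this README's transcription section.
SUPPORTED_AUDIO = {".wav", ".mp3", ".m4a"}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_AUDIO

if __name__ == "__main__":
    # Server-side sketch (an assumption about the backend implementation):
    # client.audio.transcriptions.create(model="whisper-1", file=open(path, "rb"))
    print(is_supported_audio("clip.wav"))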
- `POST /generate-response`: Text chat with GPT-4o-mini
- `POST /analyze-image`: Image analysis with GPT-4o-mini Vision
- `POST /transcribe-audio`: Audio transcription with Whisper
- `GET /token`: Generate session tokens for realtime API
- `GET /docs`: API documentation (auto-generated by FastAPI)
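With the backend running on the default port from the setup steps, the text endpoint can be exercised from Python. The request field name `prompt` is an assumption about the backend's schema, not taken from `backend/main.py`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the setup steps

def build_chat_payload(prompt: str) -> dict:
    # Field name "prompt" is an assumption about the request schema.
    return {"prompt": prompt}

def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(post_json(f"{BASE_URL}/generate-response",
                    build_chat_payload("Hello from the README example!")))
```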
The backend includes sample assets for testing:
- `backend/assets/user-image.png`: Sample image for testing image analysis
- `backend/assets/assistant-audio.wav`: Sample audio file for testing transcription
- Frontend: Streamlit with embedded HTML components
- Backend: FastAPI serving as pure API server
- Realtime: WebRTC component embedded in Streamlit
- Communication: Direct API calls from frontend to backend
- Backend: Add new endpoints in `backend/main.py`
- Frontend: Create new tabs in `frontend/app.py`
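A new backend feature usually pairs a route in `backend/main.py` with a helper that keeps the logic unit-testable. The route and helper below are hypothetical examples, not part of the existing codebase; `app` refers to the FastAPI instance already defined in `main.py`.

```python
# In backend/main.py -- "app" is the existing FastAPI instance.
# The /summarize route and its logic are hypothetical examples.

def summarize_text(text: str, max_words: int = 20) -> str:
    """Pure helper so the logic can be tested without a running server."""
    words = text.split()
    out = " ".join(words[:max_words])
    return out + ("..." if len(words) > max_words else "")

# @app.post("/summarize")
# def summarize(body: dict):
#     return {"summary": summarize_text(body["text"])}
```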
- Manual testing through the Streamlit interface
- API testing through FastAPI's auto-generated docs at `/docs`
- globomantics-eval: Separate evaluation framework for automated testing
- globomantics-lite: Simplified version with basic functionality
MIT License
