A streamlined multimodal AI application showcasing OpenAI's latest API capabilities, including text chat, image analysis, audio transcription, and realtime voice conversation.
- Text Response: Interactive chat interface powered by GPT-4o-mini
- Image Analysis: Upload and analyze images using GPT-4o-mini Vision
- Audio Transcription: Upload audio files for transcription using Whisper
- Voice Conversation (Realtime): Full-duplex audio conversation with GPT-4o-realtime (integrated directly in Streamlit)
- Frontend: Streamlit with embedded WebRTC component for realtime audio
- Backend: FastAPI with OpenAI API integration
- APIs: OpenAI (GPT-4o-mini, GPT-4o-realtime, Whisper)
- Audio Handling: WebRTC for realtime voice conversation
- Image Processing: Streamlit file uploader and camera input
multimodal/
├── backend/ # FastAPI backend services
│ ├── main.py # Main FastAPI application with all endpoints
│ ├── requirements.txt # Backend dependencies
│ └── assets/ # Sample assets for testing
│ ├── assistant-audio.wav
│ └── user-image.png
├── frontend/ # Streamlit frontend application
│ ├── app.py # Main Streamlit app (5 tabs)
│ ├── components/ # Reusable UI components
│ │ └── realtime_conversation.html # Embedded WebRTC component
│ └── requirements.txt # Frontend dependencies
└── README.md # This documentation
- Python 3.8+
- OpenAI API key with access to:
  - GPT-4o-mini
  - GPT-4o-realtime-preview
  - Whisper (gpt-4o-transcribe)
- Modern web browser with WebRTC support
- Microphone and speakers for voice features
- Navigate to the multimodal directory:

  ```bash
  cd multimodal
  ```

- Set up environment variables:

  ```bash
  export OPENAI_API_KEY="your_api_key_here"
  ```

- Install backend dependencies:

  ```bash
  cd backend
  pip install -r requirements.txt
  ```

- Install frontend dependencies:

  ```bash
  cd ../frontend
  pip install -r requirements.txt
  ```

- Start the backend server:

  ```bash
  cd backend
  uvicorn main:app --reload --host 0.0.0.0 --port 8000
  ```

- In a new terminal, start the frontend:

  ```bash
  cd frontend
  streamlit run app.py --server.port 8501
  ```

- Open your browser and navigate to http://localhost:8501
- Type your message in the text area
- Click "Submit" to get AI-powered responses from GPT-4o-mini
- View the conversation history
- Upload an image (JPG, JPEG, or PNG) and enter a prompt
- Get detailed visual analysis using GPT-4o-mini Vision
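Under the hood, an uploaded image is typically inlined into the chat request as a base64 data URL. The sketch below shows that packaging step for the OpenAI chat-completions vision message format; whether the backend builds the message exactly this way is an assumption, not taken from `backend/main.py`.

```python
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL, the form the
    chat-completions API accepts for inline images."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def build_vision_messages(prompt: str, image_bytes: bytes) -> list:
    # Message shape for a chat-completions request with an image part.
    # The backend may construct this differently; this is illustrative.
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url(image_bytes)}},
        ],
    }]
```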
- Navigate to the "Voice Conversation (Realtime)" tab
- Click "Start Session" to begin
- Allow microphone access when prompted
- Speak naturally or type messages
- Click "Stop Session" when done
- Features full-duplex audio conversation with GPT-4o-realtime
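Behind "Start Session", the backend's `GET /token` endpoint typically mints an ephemeral session key by POSTing to OpenAI's realtime sessions endpoint. A minimal sketch of that request body follows; the model name comes from this README, while the voice and field names are illustrative assumptions.

```python
OPENAI_SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"

def build_realtime_session_request(voice: str = "alloy") -> dict:
    """JSON body for minting an ephemeral realtime session token.
    The voice name "alloy" is an illustrative default."""
    return {"model": "gpt-4o-realtime-preview", "voice": voice}

if __name__ == "__main__":
    # The backend would POST this body to OPENAI_SESSIONS_URL with an
    # "Authorization: Bearer $OPENAI_API_KEY" header, then hand the returned
    # ephemeral key to the browser for the WebRTC handshake.
    print(build_realtime_session_request())
```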
- Upload audio files (WAV, MP3, M4A) for transcription
- Get accurate transcriptions using Whisper
- View results in real-time
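A client can validate the file type before uploading, mirroring the formats listed above. The extension check below is a sketch; the server-side transcription call in the comment assumes the backend uses the official `openai` client, which this README does not show.

```python
from pathlib import Path

# Formats listed in this README's transcription section.
SUPPORTED_AUDIO = {".wav", ".mp3", ".m4a"}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_AUDIO

if __name__ == "__main__":
    # Server-side sketch (an assumption about the backend implementation):
    # client.audio.transcriptions.create(model="whisper-1", file=open(path, "rb"))
    print(is_supported_audio("clip.wav"))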
- `POST /generate-response`: Text chat with GPT-4o-mini
- `POST /analyze-image`: Image analysis with GPT-4o-mini Vision
- `POST /transcribe-audio`: Audio transcription with Whisper
- `GET /token`: Generate session tokens for realtime API
- `GET /docs`: API documentation (auto-generated by FastAPI)
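With the backend running on the default port from the setup steps, the text endpoint can be exercised from Python. The request field name `prompt` is an assumption about the backend's schema, not taken from `backend/main.py`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the setup steps

def build_chat_payload(prompt: str) -> dict:
    # Field name "prompt" is an assumption about the request schema.
    return {"prompt": prompt}

def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(post_json(f"{BASE_URL}/generate-response",
                    build_chat_payload("Hello from the README example!")))
```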
The backend includes sample assets for testing:
- `backend/assets/user-image.png`: Sample image for testing image analysis
- `backend/assets/assistant-audio.wav`: Sample audio file for testing transcription
- Frontend: Streamlit with embedded HTML components
- Backend: FastAPI serving as pure API server
- Realtime: WebRTC component embedded in Streamlit
- Communication: Direct API calls from frontend to backend
- Backend: Add new endpoints in `backend/main.py`
- Frontend: Create new tabs in `frontend/app.py`
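A new backend feature usually pairs a route in `backend/main.py` with a helper that keeps the logic unit-testable. The route and helper below are hypothetical examples, not part of the existing codebase; `app` refers to the FastAPI instance already defined in `main.py`.

```python
# In backend/main.py -- "app" is the existing FastAPI instance.
# The /summarize route and its logic are hypothetical examples.

def summarize_text(text: str, max_words: int = 20) -> str:
    """Pure helper so the logic can be tested without a running server."""
    words = text.split()
    out = " ".join(words[:max_words])
    return out + ("..." if len(words) > max_words else "")

# @app.post("/summarize")
# def summarize(body: dict):
#     return {"summary": summarize_text(body["text"])}
```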
- Manual testing through the Streamlit interface
- API testing through FastAPI's auto-generated docs at `/docs`
- globomantics-eval: Separate evaluation framework for automated testing
- globomantics-lite: Simplified version with basic functionality
MIT License
