A real-time gesture-controlled audio synthesizer that uses computer vision to track hand movements and generate sound through FM synthesis. Built with Web Audio API, MediaPipe Hands, and SharedArrayBuffer for ultra-low latency performance.
This application transforms hand gestures captured via webcam into musical sound using frequency modulation (FM) synthesis. The system processes hand position, pinch gestures, and rotation in real-time to control audio parameters, creating an intuitive gesture-to-sound mapping interface.
```
┌─────────────────────────────────────────────────────┐
│ MAIN THREAD │
│ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Camera │───▶│ MediaPipe Hands │ │
│ │ (WebRTC) │ │ (WebGL accelerated) │ │
│ └─────────────┘ └───────────┬─────────────┘ │
│ │ │
│ Writes @ 30Hz│ │
│ ▼ │
│ ╔═══════════════════════════════════╗ │
│ ║ SharedArrayBuffer (32B) ║ │
│ ║ [X1,Y1,P1,R1,X2,Y2,P2,R2] ║ │
│ ╚═══════════════════════════════════╝ │
│ │
└─────────────────────────────────────────────────────┘
▲
│ Reads @ 44.1kHz
│ (lock-free)
┌────────────┴────────────┐
│ AudioWorklet │ ◀── Separate Thread
│ (FM Synthesis) │
└─────────────────────────┘
```
- Camera Capture: WebRTC captures video at 30fps
- Hand Tracking: MediaPipe Hands processes frames on the main thread using WebGL acceleration
- Gesture Analysis: Extracts wrist position (X,Y), pinch distance, and hand rotation
- Data Transfer: Writes gesture data to SharedArrayBuffer (or postMessage fallback)
- Audio Synthesis: AudioWorklet reads gesture data at 44.1kHz sample rate
- Sound Generation: FM synthesis creates audio based on gesture parameters
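A minimal sketch of steps 4-5 above, assuming the shared buffer is handed to the worklet through `processorOptions` (class and option names here are illustrative, not the project's actual source):

```javascript
// Worklet-side consumer: reads gesture values lock-free on every render quantum
class GestureSynthProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super();
    // Float32Array view over the SharedArrayBuffer created on the main thread
    this.gestures = new Float32Array(options.processorOptions.sab);
  }
  process(inputs, outputs) {
    const out = outputs[0][0];
    const [x, y, pinch, rotation] = this.gestures; // no postMessage hop
    for (let i = 0; i < out.length; i++) {
      out[i] = 0; // placeholder; the real code runs the FM synthesis shown later
    }
    return true;
  }
}
registerProcessor('gesture-synth', GestureSynthProcessor);
```

Each parameter occupies one aligned 32-bit slot, so in practice the reader never observes a partially written value, which is what makes the lock-free design workable.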
The 32-byte SharedArrayBuffer contains 8 Float32 values:
| Index | Parameter | Description |
|---|---|---|
| 0 | Hand1_X | Wrist X position (0.0-1.0) |
| 1 | Hand1_Y | Wrist Y position (0.0-1.0) |
| 2 | Hand1_Pinch | Pinch strength (0.0-1.0) |
| 3 | Hand1_Rotation | Hand rotation in radians |
| 4 | Hand2_X | Second hand X position |
| 5 | Hand2_Y | Second hand Y position |
| 6 | Hand2_Pinch | Second hand pinch strength |
| 7 | Hand2_Rotation | Second hand rotation |
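On the main thread, each MediaPipe result can be written straight into a `Float32Array` view over the shared buffer. A sketch using the layout above; the `writeHand` helper is hypothetical:

```javascript
// Allocate 8 Float32 slots (32 bytes) and view them as floats
const sab = new SharedArrayBuffer(8 * Float32Array.BYTES_PER_ELEMENT);
const view = new Float32Array(sab);

// Hypothetical per-frame writer following the index layout above
function writeHand(slot, { x, y, pinch, rotation }) {
  const base = slot * 4; // hand 1 -> indices 0-3, hand 2 -> indices 4-7
  view[base + 0] = x;
  view[base + 1] = y;
  view[base + 2] = pinch;
  view[base + 3] = rotation;
}
```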
- FM Synthesis: Frequency modulation with gesture-controlled parameters
- Multi-Timbral: Support for up to 2 simultaneous hands
- Dynamic Range: Pinch gestures control audio gain (0-25%)
- Smooth Transitions: Built-in smoothing prevents audio artifacts
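The smoothing implementation isn't reproduced here; a common approach, and a plausible reading of the bullet above, is a per-sample one-pole lowpass on each parameter (the coefficient below is an assumption):

```javascript
// One-pole parameter smoothing inside the audio callback (coefficient assumed)
const SMOOTHING = 0.002; // smaller = slower, smoother transitions at 44.1 kHz
smoothGain += (targetGain - smoothGain) * SMOOTHING;
smoothFreq += (targetFreq - smoothFreq) * SMOOTHING;
```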
- Pitch Control: Hand Y-position maps to frequency (200Hz-800Hz range)
- Modulation: Hand X-position controls modulation frequency
- Volume: Pinch gesture between thumb and index finger controls gain
- Rotation: Hand rotation angle available for future modulation
- SharedArrayBuffer: Lock-free communication between threads
- AudioWorklet: Dedicated audio processing thread at 44.1kHz
- WebGL Acceleration: GPU-accelerated computer vision
- Fallback Mode: postMessage compatibility for non-isolated contexts
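Which path is taken can be decided once at startup using the `crossOriginIsolated` global. An illustrative sketch (the node name matches the worklet sketch above, not the project's actual identifier):

```javascript
// Illustrative transport selection; `ctx` is an existing AudioContext
function createSynthNode(ctx) {
  if (globalThis.crossOriginIsolated) {
    // Zero-copy path: hand the worklet a shared buffer
    const sab = new SharedArrayBuffer(32);
    return new AudioWorkletNode(ctx, 'gesture-synth', { processorOptions: { sab } });
  }
  // Fallback: gesture frames go over the MessagePort instead (~1-3 ms extra)
  return new AudioWorkletNode(ctx, 'gesture-synth');
}
```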
- Live Video Feed: Real-time camera preview with hand landmark overlay
- Status Monitoring: Visual indicators for all system components
- Data Console: Real-time gesture parameter display
- Cross-Origin Isolation: Automatic detection and guidance
- Modern web browser with WebRTC support
- Webcam access
- HTTPS or localhost for camera permissions
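Camera capture goes through the standard `getUserMedia` API; a minimal sketch of the permission request the app needs (the constraint values are assumptions):

```javascript
// Request a 30 fps camera stream (constraint values are assumptions)
const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: 640, height: 480, frameRate: 30 },
  audio: false,
});
document.querySelector('video').srcObject = stream;
```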
1. **Clone the repository**

   ```bash
   git clone https://github.com/intermosh/kinetograf-core
   cd kinetograf-core
   ```

2. **Start the server** (choose one option)

   Option A: Node.js

   ```bash
   node server.js
   ```

   Option B: Python

   ```bash
   pip install flask
   python server.py
   ```

3. **Open in browser**
   - Navigate to `http://127.0.0.1:5500`
   - Important: use `127.0.0.1`, not `localhost`
   - Hard refresh with `Ctrl+Shift+R` (`Cmd+Shift+R` on Mac)
   - Or try incognito/private browsing mode

4. **Grant permissions**
   - Allow camera access when prompted
   - Click "▶ Initialize Engine"

5. **Start gesturing**
   - Make a pinch gesture to produce sound
   - Move hand up/down to change pitch
   - Move hand left/right for modulation
For the zero-latency (SharedArrayBuffer) mode, the application requires cross-origin isolation:

- COOP (`Cross-Origin-Opener-Policy`): `same-origin`
- COEP (`Cross-Origin-Embedder-Policy`): `credentialless`
The provided servers automatically set these headers. If SharedArrayBuffer is unavailable, the app falls back to postMessage communication with ~1-3ms additional latency.
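For reference, a stripped-down Node.js handler that sets both headers might look like this (a sketch, not the actual `server.js`):

```javascript
// Minimal Node.js server enabling cross-origin isolation (sketch only)
const http = require('http');
http.createServer((req, res) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'credentialless');
  // ... static file serving omitted ...
  res.end();
}).listen(5500, '127.0.0.1');
```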
```javascript
// FM synthesis with gesture control (runs per sample inside the AudioWorklet)
const baseFreq = 200 + (1 - handY) * 600;  // 200-800 Hz from Y position
const modFreq = 2 + handX * 8;             // 2-10 Hz modulation from X
const gain = handPinch * 0.25;             // 0-25% gain from pinch

// Generate one sample
const modulator = Math.sin(phase2 * 2 * Math.PI);     // modulator oscillator
const freqMod = baseFreq + modulator * 30 * handX;    // modulated carrier frequency
const sample = Math.sin(phase * 2 * Math.PI) * gain;  // carrier output

// Advance oscillator phases; sampleRate is the AudioWorkletGlobalScope global
phase = (phase + freqMod / sampleRate) % 1;
phase2 = (phase2 + modFreq / sampleRate) % 1;
```

- Detection: MediaPipe Hands model processes RGB frames
- Landmarks: 21 3D hand landmarks extracted per frame
- Gesture Analysis:
- Wrist position: Landmark 0
- Pinch: Distance between thumb tip (4) and index tip (8)
- Rotation: Angle between index MCP (5) and pinky MCP (17)
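Translating those landmark rules into code, a hedged sketch of the per-hand analysis (the 0.2 pinch normalization constant is an assumption):

```javascript
// Derive gesture parameters from MediaPipe's 21 normalized landmarks
function analyzeHand(lm) {
  const pinchDist = Math.hypot(lm[4].x - lm[8].x, lm[4].y - lm[8].y); // thumb tip to index tip
  return {
    x: lm[0].x,                                            // wrist X
    y: lm[0].y,                                            // wrist Y
    pinch: Math.min(1, Math.max(0, 1 - pinchDist / 0.2)),  // assumed 0.2 normalization
    rotation: Math.atan2(lm[17].y - lm[5].y,               // index MCP to pinky MCP
                         lm[17].x - lm[5].x),
  };
}
```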
- Latency: <10ms end-to-end (SharedArrayBuffer mode)
- CPU Usage: ~20-30% on modern hardware
- Memory: ~50MB (MediaPipe model + video buffers)
- Frame Rate: 30fps video, 44.1kHz audio
```
gesture-engine-shared-buffer/
├── index.html          # Main application (HTML/CSS/JS)
├── server.js           # Node.js development server
├── server.py           # Python Flask development server
├── requirements.txt    # Python dependencies
└── __pycache__/        # Python bytecode cache
```
- Frontend: Vanilla JavaScript ES6 modules
- Styling: Tailwind CSS (CDN)
- Computer Vision: MediaPipe Hands
- Audio: Web Audio API + AudioWorklet
- Concurrency: SharedArrayBuffer
- Servers: Node.js HTTP, Python Flask
- Optimal: Chrome 91+, Edge 91+ (SharedArrayBuffer support)
- Fallback: Firefox, Safari (postMessage mode)
- Requirements: WebRTC, WebGL, AudioWorklet
- Camera: Green when video stream active
- MediaPipe: Green when hand tracking initialized
- Audio Worklet: Green when synthesis engine running
- SharedArrayBuffer: Green when zero-latency mode active
- Real-time gesture data logging
- System initialization messages
- Performance warnings and errors
- Cross-origin isolation status
- Live parameter display (X, Y, Pinch, Rotation)
- Hand landmark overlay on video feed
- Pinch gesture visualization (connecting line)
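The pinch visualization comes down to a couple of canvas calls per frame; an illustrative sketch (the canvas handle and landmark array names are assumed):

```javascript
// Draw the thumb-to-index pinch line on the overlay canvas (illustrative)
function drawPinchLine(ctx2d, lm, width, height) {
  ctx2d.beginPath();
  ctx2d.moveTo(lm[4].x * width, lm[4].y * height); // thumb tip
  ctx2d.lineTo(lm[8].x * width, lm[8].y * height); // index tip
  ctx2d.strokeStyle = '#0f0';
  ctx2d.stroke();
}
```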
- Multi-Hand Synthesis: Independent voices for each hand
- Advanced Gestures: More complex gesture recognition
- Audio Effects: Reverb, delay, filtering
- MIDI Output: Standard MIDI protocol support
- Recording: Audio capture and export
- Machine Learning: Gesture-to-sound model training
This project is open source. See individual component licenses for MediaPipe and Web Audio API usage terms.
Contributions welcome! Areas of interest:
- Audio synthesis algorithms
- Gesture recognition improvements
- Performance optimizations
- Cross-platform compatibility
- Documentation and examples
- MediaPipe: Google's computer vision framework
- Web Audio API: W3C audio processing standard
- SharedArrayBuffer: Modern web concurrency primitive
- Tailwind CSS: Utility-first CSS framework