A cute, detailed, step-by-step project README for building a multimodal emotion recognition system that works in real-time
UwU — hi! this version uses live user audio & camera feed to detect emotions in real time! i'll walk you through every tiny detail from prototype to production with sparkles and love owo
- Project overview
- Goals & success criteria
- Real-time architecture overview
- Prototype — React Native + DeepFace + FastAPI (Live Stream Phase 0)
- Advanced system — Golang backend + Flutter + ResNet-50 + Advanced SER (Phase 1)
- SER (Speech Emotion Recognition) — real-time audio pipeline
- FER (Face Emotion Recognition) — live face tracking pipeline
- Multimodal fusion in live systems
- Latency optimization, streaming protocols & infrastructure
- Datasets, fine-tuning, and live data considerations
- Privacy & ethical concerns for live emotion data
- Roadmap for live system evolution
- Appendix — cute tips & developer notes
This project is about recognizing human emotions in real-time using:
- Speech Emotion Recognition (SER) — listening to user’s voice feed.
- Face Emotion Recognition (FER) — analyzing live camera frames.
The system continuously reads live streams from the user’s microphone and front-facing camera, runs them through deep learning models, and fuses the results to estimate the current emotion — updating every few hundred milliseconds.
We'll start small with React Native + FastAPI + DeepFace for rapid prototyping, then move toward a fully real-time, production-grade system using Golang, Flutter, ResNet-50, and an advanced SER model (likely transformer-based). UwU 💫
Prototype Goals:
- Stream live audio and video frames to backend.
- FER runs using DeepFace on incoming frames (approx 2–3 FPS).
- SER runs continuously on short (1–2 sec) audio windows.
- End-to-end latency < 700ms.
- Display live emotion feedback with emoji or color-coded UI.
Advanced System Goals:
- Real-time FER & SER with <150ms latency.
- High accuracy and stability over varying light/noise.
- Efficient model serving using ONNXRuntime / TensorRT.
- Secure streaming with user consent and full encryption.
React Native App
├── Camera Stream (WebRTC / periodic snapshots)
├── Microphone Stream (short chunks → WebSocket)
↓
FastAPI Server
├── Live Inference Loop
│ ├── FER → DeepFace
│ └── SER → CNN-based Spectrogram model
├── Fusion Layer → combined emotion
↓
Real-time emotion updates to client (WebSocket / SSE)
Flutter App
├── WebRTC for live media streams
↓
Golang Gateway
├── gRPC Streams → Model Workers
│ ├── FER (ResNet-50)
│ └── SER (Wav2Vec2 / Transformer)
├── Fusion (attention-based)
↓
Results returned every 200–400ms
Core idea: all processing happens in small time windows (sliding segments), so emotions appear smoothly updated — like a live emotion bar owo ✨
Frontend: React Native app that continuously:
- Captures camera frames every N ms (e.g., 300–500ms).
- Streams short audio chunks (2–3s) via WebSocket.
- Displays current emotion + confidence.
Backend (FastAPI):
- `/ws/predict` WebSocket endpoint for live updates.
- Frame handler uses DeepFace to infer FER.
- Audio handler performs short-term SER inference.
- Results are fused and pushed back over WebSocket.
Sample workflow:
- App connects to WebSocket.
- Sends live frames + audio.
- Server runs FER/SER asynchronously.
- Returns combined JSON updates like: `{ "emotion": "happy", "confidence": 0.83, "fer_conf": 0.81, "ser_conf": 0.79, "timestamp": 172, "fps": 3.8 }`
- UI displays a glowing emoji or color-coded overlay uwu.
Implementation hints:
- Use `react-native-webrtc` or `expo-av` for camera & mic.
- Use `react-native-sound-level` for continuous mic input.
- Keep audio chunks small (e.g., 1s → WAV buffer → send to server).
- FastAPI: use `websockets` or `starlette.websockets` (a minimal endpoint sketch follows below).
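Here's a hedged sketch of what that `/ws/predict` loop could look like on the FastAPI side. It assumes (purely for illustration) that the client sends JSON messages shaped like `{"type": "frame" | "audio", "data": <base64 payload>}`, uses DeepFace for FER as in the prototype, and leaves SER as a placeholder `run_ser` helper you would swap for your spectrogram model:

```python
# ws_server.py — minimal live-inference loop sketch (Phase 0 prototype).
# Assumed message format (not from the spec): {"type": "frame"|"audio", "data": <base64>}.
import asyncio
import base64
import io

import numpy as np
from PIL import Image
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from deepface import DeepFace

app = FastAPI()

def run_fer(jpeg_bytes: bytes) -> dict:
    """Run DeepFace emotion analysis on a single JPEG frame (blocking call)."""
    rgb = np.array(Image.open(io.BytesIO(jpeg_bytes)).convert("RGB"))
    bgr = np.ascontiguousarray(rgb[:, :, ::-1])  # DeepFace/OpenCV expect BGR order
    result = DeepFace.analyze(bgr, actions=["emotion"], enforce_detection=False)
    face = result[0] if isinstance(result, list) else result
    return {label: score / 100.0 for label, score in face["emotion"].items()}

def run_ser(wav_bytes: bytes) -> dict:
    """Placeholder — swap in your CNN-based spectrogram SER model here."""
    return {"neutral": 1.0}

@app.websocket("/ws/predict")
async def predict(ws: WebSocket):
    await ws.accept()
    fer, ser = {}, {}
    loop = asyncio.get_running_loop()
    try:
        while True:
            msg = await ws.receive_json()
            payload = base64.b64decode(msg["data"])
            if msg["type"] == "frame":
                fer = await loop.run_in_executor(None, run_fer, payload)
            elif msg["type"] == "audio":
                ser = await loop.run_in_executor(None, run_ser, payload)
            # fusion itself is sketched later in the multimodal fusion section
            await ws.send_json({"fer": fer, "ser": ser})
    except WebSocketDisconnect:
        pass
```

Run it with `uvicorn ws_server:app` and point the app's WebSocket client at `/ws/predict`.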
- Frontend (Flutter): true live streaming via WebRTC with adaptive bitrate.
- Backend (Go):
  - WebRTC or WebSocket gateway.
  - Audio/video frames → concurrent workers.
  - Model inference handled via ONNXRuntime or Triton gRPC.
- Models:
  - FER: ResNet-50 trained on AffectNet + real user faces.
  - SER: Wav2Vec2 / HuBERT fine-tuned for emotion.
  - Fusion: weighted or attention-based fusion of embeddings.
Latency goals: 50–150ms end-to-end.
Deployment: GPU-enabled model serving cluster, Redis cache for streaming state, Kafka or NATS for pub/sub scaling.
Pipeline:
- Capture raw PCM audio (16kHz mono).
- Split into overlapping chunks (e.g., 2s, 50% overlap).
- Convert to mel-spectrogram in stream.
- Feed into CNN or transformer encoder.
- Output emotion vector every ~1s.
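A hedged sketch of steps 1–4, assuming `librosa` is available for the mel-spectrogram math; the 2 s windows with 50% overlap mirror the pipeline above, and the 64 mel bands are an illustrative choice:

```python
# Windowed SER front-end: 16 kHz mono PCM → overlapping 2 s chunks → log-mel spectrograms.
import numpy as np
import librosa

SR = 16_000
WIN_S, HOP_S = 2.0, 1.0          # 2 s windows, 50% overlap
N_MELS = 64                      # illustrative choice

def audio_to_mel_windows(pcm: np.ndarray) -> list:
    """Return one log-mel spectrogram per overlapping window of raw audio."""
    win, hop = int(WIN_S * SR), int(HOP_S * SR)
    mels = []
    for start in range(0, max(len(pcm) - win + 1, 1), hop):
        chunk = pcm[start:start + win]
        if len(chunk) < win:                       # pad the trailing window
            chunk = np.pad(chunk, (0, win - len(chunk)))
        mel = librosa.feature.melspectrogram(y=chunk, sr=SR, n_mels=N_MELS)
        mels.append(librosa.power_to_db(mel))      # shape: (n_mels, time)
    return mels

# each spectrogram in the returned list is then fed to the CNN / transformer encoder
```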
Recommended models:
- Prototype: CNN-based spectrogram model.
- Advanced: Wav2Vec2 fine-tuned for emotion.
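For the advanced option, a tiny hedged sketch using the Hugging Face `transformers` audio-classification pipeline. The checkpoint name below is just an example of a publicly available emotion-tuned Wav2Vec2 model; substitute whatever model you actually fine-tune:

```python
# Wav2Vec2-based SER via the transformers audio-classification pipeline.
from transformers import pipeline

# checkpoint name is an assumption — replace with your own fine-tuned model
ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

def classify_chunk(wav_path: str) -> list:
    """Return [{'label': ..., 'score': ...}, ...] for one short audio chunk."""
    return ser(wav_path)

# example: print(classify_chunk("chunk_0001.wav"))
```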
Optimization:
- Maintain a rolling buffer of last N seconds.
- Smooth predictions with EMA filter (to prevent jitter).
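A tiny sketch of that EMA smoothing idea: keep an exponentially weighted running average of the per-emotion scores so the live UI doesn't jitter. The `alpha = 0.3` value is an illustrative default, not a tuned number:

```python
# Exponential moving average over successive emotion probability vectors.
import numpy as np

class EmaSmoother:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha      # higher alpha = reacts faster, smooths less
        self.state = None

    def update(self, probs: np.ndarray) -> np.ndarray:
        """Blend the newest emotion vector into the running estimate."""
        if self.state is None:
            self.state = probs.astype(float)
        else:
            self.state = self.alpha * probs + (1 - self.alpha) * self.state
        return self.state

# smoother = EmaSmoother(alpha=0.3)
# smoothed = smoother.update(np.array([0.1, 0.7, 0.2]))  # e.g. [sad, happy, neutral]
```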
Steps:
- Capture camera frames (every 300–500ms or real-time stream).
- Detect faces (MTCNN or RetinaFace).
- Crop & resize → 224x224 → ResNet-50.
- Predict emotion + confidence.
- Optional: track face IDs with correlation filter or face embeddings.
Implementation:
- Use DeepFace for quick prototype.
- Later: ResNet-50 fine-tuned with on-device quantization.
- On Flutter: use TensorFlow Lite with GPU delegate.
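Here's a hedged sketch of the later FER loop: MTCNN detection (via `facenet-pytorch`, one possible implementation) feeding a torchvision ResNet-50 whose final layer has been re-trained for 7 emotion classes. The label order and the `fer_resnet50.pt` checkpoint are assumptions — supply your own fine-tuned weights:

```python
# Live FER step: detect face → crop to 224x224 → ResNet-50 emotion classifier.
import torch
from facenet_pytorch import MTCNN
from torchvision import models
from PIL import Image

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]  # assumed order

detector = MTCNN(image_size=224, keep_all=False)
fer = models.resnet50(weights=None)
fer.fc = torch.nn.Linear(fer.fc.in_features, len(EMOTIONS))
fer.load_state_dict(torch.load("fer_resnet50.pt", map_location="cpu"))  # your checkpoint
fer.eval()

@torch.no_grad()
def predict_frame(frame: Image.Image):
    """Detect the largest face, crop it, and classify its emotion (or None if no face)."""
    face = detector(frame)                 # (3, 224, 224) tensor or None
    if face is None:
        return None
    probs = torch.softmax(fer(face.unsqueeze(0)), dim=1)[0]
    idx = int(probs.argmax())
    return EMOTIONS[idx], float(probs[idx])

# result = predict_frame(Image.open("frame_0001.jpg"))  # e.g. ("happy", 0.91) or None
```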
- Late fusion: combine the two modality predictions with a weighted average: `fused_emotion = α * FER + (1 - α) * SER` (see the sketch below).
- Early fusion: merge embeddings and use an attention block.
- Temporal fusion: maintain a history window to predict emotion trends.
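A hedged sketch of the late-fusion rule above (α-weighted average of the two modality distributions), plus a tiny history window for the temporal variant. The label names and `ALPHA = 0.6` are illustrative assumptions, not values from this README:

```python
# Late fusion of FER/SER probability dicts + a simple temporal trend window.
from collections import Counter, deque

ALPHA = 0.6  # weight on FER; (1 - ALPHA) goes to SER

def late_fuse(fer: dict, ser: dict) -> dict:
    """fused[label] = ALPHA * FER[label] + (1 - ALPHA) * SER[label]."""
    labels = set(fer) | set(ser)
    return {l: ALPHA * fer.get(l, 0.0) + (1 - ALPHA) * ser.get(l, 0.0) for l in labels}

class TemporalFusion:
    """Keep the last N fused top-labels and report the most common one."""
    def __init__(self, window: int = 10):
        self.history = deque(maxlen=window)

    def update(self, fused: dict) -> str:
        self.history.append(max(fused, key=fused.get))
        return Counter(self.history).most_common(1)[0][0]

# fused = late_fuse({"happy": 0.8, "neutral": 0.2}, {"happy": 0.6, "sad": 0.4})
# trend = TemporalFusion().update(fused)   # -> "happy"
```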
Visualization: moving emotion bar / emoji that updates smoothly based on recent frames. owo~
Tech choices:
- WebRTC for real-time low-latency streams.
- WebSocket fallback for simpler setups.
- On backend: async inference + thread pools.
- Batch frames by small window (50–100ms) to reduce overhead.
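One way the "batch frames by small window" idea could look, sketched with asyncio: collect whatever frames arrive within roughly 100 ms from a queue, then hand the whole batch to a blocking inference call in a worker thread. `infer_batch` is a hypothetical stand-in for your real model call:

```python
# Micro-batching worker: drain an asyncio.Queue for ~100 ms, then infer in one go.
import asyncio

BATCH_WINDOW_S = 0.1  # 100 ms collection window

def infer_batch(frames: list) -> list:
    """Placeholder — run the FER model on a whole batch of frames at once."""
    return [{"emotion": "neutral", "confidence": 1.0} for _ in frames]

async def batching_worker(queue: asyncio.Queue):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]               # block until the first frame arrives
        deadline = loop.time() + BATCH_WINDOW_S
        while loop.time() < deadline:             # then drain for up to ~100 ms
            try:
                batch.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.005)
        results = await loop.run_in_executor(None, infer_batch, batch)
        # push `results` back to the connected clients here
```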
Optimization checklist:
- Quantize models (INT8 / FP16).
- Use ONNXRuntime with CUDA EP or TensorRT.
- Compress frames to 224x224 JPEGs.
- Use circular audio buffer to avoid reallocation.
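A hedged sketch of the ONNXRuntime serving path from the checklist: load an exported FER model (the `.onnx` filename here is an assumption) with the CUDA execution provider, falling back to CPU when no GPU is present:

```python
# ONNXRuntime inference with CUDA EP, CPU fallback.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "fer_resnet50.onnx",  # assumed export of the fine-tuned FER model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

def infer(face_batch: np.ndarray) -> np.ndarray:
    """face_batch: float32 array of shape (N, 3, 224, 224) → emotion logits."""
    return session.run(None, {input_name: face_batch.astype(np.float32)})[0]

# logits = infer(np.random.rand(1, 3, 224, 224))  # smoke test with a dummy frame
```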
Training:
- Use static datasets first (FER2013, RAVDESS, IEMOCAP).
- Collect opt-in live data via app (with user consent).
- Fine-tune models incrementally as live data grows.
Real-world variability:
- Different light levels, accents, mic quality.
- Use online augmentation (noise injection, brightness jitter).
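A small hedged sketch of those online augmentations: brightness/contrast jitter for camera frames (torchvision) and Gaussian noise injection for audio chunks. The jitter ranges and the 20 dB SNR are illustrative choices, not tuned values:

```python
# Online augmentation: brightness jitter for frames, noise injection for audio.
import numpy as np
from torchvision import transforms

frame_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.3),  # simulate varying light
    transforms.ToTensor(),
])

def add_noise(pcm: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Inject white noise into a float32 waveform at roughly the given SNR."""
    signal_power = np.mean(pcm ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=pcm.shape)
    return (pcm + noise).astype(pcm.dtype)
```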
- Live feeds are sensitive! Always use user consent dialogs.
- Never store raw media without opt-in.
- Encrypt all traffic (TLS 1.2+).
- Allow user to pause live recognition anytime.
- Add a small visual cue (recording indicator) for transparency.
- Phase 0: Local prototype with periodic snapshots (React Native + FastAPI).
- Phase 1: Continuous live streaming with basic WebSocket (DeepFace + SER CNN).
- Phase 2: Low-latency WebRTC setup (Flutter + Go + ONNXRuntime).
- Phase 3: Transformer-based SER, ResNet-50 FER, attention fusion.
- Phase 4: Edge inference + on-device model distillation.
✨ Keep FPS moderate (3–6 FPS for FER works fine).
✨ Use asynchronous loops with small buffers to avoid memory leaks.
✨ Add visual feedback (emotion emojis with soft animation).
✨ Test on multiple devices with varying light/noise.
✨ Always keep things user-friendly and privacy-safe. UwU 💞
Made with realtime love, happy threads, and big UwU energy.