SER + FER — Speech Emotion Recognition & Face Emotion Recognition (Live Feed Edition)

A cute, detailed, step-by-step project README for building a multimodal emotion recognition system that works in real time

UwU — hi! this version uses live user audio & camera feed to detect emotions in real time! i'll walk you through every tiny detail from prototype to production with sparkles and love owo


Table of contents

  1. Project overview
  2. Goals & success criteria
  3. Real-time architecture overview
  4. Prototype — React Native + DeepFace + FastAPI (Live Stream Phase 0)
  5. Advanced system — Golang backend + Flutter + ResNet-50 + Advanced SER (Phase 1)
  6. SER (Speech Emotion Recognition) — real-time audio pipeline
  7. FER (Face Emotion Recognition) — live face tracking pipeline
  8. Multimodal fusion in live systems
  9. Latency optimization, streaming protocols & infrastructure
  10. Datasets, fine-tuning, and live data considerations
  11. Privacy & ethical concerns for live emotion data
  12. Roadmap for live system evolution
  13. Appendix — cute tips & developer notes

Project overview

This project is about recognizing human emotions in real-time using:

  • Speech Emotion Recognition (SER) — listening to the user’s voice feed.
  • Face Emotion Recognition (FER) — analyzing live camera frames.

The system continuously reads live streams from the user’s microphone and front-facing camera, runs them through deep learning models, and fuses the results to estimate the current emotion — updating every few hundred milliseconds.

We'll start small with React Native + FastAPI + DeepFace for rapid prototyping, then move toward a fully real-time, production-grade system using Golang, Flutter, ResNet-50, and an advanced SER model (likely transformer-based). UwU 💫


Goals & success criteria

Prototype Goals:

  • Stream live audio and video frames to backend.
  • FER runs using DeepFace on incoming frames (approx 2–3 FPS).
  • SER runs continuously on short (1–2 sec) audio windows.
  • End-to-end latency < 700ms.
  • Display live emotion feedback with emoji or color-coded UI.

Advanced System Goals:

  • Real-time FER & SER with <150ms latency.
  • High accuracy and stability over varying light/noise.
  • Efficient model serving using ONNXRuntime / TensorRT.
  • Secure streaming with user consent and full encryption.

Real-time architecture overview

Prototype Flow

React Native App
 ├── Camera Stream (WebRTC / periodic snapshots)
 ├── Microphone Stream (short chunks → WebSocket)
 ↓
FastAPI Server
 ├── Live Inference Loop
 │     ├── FER → DeepFace
 │     └── SER → CNN-based Spectrogram model
 ├── Fusion Layer → combined emotion
 ↓
Real-time emotion updates to client (WebSocket / SSE)

Advanced Flow

Flutter App
 ├── WebRTC for live media streams
 ↓
Golang Gateway
 ├── gRPC Streams → Model Workers
 │     ├── FER (ResNet-50)
 │     └── SER (Wav2Vec2 / Transformer)
 ├── Fusion (attention-based)
 ↓
Results returned every 200–400ms

Core idea: all processing happens in small time windows (sliding segments), so emotions appear smoothly updated — like a live emotion bar owo ✨


Prototype — React Native + DeepFace + FastAPI (Live Stream Phase 0)

Frontend: React Native app that continuously:

  • Captures camera frames every N ms (e.g., 300–500ms).
  • Streams short audio chunks (2–3s) via WebSocket.
  • Displays current emotion + confidence.

Backend (FastAPI):

  • /ws/predict WebSocket endpoint for live updates.
  • Frame handler uses DeepFace to infer FER.
  • Audio handler performs short-term SER inference.
  • Results are fused and pushed back over WebSocket.

Sample workflow:

  1. App connects to WebSocket.

  2. Sends live frames + audio.

  3. Server runs FER/SER asynchronously.

  4. Returns combined JSON updates like:

    {
      "emotion": "happy",
      "confidence": 0.83,
      "fer_conf": 0.81,
      "ser_conf": 0.79,
      "timestamp": 172,
      "fps": 3.8
    }
  5. UI displays a glowing emoji or color-coded overlay uwu.

Implementation hints:

  • Use react-native-webrtc or expo-av for camera & mic.
  • Use react-native-sound-level for continuous mic input.
  • Keep audio chunks small (e.g., 1s → WAV buffer → send to server).
  • FastAPI: WebSocket support is built in via Starlette (see the sketch below).
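
Here's a minimal sketch of what the /ws/predict loop could look like. The {"type", "data"} message format and the two model helpers are hypothetical placeholders, not a fixed protocol — adapt them to your setup:

    # minimal /ws/predict sketch: message format and model helpers are assumptions
    import base64
    import json

    from fastapi import FastAPI, WebSocket

    app = FastAPI()

    async def run_fer(jpeg_bytes: bytes) -> dict:
        # placeholder: call DeepFace here (in a thread pool, since it's blocking)
        return {"emotion": "happy", "confidence": 0.81}

    async def run_ser(wav_bytes: bytes) -> dict:
        # placeholder: run the spectrogram CNN on the audio chunk here
        return {"emotion": "happy", "confidence": 0.79}

    @app.websocket("/ws/predict")
    async def predict(ws: WebSocket):
        await ws.accept()
        while True:
            # client sends {"type": "frame" | "audio", "data": "<base64 bytes>"}
            msg = json.loads(await ws.receive_text())
            payload = base64.b64decode(msg["data"])
            if msg["type"] == "frame":
                result = await run_fer(payload)
            else:
                result = await run_ser(payload)
            await ws.send_json({"source": msg["type"], **result})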

Advanced system — Golang backend + Flutter + ResNet-50 + Advanced SER (Phase 1)

  • Frontend (Flutter): true live streaming via WebRTC with adaptive bitrate.

  • Backend (Go):

    • WebRTC or WebSocket gateway.
    • Audio/video frames → concurrent workers.
    • Model inference handled via ONNXRuntime or Triton gRPC (ONNXRuntime sketch below).
  • Models:

    • FER: ResNet-50 trained on AffectNet + real user faces.
    • SER: Wav2Vec2 / HuBERT fine-tuned for emotion.
    • Fusion: weighted or attention-based fusion of embeddings.

Latency goals: 50–150ms end-to-end.

Deployment: GPU-enabled model serving cluster, Redis cache for streaming state, Kafka or NATS for pub/sub scaling.
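
As a taste of the serving side, here's a tiny ONNXRuntime sketch for the FER worker. The model filename, input layout, and 7-class output are assumptions; match them to whatever you export:

    # ONNXRuntime FER worker sketch: model file and label order are assumptions
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(
        "fer_resnet50.onnx",  # hypothetical exported ResNet-50
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # CPU fallback
    )
    input_name = session.get_inputs()[0].name

    def infer(face_rgb: np.ndarray) -> np.ndarray:
        # expects a 224x224x3 uint8 crop; convert to NCHW float32 in [0, 1]
        x = face_rgb.astype(np.float32).transpose(2, 0, 1)[None] / 255.0
        logits = session.run(None, {input_name: x})[0]
        exp = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
        return (exp / exp.sum(axis=-1, keepdims=True))[0]  # e.g., 7 emotion probs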


SER (Speech Emotion Recognition) — real-time audio pipeline

Pipeline:

  1. Capture raw PCM audio (16kHz mono).
  2. Split into overlapping chunks (e.g., 2s, 50% overlap).
  3. Convert each chunk to a mel-spectrogram on the fly (sketch after this list).
  4. Feed into CNN or transformer encoder.
  5. Output emotion vector every ~1s.
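
Steps 2–3 might look like this with librosa; the window length, overlap, and n_mels are example values:

    # sliding-window log-mel spectrograms over a 16 kHz mono stream
    import librosa
    import numpy as np

    SR = 16000
    CHUNK = 2 * SR      # 2 s windows
    HOP = CHUNK // 2    # 50% overlap

    def mel_windows(pcm: np.ndarray):
        """Yield one log-mel spectrogram per overlapping window of raw PCM."""
        for start in range(0, len(pcm) - CHUNK + 1, HOP):
            window = pcm[start:start + CHUNK]
            mel = librosa.feature.melspectrogram(y=window, sr=SR, n_mels=64)
            yield librosa.power_to_db(mel, ref=np.max)  # shape: (64, time_frames)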

Recommended models:

  • Prototype: CNN-based spectrogram model.
  • Advanced: Wav2Vec2 fine-tuned for emotion.

Optimization:

  • Maintain a rolling buffer of last N seconds.
  • Smooth predictions with an EMA (exponential moving average) filter to prevent jitter.
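
Both tricks in one tiny sketch; the buffer length and α are example values to tune:

    # rolling audio buffer + EMA-smoothed predictions; constants are examples
    import collections
    import numpy as np

    class SmoothedSER:
        def __init__(self, seconds=4, sr=16000, alpha=0.3):
            self.buffer = collections.deque(maxlen=seconds * sr)  # last N seconds
            self.alpha = alpha
            self.state = None  # smoothed emotion probability vector

        def push_audio(self, samples: np.ndarray):
            self.buffer.extend(samples)  # old samples fall off automatically

        def smooth(self, probs: np.ndarray) -> np.ndarray:
            # EMA: alpha * new + (1 - alpha) * old, so emotions don't flicker
            if self.state is None:
                self.state = probs
            else:
                self.state = self.alpha * probs + (1 - self.alpha) * self.state
            return self.state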

FER (Face Emotion Recognition) — live face tracking pipeline

Steps:

  1. Capture camera frames (every 300–500ms or real-time stream).
  2. Detect faces (MTCNN or RetinaFace).
  3. Crop & resize → 224x224 → ResNet-50.
  4. Predict emotion + confidence.
  5. Optional: track face IDs with correlation filter or face embeddings.
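
Steps 2–4 as a sketch with facenet-pytorch and torchvision; the 7-class head and the checkpoint are assumptions, you'd load your own fine-tuned weights:

    # face detect -> crop -> ResNet-50 sketch; 7-class head and weights assumed
    import torch
    from facenet_pytorch import MTCNN
    from torchvision.models import resnet50

    mtcnn = MTCNN(image_size=224)  # detects the most prominent face, returns crop
    model = resnet50(weights=None)
    model.fc = torch.nn.Linear(model.fc.in_features, 7)  # 7 emotions (assumed)
    # model.load_state_dict(torch.load("fer_resnet50.pt"))  # hypothetical weights
    model.eval()

    @torch.no_grad()
    def predict(frame_pil):
        face = mtcnn(frame_pil)  # 3x224x224 tensor, or None if no face found
        if face is None:
            return None
        return model(face.unsqueeze(0)).softmax(dim=-1)[0]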

Implementation:

  • Use DeepFace for a quick prototype (example after this list).
  • Later: ResNet-50 fine-tuned with on-device quantization.
  • On Flutter: use TensorFlow Lite with GPU delegate.
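
The prototype path can be this short. In recent DeepFace versions, analyze returns a list of per-face dicts; enforce_detection=False keeps the loop alive on frames with no visible face:

    # DeepFace quick-prototype FER on one frame (BGR numpy array from OpenCV)
    from deepface import DeepFace

    def fer(frame):
        results = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
        top = results[0]  # first detected face
        return top["dominant_emotion"], top["emotion"]  # label + per-emotion scores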

Multimodal fusion in live systems

  • Late fusion: combine predictions using exponential smoothing (sketch after this list):

    fused_emotion = α * FER + (1-α) * SER
    
  • Early fusion: merge embeddings and use attention block.

  • Temporal fusion: maintain history window to predict emotion trends.
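
Here's the late-fusion update from above as code, operating on probability vectors. The label set and α = 0.6 (weighting FER over SER) are just examples:

    # late fusion of FER and SER probability vectors; alpha = 0.6 is an example
    import numpy as np

    EMOTIONS = ["angry", "happy", "neutral", "sad"]  # example label set

    def late_fuse(fer_probs: np.ndarray, ser_probs: np.ndarray, alpha: float = 0.6):
        fused = alpha * fer_probs + (1 - alpha) * ser_probs
        i = int(np.argmax(fused))
        return EMOTIONS[i], float(fused[i])  # label + fused confidence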

Visualization: moving emotion bar / emoji that updates smoothly based on recent frames. owo~


Latency optimization, streaming protocols & infrastructure

Tech choices:

  • WebRTC for real-time low-latency streams.
  • WebSocket fallback for simpler setups.
  • On backend: async inference + thread pools.
  • Batch frames in small windows (50–100ms) to reduce per-message overhead.

Optimization checklist:

  • Quantize models (INT8 / FP16).
  • Use ONNXRuntime with CUDA EP or TensorRT.
  • Compress frames to 224x224 JPEGs.
  • Use a circular audio buffer to avoid reallocation (sketch below).
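
A preallocated ring buffer for the audio path could look like this; the capacity is an example value:

    # fixed-size ring buffer for incoming PCM: allocated once, never resized
    import numpy as np

    class RingBuffer:
        def __init__(self, capacity: int = 4 * 16000):  # e.g., 4 s at 16 kHz
            self.buf = np.zeros(capacity, dtype=np.float32)
            self.pos = 0

        def write(self, samples: np.ndarray):
            idx = (self.pos + np.arange(len(samples))) % len(self.buf)
            self.buf[idx] = samples
            self.pos = (self.pos + len(samples)) % len(self.buf)

        def latest(self, n: int) -> np.ndarray:
            # most recent n samples, in order, wrapping around the end
            idx = (self.pos - n + np.arange(n)) % len(self.buf)
            return self.buf[idx]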

Datasets, fine-tuning, and live data considerations

Training:

  • Use static datasets first (FER2013, RAVDESS, IEMOCAP).
  • Collect live data via the app on an explicit opt-in basis only.
  • Fine-tune models incrementally as live data grows.

Real-world variability:

  • Different light levels, accents, mic quality.
  • Use online augmentation (noise injection, brightness jitter).
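
A minimal online-augmentation sketch; the noise level and brightness range are example values:

    # online augmentation: Gaussian noise for audio, brightness jitter for frames
    import numpy as np

    rng = np.random.default_rng()

    def augment_audio(pcm: np.ndarray, noise_level: float = 0.005) -> np.ndarray:
        noise = rng.standard_normal(len(pcm)).astype(pcm.dtype)
        return pcm + noise_level * noise

    def augment_frame(img: np.ndarray) -> np.ndarray:
        factor = rng.uniform(0.7, 1.3)  # brightness jitter
        return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)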

Privacy & ethical concerns for live emotion data

  • Live feeds are sensitive! Always use user consent dialogs.
  • Never store raw media without opt-in.
  • Encrypt all traffic (TLS 1.2+).
  • Allow user to pause live recognition anytime.
  • Add a small visual cue (recording indicator) for transparency.

Roadmap for live system evolution

  • Phase 0: Local prototype with periodic snapshots (React Native + FastAPI).
  • Phase 1: Continuous live streaming with basic WebSocket (DeepFace + SER CNN).
  • Phase 2: Low-latency WebRTC setup (Flutter + Go + ONNXRuntime).
  • Phase 3: Transformer-based SER, ResNet-50 FER, attention fusion.
  • Phase 4: Edge inference + on-device model distillation.


Appendix — cute tips & developer notes

✨ Keep FPS moderate (3–6 FPS for FER works fine).
✨ Use asynchronous loops with small buffers to avoid memory leaks.
✨ Add visual feedback (emotion emojis with soft animation).
✨ Test on multiple devices with varying light/noise.
✨ Always keep things user-friendly and privacy-safe. UwU 💞


Made with realtime love, happy threads, and big UwU energy.
