# MUTON

MUTON is a real-time multimodal dialogue assistance system for hearing-impaired users, especially users who rely on oral communication rather than sign language. The project extends ordinary speech-to-text by combining speech, facial expression, and dialogue context to provide subtitles and short context-aware summaries.
- Overview
- Motivation
- System Pipeline
- Installation
- Prepare Runtime Assets
- Prepare Datasets
- Training / Adaptation
- Running The Server
- API Reference
- Evaluation
- Android Integration
- Repository Structure
- Wiki & Documentation
- License
## Overview

The current recommended runtime uses OpenAI `whisper-1` for Korean speech-to-text and Qwen2.5-Omni + `ko_stage` LoRA for multimodal summary generation. This split is intentional: STT is handled by a stable transcription backend, while multimodal reasoning is handled by a pretrained omni model that can use visual, audio, and text context together.
The repository also keeps the earlier P-project pipeline, including face/audio/text encoders and custom fusion models. Those files are maintained as experiment history and comparison baselines, while the current mobile demo path is based on the Qwen server.
## Motivation

Most captioning services answer only one question: what was said. Real conversation also depends on how it was said, including facial expression, hesitation, tone, emphasis, and the surrounding dialogue flow. MUTON was built to reduce this gap by turning multimodal signals into a more useful communication aid for real-time mobile situations.
In P-project, the main goal was to design a multimodal fusion model directly. In Graduation Project 2, the focus shifted toward a more practical service architecture: using stronger pretrained multimodal models, improving STT reliability, synchronizing utterance-level inputs, and connecting the backend to an Android client that can be used in a live demo.
## System Pipeline

The current pipeline separates low-latency transcription from multimodal summary generation. Audio chunks are buffered and segmented into utterances, video frames are processed for face/emotion context, and committed utterance snapshots are passed to the Qwen-based summary path.
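To make the utterance-commit idea concrete, here is a minimal sketch; it is not the actual implementation, and the silence threshold, field names, and voiced-frame detection are assumptions:

```python
import time
from dataclasses import dataclass, field

# Illustrative only: the real buffering/segmentation lives in src/server_qwen.py;
# the chunk format, gap threshold, and snapshot fields here are assumptions.
SILENCE_GAP_SEC = 0.8  # assumed end-of-utterance silence gap


@dataclass
class UtteranceBuffer:
    chunks: list = field(default_factory=list)  # PCM chunks of the currently open utterance
    last_voice_ts: float = 0.0

    def add_chunk(self, pcm_bytes, is_voiced, now=None):
        """Append one audio chunk; return a committed snapshot when the utterance ends."""
        now = time.time() if now is None else now
        if is_voiced:
            self.chunks.append(pcm_bytes)
            self.last_voice_ts = now
            return None
        # Silence: commit the buffered utterance once the gap is long enough.
        if self.chunks and now - self.last_voice_ts >= SILENCE_GAP_SEC:
            snapshot = {
                "audio": b"".join(self.chunks),  # handed to the STT backend (e.g. whisper-1)
                "committed_at": now,
            }
            self.chunks.clear()
            return snapshot
        return None
```

Committed snapshots, together with the most recent visual emotion context, are what the Qwen-based summary path consumes.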
## Installation

```bash
pip install -r requirements.txt
pip install -r requirements-qwen-omni.txt
```

Recommended runtime requirements (a quick import/CUDA sanity check is sketched after this list):
- Python environment with CUDA-capable PyTorch
- FastAPI and Uvicorn for the backend server
- Hugging Face Transformers, Accelerate, and PEFT for Qwen2.5-Omni
- MediaPipe for face landmark processing
- OpenAI API access for `whisper-1` STT and server-side conversation record summaries
- Cloudflare Tunnel for exposing the local backend to the Android app
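One quick way to confirm the environment is ready (a hedged sketch; the pinned versions come from the two requirements files, and the key check only verifies the variable is set):

```python
# Minimal sanity check for the runtime dependencies listed above.
import os

import torch
import fastapi, uvicorn                   # backend server
import transformers, accelerate, peft     # Qwen2.5-Omni + LoRA runtime
import mediapipe                          # face landmark processing

print("CUDA available :", torch.cuda.is_available())
print("transformers   :", transformers.__version__)
print("peft           :", peft.__version__)
print("OPENAI_API_KEY :", "set" if os.environ.get("OPENAI_API_KEY") else "missing")
```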
## Prepare Runtime Assets

The Qwen runtime expects the base model dependencies and the trained LoRA adapter path to be available on the server.
```bash
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
export MUTON_QWEN_ADAPTER=/path/to/out/qwen_omni_lora/ko_stage
export MUTON_QWEN_STT_BACKEND=openai
```

Optional runtime switches:

- `MUTON_QWEN_STT_BACKEND=openai` uses OpenAI `whisper-1`.
- `MUTON_QWEN_STT_BACKEND=local` uses the local Korean Whisper fallback.
- `MUTON_RECORD_SUMMARY_MODEL=gpt-4o-mini` controls the server-side record summary model.

A minimal sketch of how these variables might be read at startup follows.
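The actual wiring is in `scripts/run_qwen_server.py`; the defaults and error handling below are assumptions used only to illustrate the switches:

```python
import os

# Assumed defaults; the real server may use different fallbacks.
adapter_path = os.environ.get("MUTON_QWEN_ADAPTER", "out/qwen_omni_lora/ko_stage")
stt_backend = os.environ.get("MUTON_QWEN_STT_BACKEND", "openai")
record_model = os.environ.get("MUTON_RECORD_SUMMARY_MODEL", "gpt-4o-mini")

if stt_backend == "openai":
    # whisper-1 through the OpenAI API requires a key on the server.
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY must be set when MUTON_QWEN_STT_BACKEND=openai")
elif stt_backend != "local":
    raise ValueError(f"Unknown MUTON_QWEN_STT_BACKEND: {stt_backend}")

print(f"adapter={adapter_path} stt={stt_backend} record_summary={record_model}")
```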
## Prepare Datasets

MUTON uses two dataset directions:
- Korean multimodal samples built from conversation videos, aligned face crops, audio utterances, transcripts, and summary targets.
- MELD-based auxiliary samples reconstructed through utterance matching, Korean translation, representative frame/audio extraction, and pseudo-summary generation.
Dataset-related scripts:
```
scripts/build_rich_ko_dataset.py
scripts/build_rich_meld_dataset.py
scripts/export_qwen_omni_ko_dataset.py
scripts/export_qwen_omni_meld_dataset.py
src/qwen_omni_dataset.py
```
The P-project dataset format was feature-oriented for a custom fusion Transformer. Graduation Project 2 added JSONL-style multimodal message samples so that image, audio, and text inputs could be adapted to the Qwen2.5-Omni workflow.
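For orientation, a hedged sketch of what one such JSONL sample might look like; the field names below are placeholders, and the authoritative schema is whatever `scripts/export_qwen_omni_ko_dataset.py` actually emits:

```python
import json

# Hypothetical sample shape: one utterance with image/audio/text context and a summary target.
sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "faces/utt_0001.jpg"},  # representative face crop
            {"type": "audio", "audio": "audio/utt_0001.wav"},  # utterance audio clip
            {"type": "text", "text": "utterance transcript (Korean)"},
        ]},
        {"role": "assistant", "content": "short context-aware Korean summary"},
    ]
}

# One JSON object per line, UTF-8, no ASCII escaping so Korean text stays readable.
with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```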
## Training / Adaptation

The current branch keeps both legacy fusion experiments and Qwen adaptation scripts.
```
scripts/train_fusion_seq2seq.py
scripts/train_fusion_seq2seq_two_stage.py
scripts/train_rich_fusion_seq2seq.py
scripts/train_qwen_omni_lora.py
scripts/train_qwen_omni_lora_two_stage.py
```
The recommended Graduation Project 2 path is Qwen2.5-Omni LoRA adaptation. The legacy fusion models remain useful for explaining the project transition from direct encoder fusion to pretrained multimodal generation.
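For orientation, a minimal PEFT LoRA setup of the kind `scripts/train_qwen_omni_lora.py` builds on; the tiny text-only checkpoint below is only a runnable stand-in for the real Qwen2.5-Omni base, and the hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in base model so the sketch runs anywhere; the actual script targets Qwen2.5-Omni.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                                      # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# After training, the saved adapter directory is what MUTON_QWEN_ADAPTER points to,
# e.g. out/qwen_omni_lora/ko_stage.
```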
## Running The Server

Start the Qwen backend:
```bash
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
export PYTHONIOENCODING=utf-8
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
export MUTON_QWEN_ADAPTER=/path/to/out/qwen_omni_lora/ko_stage
export MUTON_QWEN_STT_BACKEND=openai
CUDA_VISIBLE_DEVICES=1 python scripts/run_qwen_server.py
```

Expose the local server through Cloudflare Tunnel:
```bash
cloudflared tunnel --url http://127.0.0.1:5000
```

Publish the active tunnel URL for the Android app:
```bash
python scripts/update_backend_url.py https://xxxxx.trycloudflare.com
git add backend_url.json
git commit -m "Update backend URL"
git push origin server_main
```

The Android app reads:
https://raw.githubusercontent.com/Ai-pre/MUTON/server_main/backend_url.json
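To sanity-check the published URL end to end, something like the following fetches backend_url.json and calls /health; the JSON key name is an assumption, so check what `scripts/update_backend_url.py` actually writes:

```python
import json
import urllib.request

RAW_URL = "https://raw.githubusercontent.com/Ai-pre/MUTON/server_main/backend_url.json"

# Discover the currently published Cloudflare endpoint.
with urllib.request.urlopen(RAW_URL) as resp:
    config = json.load(resp)

backend = config.get("backend_url", "")  # assumed key name
print("Active backend:", backend)

# Confirm the Qwen backend behind the tunnel is up.
with urllib.request.urlopen(backend.rstrip("/") + "/health") as resp:
    print("Health:", resp.status, resp.read().decode("utf-8"))
```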
## API Reference

Main runtime endpoints:
- `GET /health` checks whether the Qwen backend is running.
- `POST /process_audio_chunk` receives PCM audio chunks, performs utterance buffering, and returns subtitle text when an utterance is finalized.
- `POST /process_video_chunk` receives camera frames and returns visual emotion context.
- `POST /get_fusion_analysis` generates the current multimodal summary from committed audio, video, and transcript context.
- `POST /summarize_conversation_record` summarizes saved conversation records on the backend so the Android app does not need to contain an OpenAI API key.
Detailed request and response examples are maintained in the wiki.
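As a rough client-side shape (the payload fields below are assumptions, not the documented schema; the wiki holds the authoritative examples):

```python
import requests  # any HTTP client works; requests is assumed here

BACKEND = "https://xxxxx.trycloudflare.com"  # the active tunnel URL from backend_url.json

# Check that the Qwen backend is running.
print(requests.get(f"{BACKEND}/health", timeout=5).json())

# Ask for the current multimodal summary built from committed audio/video/transcript context.
# The session identifier field is a placeholder, not the documented request schema.
resp = requests.post(f"{BACKEND}/get_fusion_analysis", json={"session_id": "demo"}, timeout=30)
print(resp.json())
```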
## Evaluation

The project is evaluated from both model and service perspectives:
- STT quality: Korean recognition accuracy, repeated-token suppression, hallucination filtering, and utterance segmentation timing (a small measurement sketch follows this list).
- Multimodal summary quality: consistency between transcript, facial expression, audio context, and generated Korean summary.
- Real-time usability: end-to-end latency from Android streaming to subtitle/summary display.
- Robustness: behavior under noisy environments, weak network conditions, and changing Cloudflare tunnel URLs.
- Comparison baseline: P-project fusion Transformer outputs versus the Qwen2.5-Omni based Graduation Project 2 pipeline.
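As one concrete handle on the STT-quality axis, character error rate (usually more informative than word error rate for Korean) can be computed with a plain Levenshtein distance; this is a generic sketch, not the project's evaluation script:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = character-level edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # One-row dynamic-programming Levenshtein distance.
    dist = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, len(h) + 1):
            cur = dist[j]
            dist[j] = min(dist[j] + 1,                       # deletion
                          dist[j - 1] + 1,                   # insertion
                          prev + (r[i - 1] != h[j - 1]))     # substitution
            prev = cur
    return dist[len(h)] / max(len(r), 1)


# Identical reference and hypothesis give 0.0; every wrong character raises the rate.
print(cer("안녕하세요 만나서 반갑습니다", "안녕하세요 만나서 반갑습니다"))
```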
## Android Integration

The Android client lives in a separate repository (MUTON-Android).
The client streams camera frames and audio chunks to this backend, receives subtitles and multimodal summaries, and uses backend_url.json to discover the active Cloudflare endpoint.
## Repository Structure

```
MUTON/
  scripts/
    run_qwen_server.py               current FastAPI entrypoint
    run_server.py                    legacy fusion server entrypoint
    update_backend_url.py            updates backend_url.json
    build_rich_ko_dataset.py         Korean dataset builder
    build_rich_meld_dataset.py       MELD-based dataset builder
    export_qwen_omni_ko_dataset.py   Qwen JSONL exporter
    export_qwen_omni_meld_dataset.py MELD-to-Qwen exporter
    train_qwen_omni_lora.py          Qwen LoRA training script
  src/
    server_qwen.py                   current Qwen summary + STT server
    encoders.py                      face/audio encoders and STT backends
    qwen_omni_dataset.py             Qwen dataset utilities
    server.py                        legacy fusion runtime
    fusion_seq2seq.py                legacy seq2seq experiments
  wiki/                              GitHub wiki-ready documentation
  backend_url.json                   Android backend discovery file
```
## Wiki & Documentation

- wiki home: `wiki/Home.md`
- installation: `wiki/Installation.md`
- API reference: `wiki/API.md`
- datasets: `wiki/Datasets.md`
- training: `wiki/Training.md`
- evaluation: `wiki/Evaluation.md`
- architecture and model evolution: `wiki/Architecture.md`
- request examples: `wiki/Examples.md`
- Android client repository: MUTON-Android