An automated video localization and translation pipeline for Indic languages, built to extract, translate, and synthesize video files.
1. Product Overview
2. Technical Architecture
3. Cost Structure
4. Installation and Deployment
5. Access & Credentials
6. Data Flow Info
7. Roadmap & Future Work
- This tool translates educational videos into multiple native Indian languages. Since most educational videos are produced in English, it localizes every part of a video to make it accessible to students from diverse language groups.
- The primary use case is video localization through a sequence of automated steps. The target audience is educators, who can localize their educational content into Indic languages and then serve it to students.
- Localizes video content into multiple Indic languages asynchronously.
- Supports both local file & Google Drive link uploads.
- Automated extraction and translation of on-screen text.
- Generates localized subtitles and translates audio tracks.
- Human-In-The-Loop (HITL) checkpoints to monitor progress throughout the entire localization pipeline.
- Versioned REST APIs for pipeline extension and integration.
- Supports video files in .mp4 format.
- Currently supports only 3 languages - Hindi, Marathi, Punjabi.
- Manual review is required within the pipeline when a translation is not appropriate.
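The constraints above can be checked up front before a job is created. A minimal validation sketch (function and constant names are illustrative, not from the actual codebase):

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp4"}
SUPPORTED_LANGUAGES = {"Hindi", "Marathi", "Punjabi"}

def validate_upload(filename: str, target_language: str) -> list[str]:
    """Return a list of validation errors (an empty list means the upload is OK)."""
    errors = []
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        errors.append(f"Unsupported file type: {filename}")
    if target_language not in SUPPORTED_LANGUAGES:
        errors.append(f"Unsupported language: {target_language}")
    return errors
```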
This diagram illustrates the external API routing and media transformation pipeline. The system branches into specialized workflows based on the selected target language.
- Phase 1: Input & Routing
  The pipeline evaluates the original video and routes it based on the selected target language, splitting into a direct path for Hindi or a separate path for other Indic languages (e.g., Marathi, Punjabi).
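The Phase 1 language branch can be sketched as a simple routing function (the workflow names here are illustrative, not identifiers from the actual codebase):

```python
def route_pipeline(target_language: str) -> str:
    """Pick a workflow for the selected target language."""
    if target_language == "Hindi":
        return "hindi_direct_dubbing"  # Phase 2b: ElevenLabs Dubbing API
    return "segmented_pipeline"        # Phase 2a: STT -> translate -> TTS
```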
- Phase 2a: Non-Hindi Translation
  A multi-step pipeline for languages that need more control over text translation & audio:
  - Audio Extraction: FFmpeg extracts the audio track from the uploaded video or link.
  - Transcription: ElevenLabs STT (Speech-to-Text) generates the original transcript.
  - Translation: The Bhashini API translates the text segments.
  - HITL 1 (Human-In-The-Loop): The pipeline pauses, allowing the user to review and edit the translated segments via the UI.
  - Speech Generation: Upon approval, ElevenLabs TTS (Text-to-Speech) synthesizes the new localized audio track.
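The Audio Extraction step above can be sketched as an FFmpeg invocation built in Python; the output codec here is an assumption, not necessarily what the app uses:

```python
def extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build an FFmpeg command that drops the video stream and
    writes the audio track out for transcription."""
    return [
        "ffmpeg", "-y",           # -y: overwrite output if it exists
        "-i", video_path,         # input video
        "-vn",                    # discard the video stream
        "-acodec", "libmp3lame",  # assumed codec; the app may use another
        audio_path,
    ]
```

A worker would pass the resulting list to `subprocess.run(...)`.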
- Phase 2b: Hindi Direct Dubbing
  A straightforward workflow that uses the ElevenLabs Dubbing API to handle end-to-end dubbing directly from the source video.
- Phase 3: On-Screen Text Localization
  Once the translated audio track or muxed video is created, the pipeline triggers the on-screen text localization process:
  - Text Recognition (OCR): The Google Video Intelligence API extracts text directly from the video frames with verbose response metadata.
  - Translation: Bhashini translates the extracted OCR text.
  - Subtitle Generation: The system automatically formats the translations into a VTT subtitle file.
  - HITL 2 (Human-In-The-Loop): The pipeline pauses for a final user review of the translated on-screen text.
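The Subtitle Generation step can be sketched as a small WebVTT formatter. The segment field names (`start`, `end`, `text`) are assumptions for illustration, not the app's actual schema:

```python
def to_vtt(segments: list[dict]) -> str:
    """Render translated segments as a WebVTT subtitle file.
    Each segment carries start/end times in seconds plus the cue text."""
    def ts(seconds: float) -> str:
        # WebVTT cue timestamps use HH:MM:SS.mmm
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(seg["text"])
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)
```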
- Phase 4: Final Media Synthesis
  Triggered by the final HITL review, FFmpeg collects the rows from the child table, permanently overlays the translated text (using a filter script file) onto the video frames, and muxes the result to produce the Final Localized Video.
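The filter script mentioned above is built from per-row overlay entries. A minimal sketch of generating FFmpeg `drawtext` filters from child-table rows, where the row fields (`text`, `x`, `y`, `start`, `end`) are assumed and not the app's real schema:

```python
def drawtext_filters(rows: list[dict]) -> str:
    """Build an FFmpeg filter chain that overlays each translated
    on-screen text row between its detected timestamps."""
    parts = []
    for row in rows:
        # Escape characters that are special inside drawtext values
        text = row["text"].replace(":", r"\:").replace("'", r"\'")
        parts.append(
            "drawtext=text='{}':x={}:y={}:enable='between(t,{},{})'".format(
                text, row["x"], row["y"], row["start"], row["end"]
            )
        )
    return ",".join(parts)
```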
This diagram illustrates the queueing & processing flow of a record created in this video localization pipeline. The following highlights the lifecycle of this flow:
- Upload Phase
  The process begins when a user uploads a video or link. This creates a Video Info record in the database. The uploaded video file is stored under the public/files/original/ directory.
- Background Job Queues
  Once the localization process is initiated, a chain of background jobs is triggered sequentially. Managed by Frappe's queues, this allows multiple video records to be processed concurrently, with each record waiting in a queue with designated functionality. The pipeline is segmented into specialized task queues such as audio extraction, speech translation, OCR, etc. Each subsequent function is enqueued only after the successful completion of the preceding step.
- Completion/Processing Phase
  As background workers execute these tasks, they continuously update a single Processed Video Info record with status & progress. The automated flow includes HITL phases for review, which then trigger the subsequent functions within the queues. As stages conclude, the resulting files are stored in the public/files/processed directory.
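The sequential chaining described above can be simulated with a plain in-memory queue. In the real app each worker enqueues the next task via Frappe's background queues on success; the step names below are illustrative only:

```python
from collections import deque

# Illustrative step names; the real app defines its own task names.
PIPELINE_STEPS = [
    "extract_audio",       # FFmpeg strips the audio track
    "transcribe_audio",    # ElevenLabs STT
    "translate_segments",  # Bhashini text translation
    "generate_speech",     # ElevenLabs TTS (after HITL review 1)
    "ocr_and_overlay",     # Google Video Intelligence + FFmpeg (after HITL 2)
]

def run_pipeline(record: dict) -> dict:
    """Run each step only after the previous one completes, mirroring
    how a worker enqueues the next task and updates the record's progress."""
    pending = deque(PIPELINE_STEPS)
    record["completed"] = []
    while pending:
        step = pending.popleft()
        record["completed"].append(step)  # worker updates progress here
    record["status"] = "Success"
    return record
```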
- Frappe is a low-code web framework that handles the server, client side, database, and other configuration in one place.
- Frontend: Javascript (custom client scripts)
- Backend: Python
- Database: MariaDB (MySQL)
- AI Model Providers: Bhashini & ElevenLabs
- Infrastructure: Google Cloud Platform (GCP)
- Cloud Environment:
- Machine Type: e2-highcpu-8 (8 vCPUs, 8 GB memory)
- Gunicorn runs the Frappe application and nginx receives web traffic. Supervisor manages long-running processes such as the Gunicorn workers and the scheduler.
- This pipeline uses Frappe background job queues, so the worker configuration is defined in common_site_config.json. Background workers can be increased to pick up more jobs lined up in the queues. During setup, values are set automatically, but changes can be made as necessary:

```json
{
  "background_workers": "2",
  "default_site": "[your_site_name]",
  "developer_mode": true,
  "file_watcher_port": 6787,
  "frappe_user": "[your_user_name]",
  "gunicorn_workers": "[2 x CPU_Cores + 1]",
  "live_reload": true,
  "rebase_on_pull": false,
  "redis_cache": "redis://127.0.0.1:13000",
  "redis_queue": "redis://127.0.0.1:11000",
  "redis_socketio": "redis://127.0.0.1:13000",
  "restart_supervisor_on_update": false,
  "restart_systemd_on_update": false,
  "serve_default_site": true,
  "shallow_clone": true,
  "socketio_port": 9000,
  "use_redis_auth": false,
  "webserver_port": 8000
}
```
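The `gunicorn_workers` placeholder above follows the common "2 × CPU cores + 1" sizing rule for sync workers. As a quick check:

```python
def gunicorn_worker_count(cpu_cores: int) -> int:
    """Common sizing rule for Gunicorn sync workers: 2 * cores + 1."""
    return 2 * cpu_cores + 1
```

For the e2-highcpu-8 machine listed earlier, this works out to `gunicorn_worker_count(8)`, i.e. 17 workers.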
- External Dependencies:
- FFMPEG
- Bhashini
- ElevenLabs
- Gdown
- Google Cloud Video Intelligence
- Groq (Optional)
- More info regarding ElevenLabs requests, usage analytics, etc. can be found under API Activity.
- The current ElevenLabs subscription is the Pro plan, and a reference cost analysis can be found here: Comparative Cost Analysis, amongst others.
  Note: Refer to the ElevenLabs API pricing, as it can differ between the subscription model and the pay-as-you-go model.
- Google Video Intelligence usage can be monitored under Google Cloud Console -> APIs & Services -> Dashboard. Billing is visible under Google Cloud Console -> Billing, and can be filtered per project and SKU.
- Ensure you have a standard Frappe environment installed (see the Bench Frappe Installation Guide), as this project uses Frappe 15.
- Note on Python dependency management: if you're on an externally managed environment, follow the instructions in the above Frappe docs but substitute uv for pip/venv.
- Once bench is ready:
bench init [directory-name]
- Create a new site for this project locally:
cd $PATH_TO_YOUR_BENCH
bench new-site [your-site-name]
bench use [your-site-name]
Now go to the bench directory (if not already there), and install/clone the Frappe video translation app:
cd $PATH_TO_YOUR_BENCH
bench get-app [github-repo-url-or-ssh] --branch main
bench --site [your-site-name] install-app my_app

Note: Run the following commands from the bench directory to manually create the /original and /processed folders under the site's public files directory:
cd $PATH_TO_YOUR_BENCH
mkdir sites/[your_site_name]/public/files/{original,processed} (replace your_site_name with the site name you created)
This frappe app provides a better preview for video uploads upon saving a record for a doctype, improving user experience.
bench get-app git@github.com:Z4nzu/frappe-preview-attachment.git
bench --site your-site-name install-app preview_attachment

Reference: Frappe Video Preview github
Install uv if that's your choice. (Official site: uv Installation Reference)

curl -LsSf https://astral.sh/uv/install.sh | sh

This is valid for both the local development & deployment server.
- Activate virtual environment if not already:
source env/bin/activate
- Go back to the bench directory & run the following so that the dependencies within pyproject.toml are installed (editable mode) in the central bench Frappe environment:
cd $PATH_TO_YOUR_BENCH
uv pip install -e apps/my_app
- While developing if there is a need to add a new dependency to the project, follow:
- Navigate to apps/my_app & run:
uv add <dependency-name> --no-sync
- Return to the bench directory & run the following to install the updated dependencies into the bench environment:
cd $PATH_TO_YOUR_BENCH
uv pip install -e apps/my_app
(Alternatively, you can run: bench pip install -e apps/my_app)
- FFMPEG:
  - Used for audio/video processing (e.g., extracting audio from the uploaded video, on-screen text overlay, audio-video muxing).
  - Installation Reference: FFMPEG Official Docs
- Bhashini:
  - GOI's language translation platform for Indian languages (STS, ASR, Language Detection, etc.).
  - Required credentials are userId, ulcaApiKey, and email id.
  - Setup Guide: Bhashini Postman Docs
- ElevenLabs:
  - Provides high-quality voice dubbing, multilingual TTS, & other services.
  - Docs: ElevenLabs API
- Google Video Intelligence API:
  - An API service from Google Cloud used for automatically recognizing metadata within a video, allowing for extraction (e.g., OCR, label detection, etc.).
  - Docs: Video Intelligence API
- Gdown:
  - Used for downloading public file links from Google Drive.
  - Docs: Gdown
Once the app and dependencies are installed, you need to configure your environment and start the Frappe processes.
- Configure API Keys: Ensure your site_config.json is updated with the necessary API keys as defined in the Application Credentials (5.2) section.
- Start the Web Server: Open a terminal in your bench directory and start the Frappe development server:
bench start
- Start Background Workers: Open a separate terminal in the bench directory and start the worker processes listening to required queues:
bench worker --queue short,default,long
- Access the app by visiting http://localhost:8000/app (or http://[your-site-name]:8000/app if mapped locally).
When pulling updates or managing the production server, use the following commands from your bench directory if & when necessary:
- When the database schema is updated, such as changes or new additions to a doctype, run the following command:
bench --site [your_site_name] migrate
- To verify that gunicorn and Frappe workers are running correctly:
sudo supervisorctl status
- To apply code changes or configuration updates to the live environment, restart all supervisor processes:
sudo supervisorctl restart all
For quick sanity checks during development, a simple UI is provided to test APIs directly in the browser via www/test-video.html.
- Ensure your local server is running (bench start).
- Open your browser and navigate to http://localhost:8000/test-video.
- Select the desired endpoint from the dropdown (e.g., ping).
- Update the Authorization header with your API token if testing a whitelisted/protected function (see Section 5.2).
- Modify the request body as needed and click Run Test.
- The JSON response will be displayed directly on the page.

Note: You can add more endpoints to the testing dropdown by modifying the www/test-video.html file.
The system uses two different authentication mechanisms, depending on what you are trying to do:
When visiting http://localhost:8000/app, you log in using your standard Frappe user credentials.
- This relies on session-based authentication (handled entirely by Frappe).
- It is required to access the Desk UI (creating & interacting with doctypes like Video Info, and accessing non-whitelisted endpoints/functions).
When externally calling API endpoints directly (if the function is whitelisted), auth headers are required to ensure only authorized clients can access the pipeline entry points. These tokens can be stored securely in your site_config.json.
- Token Format (Recommended)
  Authorization: token <api_key>:<api_secret>
- Basic Format
  Authorization: Basic <base64(api_key:api_secret)>
Reference: Frappe Token Auth
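Both header formats above can be constructed in a few lines (the helper names are illustrative):

```python
import base64

def token_header(api_key: str, api_secret: str) -> dict:
    """Frappe's recommended token scheme: 'token <key>:<secret>'."""
    return {"Authorization": f"token {api_key}:{api_secret}"}

def basic_header(api_key: str, api_secret: str) -> dict:
    """HTTP Basic: base64-encoded 'key:secret'."""
    creds = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return {"Authorization": f"Basic {creds}"}
```

Either dict can be passed as the `headers` argument of an HTTP client when calling a whitelisted endpoint.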
- Google Cloud Authentication: For the Google Cloud Video Intelligence API, you can place the generated service account JSON credential file directly under the site's directory (sites/[your_site_name]/). Authentication can also be configured using other preferred GCP methods, as detailed in the Video Intelligence API Docs and the Google auth options.
- We utilise site_config.json for settings & environment variables. Values are accessed via frappe.conf.[variable_key_name]:

```json
{
  "db_name": "[Your_DB_Name]",
  "db_password": "[Your_DB_Password]",
  "db_type": "mariadb",
  "db_user": "[Your_DB_User]",
  "max_file_size": "[filesize_in_bytes]",
  "encryption_key": "[Encryption_Key]",
  "api_auth_value": "[Bhashini API Authentication value]",
  "groq_api_key": "[GROQ_API_KEY]",
  "elevenlabs_api_key": "[ELEVENLABS_API_KEY]"
}
```
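At runtime these values are read as `frappe.conf.<key>`. As a stand-in (without a running Frappe site), the same attribute-style lookup over plain JSON looks like this; the values are hypothetical:

```python
import json
from types import SimpleNamespace

# Hypothetical, truncated stand-in for site_config.json contents.
raw = '{"api_auth_value": "bhashini-secret", "elevenlabs_api_key": "el-secret"}'

# frappe.conf exposes config keys as attributes; SimpleNamespace mimics that.
conf = SimpleNamespace(**json.loads(raw))
```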
The database schema contains the application's doctypes: Video Info, Processed Video Info and other child tables. The following diagram highlights the doctype design, definitions and the relationships between them.

This sequence diagram illustrates the end-to-end flow of a non-Hindi localization pipeline. It maps interactions between the Frappe backend, worker queues, and external API services such as ElevenLabs (TTS & STT), Bhashini (Text Translation), and Google Video Intelligence (Text Recognition). It also highlights where the automated pipeline involves human-in-the-loop (HITL) interventions, allowing users to review and edit as the pipeline progresses before finally synthesizing the localized video files.
```mermaid
sequenceDiagram
autonumber
actor User
participant UI as Frappe UI
participant Backend as Frappe DB / Backend
participant Worker as Redis Worker Queues
participant FFmpeg as FFmpeg (Local)
participant STT as STT API
participant Bhashini as Bhashini API
participant TTS as ElevenLabs TTS
participant GVI as Google Video Intel
User->>UI: Upload Video & Select Target Lang (e.g., Marathi)
UI->>Backend: Create "Video Info" & "Processed Video Info"
Backend->>Worker: Enqueue labs_sts_translation task
rect rgb(240, 248, 255)
Note right of Worker: Phase 1: Audio to Translated Text
Worker->>FFmpeg: Extract original audio
Worker->>STT: Request audio transcription
STT-->>Worker: Original text segments
Worker->>Bhashini: Translate text segments
Bhashini-->>Worker: Translated text segments
Worker->>Backend: Save segments to Child Table
Backend-->>UI: Display Translation Grid
end
rect rgb(255, 235, 238)
Note over User, Backend: HITL 1: Review Segments
opt Optional Retry
User->>UI: Clicks "Retry" (adds Key Terms / Dict)
UI->>Backend: Trigger retry_trigger API
Backend->>Worker: Re-run translation task
end
User->>UI: Edits segments & Clicks "Generate Speech"
UI->>Backend: Trigger speech_generate API
Backend->>Worker: Enqueue TTS task
end
rect rgb(240, 248, 255)
Note right of Worker: Phase 2: Speech, OCR & Subtitles
Worker->>TTS: Send text for speech generation
TTS-->>Worker: Translated Audio Track
Worker->>Backend: Save Audio URL
Worker->>GVI: Detect on-screen text (OCR)
GVI-->>Worker: Video text timestamps
Worker->>Bhashini: Translate OCR text
Bhashini-->>Worker: Translated on-screen text
Worker->>Backend: Save to Onscreen Text Child Table
%% Subtitles generated here based on code
Worker->>Worker: Generate Subtitles (VTT file)
Backend-->>UI: Display Onscreen Text Grid
end
rect rgb(255, 235, 238)
Note over User, Backend: HITL 2: Review Onscreen Text
User->>UI: Edits onscreen text translations
User->>UI: Clicks "Generate Onscreen Translation"
UI->>Backend: Trigger onscreentxt_trans API
Backend->>Worker: Enqueue Final Synthesis task
end
rect rgb(240, 248, 255)
Note right of Worker: Phase 3: Final Video Generation
Worker->>FFmpeg: Apply text overlay to video
Worker->>FFmpeg: Mux new video with translated audio & overlay
FFmpeg-->>Worker: localized_video.mp4
Worker->>Backend: Save final URLs & Update Status: Success
Backend-->>UI: Render HTML Video Preview with Subtitles
end
```
For all future contributors:
- This project follows Conventional Commits ↗ (adopted from mid-development onwards).
- Check Issues for past progress and for tracking future issues to work on.
- Code Style
  - We use ruff (Python) and prettier (JS/JSON) for consistent formatting.
  - (Recommended) pre-commit is used for code formatting and linting. An optional pre-commit config is included in the repo; if you want automatic checks, enable it:
    cd apps/my_app
    pre-commit install
  - Pre-commit is configured to use ruff, eslint, prettier, and pyupgrade for checking and formatting your code.
- A demo reference video of the application in action can be found here: Localization Demo ↗