Localization Video Player
A workflow automation built for a client who needed localized videos on the fly. They input a single edited video, and an AI pipeline handles translation, voice cloning, and lip-sync across languages. This player connects directly to that localization flow.
Pick a language below to see the AI-translated output. You can switch tracks while watching.
How it works
The client needed a way to take a single edited video and publish it in multiple languages without re-editing. We built an automated pipeline that handles the full localization process — from transcription to translated audio to lip-synced output — so they can go from one master cut to seven languages with zero manual rework.
Ingest
Master video uploaded to cloud storage; a webhook triggers the pipeline.
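In practice the ingest hook is small; a minimal sketch, assuming a FastAPI endpoint and a Redis list as the job queue (the route, payload fields, queue name, and target-language list are illustrative, not the production code):

```python
# Minimal ingest webhook: the storage provider posts the object key of the
# freshly uploaded master cut, and we enqueue the first pipeline job.
import json

import redis
from fastapi import FastAPI, Request

app = FastAPI()
queue = redis.Redis(host="localhost", port=6379)

@app.post("/webhooks/ingest")
async def on_upload(request: Request):
    event = await request.json()
    job = {
        "video_key": event["object_key"],  # assumed field name from the storage event
        "languages": ["de", "fr", "es", "it", "pt", "ja", "ko"],  # assumed 7 targets
        "step": "transcribe",
    }
    queue.rpush("transcribe_jobs", json.dumps(job))
    return {"queued": True}
```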
Transcribe
Whisper large-v3-turbo, run through WhisperX on a HuggingFace worker, transcribes the speech and applies forced alignment to produce word-level timestamps.
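A rough sketch of that pass with the WhisperX Python API, assuming a CUDA worker and that the installed faster-whisper build accepts the turbo checkpoint name; the path and batch size are placeholders:

```python
# Transcribe the master cut, then run forced alignment for word-level timestamps.
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")

audio = whisperx.load_audio("master_cut.mp4")
result = model.transcribe(audio, batch_size=16)

# Forced alignment refines the segment timestamps down to individual words.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```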
Translate
Transcript sent to Claude 4.5 Sonnet for context-aware translation per target language, preserving tone and timing cues.
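A minimal sketch of that per-language call with the Anthropic SDK; the prompt wording and the model id string are assumptions, not the production prompt:

```python
# Translate a timed transcript into one target language, keeping timing cues intact.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_segments(segments, target_language):
    transcript = "\n".join(
        f"[{s['start']:.2f}-{s['end']:.2f}] {s['text']}" for s in segments
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id
        max_tokens=4096,
        system=(
            "Translate the transcript into the target language. Preserve the "
            "speaker's tone and keep each line close to its original spoken "
            "length so the timing cues still fit."
        ),
        messages=[{
            "role": "user",
            "content": f"Target language: {target_language}\n\n{transcript}",
        }],
    )
    return message.content[0].text
```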
Voice & Sync
Fish Speech 1.5 clones the original speaker's voice per language. LatentSync handles lip-sync at 512px on a GPU worker.
Deliver
FFmpeg muxes the final videos; the delivery worker uploads them to the CDN and registers tracks in the player API.
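The mux itself is a plain FFmpeg invocation; a sketch assuming the lip-synced video and the dubbed audio arrive as separate files (paths and codec choices are illustrative):

```python
# Swap the dubbed audio onto the lip-synced video without re-encoding the picture.
import subprocess

def mux_language_track(synced_video, dubbed_audio, output_path):
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", synced_video,   # lip-synced video for this language
            "-i", dubbed_audio,   # cloned-voice audio for this language
            "-map", "0:v:0",
            "-map", "1:a:0",
            "-c:v", "copy",       # keep the video stream untouched
            "-c:a", "aac",
            "-shortest",
            output_path,
        ],
        check=True,
    )
```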
Stack
WhisperX (Whisper large-v3-turbo), Claude 4.5 Sonnet, Fish Speech 1.5, LatentSync, FFmpeg, Redis, auto-scaling GPU workers.
Architecture
Cloud workers
Each pipeline step runs as an isolated Python worker on auto-scaling GPU instances. Transcription and lip-sync jobs are queued via Redis and picked up by the next available node. LatentSync's diffusion passes are the heaviest step — average turnaround for a 5-minute video across 7 languages is under 12 minutes.
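A sketch of what a worker's consume loop can look like under that setup; the queue names, job schema, and handler stub are assumptions:

```python
# GPU worker: block on the Redis job queues and dispatch whatever arrives next.
import json

import redis

queue = redis.Redis(host="localhost", port=6379)

def handle_job(job: dict) -> None:
    # Placeholder for the real per-step entry points (transcription, lip-sync, ...).
    print(f"processing {job['step']} for {job.get('language', 'source')}")

def worker_loop() -> None:
    while True:
        # BLPOP blocks until any listed queue has a job, so idle nodes wait
        # instead of polling.
        _, raw = queue.blpop(["transcribe_jobs", "sync_jobs"])
        handle_job(json.loads(raw))

if __name__ == "__main__":
    worker_loop()
```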
Player integration
Once all language tracks are rendered, the delivery worker registers each variant in the player's track API. The frontend player fetches available tracks on load and lets users switch languages mid-playback without losing their position — the same player you see on this page.
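A hedged sketch of that registration call; the endpoint shape, auth scheme, and payload fields are assumptions about the track API:

```python
# Register one finished language variant so the player can offer it as a track.
import requests

def register_track(api_base, video_id, language, cdn_url, api_key):
    response = requests.post(
        f"{api_base}/videos/{video_id}/tracks",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"language": language, "url": cdn_url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```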