Localization Video Player
A workflow automation built for a client who needed localized videos on the fly. They input a single edited video, and an AI pipeline handles translation, voice cloning, and lip-sync across languages. This player connects directly to that localization flow.
Pick a language below to see the AI-translated output. You can switch tracks while watching.
How it works
The client needed a way to take a single edited video and publish it in multiple languages without re-editing. We built an automated pipeline that handles the full localization process — from transcription to translated audio to lip-synced output — so they can go from one master cut to seven languages with zero manual rework.
Ingest
Master video uploaded to cloud storage; a webhook triggers the pipeline.
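In practice the ingest hook is small; a minimal sketch, assuming a FastAPI endpoint and a Redis list as the job queue (the route, payload fields, queue name, and target-language list are illustrative, not the production code):

```python
# Minimal ingest webhook: the storage provider posts the object key of the
# freshly uploaded master cut, and we enqueue the first pipeline job.
import json

import redis
from fastapi import FastAPI, Request

app = FastAPI()
queue = redis.Redis(host="localhost", port=6379)

@app.post("/webhooks/ingest")
async def on_upload(request: Request):
    event = await request.json()
    job = {
        "video_key": event["object_key"],  # assumed field name from the storage event
        "languages": ["de", "fr", "es", "it", "pt", "ja", "ko"],  # assumed 7 targets
        "step": "transcribe",
    }
    queue.rpush("transcribe_jobs", json.dumps(job))
    return {"queued": True}
```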
Transcribe
Whisper large-v3-turbo, run through WhisperX on a HuggingFace worker, transcribes the speech and applies forced alignment to produce word-level timestamps.
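A rough sketch of that pass with the WhisperX Python API, assuming a CUDA worker and that the installed faster-whisper build accepts the turbo checkpoint name; the path and batch size are placeholders:

```python
# Transcribe the master cut, then run forced alignment for word-level timestamps.
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")

audio = whisperx.load_audio("master_cut.mp4")
result = model.transcribe(audio, batch_size=16)

# Forced alignment refines the segment timestamps down to individual words.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```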
Translate
Transcript sent to Claude 4.5 Sonnet for context-aware translation per target language, preserving tone and timing cues.
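A minimal sketch of that per-language call with the Anthropic SDK; the prompt wording and the model id string are assumptions, not the production prompt:

```python
# Translate a timed transcript into one target language, keeping timing cues intact.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_segments(segments, target_language):
    transcript = "\n".join(
        f"[{s['start']:.2f}-{s['end']:.2f}] {s['text']}" for s in segments
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id
        max_tokens=4096,
        system=(
            "Translate the transcript into the target language. Preserve the "
            "speaker's tone and keep each line close to its original spoken "
            "length so the timing cues still fit."
        ),
        messages=[{
            "role": "user",
            "content": f"Target language: {target_language}\n\n{transcript}",
        }],
    )
    return message.content[0].text
```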
Voice & Sync
Fish Speech 1.5 clones the original speaker's voice per language. LatentSync handles lip-sync at 512px on a GPU worker.
Deliver
FFmpeg muxes the final videos; the delivery worker uploads them to the CDN and registers tracks in the player API.
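The mux itself is a plain FFmpeg invocation; a sketch assuming the lip-synced video and the dubbed audio arrive as separate files (paths and codec choices are illustrative):

```python
# Swap the dubbed audio onto the lip-synced video without re-encoding the picture.
import subprocess

def mux_language_track(synced_video, dubbed_audio, output_path):
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", synced_video,   # lip-synced video for this language
            "-i", dubbed_audio,   # cloned-voice audio for this language
            "-map", "0:v:0",
            "-map", "1:a:0",
            "-c:v", "copy",       # keep the video stream untouched
            "-c:a", "aac",
            "-shortest",
            output_path,
        ],
        check=True,
    )
```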
Stack
WhisperX (Whisper large-v3-turbo), Claude 4.5 Sonnet, Fish Speech 1.5, LatentSync, FFmpeg, Redis, auto-scaling GPU workers.
Architecture
Cloud workers
Each pipeline step runs as an isolated Python worker on auto-scaling GPU instances. Transcription and lip-sync jobs are queued via Redis and picked up by the next available node. LatentSync's diffusion passes are the heaviest step — average turnaround for a 5-minute video across 7 languages is under 12 minutes.
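A sketch of what a worker's consume loop can look like under that setup; the queue names, job schema, and handler stub are assumptions:

```python
# GPU worker: block on the Redis job queues and dispatch whatever arrives next.
import json

import redis

queue = redis.Redis(host="localhost", port=6379)

def handle_job(job: dict) -> None:
    # Placeholder for the real per-step entry points (transcription, lip-sync, ...).
    print(f"processing {job['step']} for {job.get('language', 'source')}")

def worker_loop() -> None:
    while True:
        # BLPOP blocks until any listed queue has a job, so idle nodes wait
        # instead of polling.
        _, raw = queue.blpop(["transcribe_jobs", "sync_jobs"])
        handle_job(json.loads(raw))

if __name__ == "__main__":
    worker_loop()
```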
Player integration
Once all language tracks are rendered, the delivery worker registers each variant in the player's track API. The frontend player fetches available tracks on load and lets users switch languages mid-playback without losing their position — the same player you see on this page.
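A hedged sketch of that registration call; the endpoint shape, auth scheme, and payload fields are assumptions about the track API:

```python
# Register one finished language variant so the player can offer it as a track.
import requests

def register_track(api_base, video_id, language, cdn_url, api_key):
    response = requests.post(
        f"{api_base}/videos/{video_id}/tracks",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"language": language, "url": cdn_url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```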