Video Voice Cloning with Subtitles!

@Manu
02/02/2026
ComfyUI
Video Generation
New & Trending
Detailed Introduction

🎬 Video Voice Cloning: Make Anyone Say Anything (Alpha Version)

Mimic PC Exclusive Workflow

🎧 1. Start with a video with a voice

Everything begins with real media:

  • an existing video
  • an audio track (or spoken narration)

No manual timing.

No subtitle editing.

No guesswork.

This workflow is built to understand speech, not just display text.

🧠 2. AI speech understanding (word-level precision)

The audio is first processed by QWEN ASR, producing:

  • a clean transcription
  • precise word-level timestamps
  • exact alignment between speech and time

Every word knows when it is spoken.

This is the foundation of everything that follows.
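
To make "every word knows when it is spoken" concrete, here is a minimal sketch of what word-level timestamps could look like as data. The `Word` class, its field names, and the `words_at` helper are hypothetical illustrations, not the actual output format of QWEN ASR or of this workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its timing (hypothetical structure)."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def words_at(words, t):
    """Return the text of every word being spoken at time t (seconds)."""
    return [w.text for w in words if w.start <= t < w.end]


# Made-up sample data, only to show the shape of the structure.
sample = [Word("Hello", 0.00, 0.42), Word("world", 0.50, 0.95)]
```

With this shape, `words_at(sample, 0.2)` finds "Hello", and a time that falls in the pause between the two words finds nothing.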

🎭 3. Voice cloning & re-narration (optional but powerful)

Instead of reusing the original audio, the workflow can:

  • clone a voice using QWEN TTS Voice Cloning
  • generate a brand-new narration from your text
  • keep tone, rhythm, and personality consistent

Then, and this is critical, the generated voice is re-transcribed to recover timestamps that match the new speech exactly.

No drift.

No approximation.

What you hear is what gets subtitled.

✍️ 4. Intelligent subtitle generation (not just burn-in)

Raw word timestamps are not readable subtitles.

So the workflow intelligently rebuilds them into real captions by:

  • grouping words using natural pauses
  • limiting line length for on-screen readability
  • preventing subtitles from staying on screen after the speech has moved on

You don’t format subtitles.

The AI does.
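
The grouping rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the workflow's actual node code: the function name `group_words`, the thresholds (`max_gap`, `max_chars`, `max_duration`), and the reading of "overstaying" as a cap on caption duration are all assumptions.

```python
def _close(group):
    """Collapse a list of (text, start, end) words into one caption tuple."""
    text = " ".join(word[0] for word in group)
    return (text, group[0][1], group[-1][2])


def group_words(words, max_gap=0.6, max_chars=42, max_duration=4.0):
    """Group (text, start, end) word tuples, sorted by start, into captions.

    A new caption starts when any of these holds:
      - the pause before the next word exceeds max_gap seconds
      - adding the word would push the line past max_chars characters
      - adding the word would make the caption span more than max_duration seconds
    """
    captions = []
    current = []
    for text, start, end in words:
        if current:
            line_length = len(" ".join(word[0] for word in current))
            pause = start - current[-1][2]
            too_long = line_length + 1 + len(text) > max_chars
            too_slow = end - current[0][1] > max_duration
            if pause > max_gap or too_long or too_slow:
                captions.append(_close(current))
                current = []
        current.append((text, start, end))
    if current:
        captions.append(_close(current))
    return captions


# Made-up sample: two words spoken together, then a long pause.
sample_words = [("Hello", 0.0, 0.4), ("there", 0.45, 0.8), ("friend", 2.0, 2.5)]
```

Calling `group_words(sample_words)` keeps "Hello there" together and starts a new caption for "friend", because the pause before it is longer than `max_gap`.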

⚡ 5. GPU-accelerated subtitle rendering

Here’s where performance matters.

Subtitles are:

  • rendered only when the text changes
  • cached once
  • blended onto frames on the GPU

No redundant rendering.

No wasted frames.

Built to scale — even on long videos.
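
The render-once, blend-per-frame idea can be shown with a small CPU-only sketch. Everything here is a simplification and an assumption: each frame is reduced to a single brightness value, `render_overlay` is a stand-in for real text rasterization, and the blend runs in plain Python, whereas the workflow performs it on GPU frame tensors.

```python
from functools import lru_cache


@lru_cache(maxsize=256)
def render_overlay(text):
    """Stand-in for expensive text rendering; cached per distinct caption.

    Returns (alpha, brightness). The constants are hypothetical placeholders.
    """
    return (0.5, 255.0)


def blend(frame_value, alpha, overlay_value):
    """Standard alpha blend: overlay on top of the frame."""
    return alpha * overlay_value + (1.0 - alpha) * frame_value


def caption_at(captions, t):
    """Return the caption text active at time t, or None."""
    for text, start, end in captions:
        if start <= t < end:
            return text
    return None


def composite(frames, fps, captions):
    """Blend the active caption onto each frame, rendering each caption once."""
    output = []
    for index, frame_value in enumerate(frames):
        text = caption_at(captions, index / fps)
        if text is None:
            output.append(frame_value)  # no caption: frame passes through
        else:
            alpha, overlay_value = render_overlay(text)  # cache hit after first use
            output.append(blend(frame_value, alpha, overlay_value))
    return output
```

In this sketch, ten frames covered by two captions trigger only two renders; the other eight frames reuse the cached overlay, which is the saving the list above describes.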

✂️ 6. Smart video trimming (no wasted processing)

Why process 20 seconds of video if subtitles end at 10?

The workflow automatically:

  • detects the last subtitle end time
  • outputs only the necessary frames
  • adds a configurable tail (e.g. +2s) to avoid abrupt endings

Fast.

Clean.

Intentional.
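
The trimming arithmetic is simple enough to sketch directly. The function name `frames_to_keep`, its parameters, and the choice to keep the whole video when there are no captions are assumptions made for illustration.

```python
import math


def frames_to_keep(captions, fps, total_frames, tail=2.0):
    """Return how many frames to output, counted from the start of the video.

    captions: list of (text, start, end) tuples, times in seconds.
    tail: extra seconds kept after the last subtitle ends.
    """
    if not captions:
        return total_frames  # assumption: nothing to trim against
    last_end = max(end for _, _, end in captions)
    needed = math.ceil((last_end + tail) * fps)
    return min(total_frames, needed)  # never ask for more frames than exist
```

For the example in the text, a 20-second clip at 24 fps (480 frames) whose last subtitle ends at 10 s keeps (10 + 2) × 24 = 288 frames with the default 2-second tail.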

🎥 7. Final output: a fully synchronized video

At the end of the pipeline, everything is recombined:

  • processed video frames
  • cloned (or original) narration audio
  • correct frame rate metadata

What you get is not a technical artifact: it’s a finished, publish-ready video.

🚀 Why this workflow is different

  • Not just subtitles — speech intelligence
  • Word-accurate timing from real audio
  • Voice cloning fully synchronized with captions
  • GPU-optimized for speed and scale
  • Automatic trimming for efficiency
  • One workflow, zero manual syncing

This is not editing.

This is AI-driven voice-to-video intelligence.

Developed and optimized exclusively for Mimic PC.

Details
APP: ComfyUI (v0.11.0)
Update Time: 02/02/2026
File Space: 10.3 GB
Models: 0
Extensions: 7