Introduction
Integrating LatentSync into a ComfyUI workflow builds an efficient end-to-end lip-sync pipeline that takes full advantage of its audio-conditioned latent diffusion model. The workflow starts with the "Media Input" node: after importing the source video and audio, connect the "Data Preprocessing" node. This node replicates LatentSync's data-processing logic, automatically unifying the video frame rate (25 FPS), resampling the audio (16,000 Hz), running face detection and affine transformation (aligning facial landmarks to 256×256), and filtering out low-quality material (e.g., faces that are too small or whose sync confidence is too low).
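Under the hood, the frame-rate and sample-rate normalization can be reproduced outside ComfyUI with plain ffmpeg. The sketch below covers only those two steps; the face detection, affine alignment to 256×256, and quality filtering are handled by LatentSync's own preprocessing code:

    import subprocess

    def normalize_media(video_in, audio_in,
                        video_out="video_25fps.mp4", audio_out="audio_16k.wav"):
        # Unify the video frame rate at 25 FPS, matching LatentSync's training data.
        subprocess.run(["ffmpeg", "-y", "-i", video_in, "-r", "25", video_out],
                       check=True)
        # Resample the audio to 16 kHz mono, the rate Whisper expects.
        subprocess.run(["ffmpeg", "-y", "-i", audio_in, "-ar", "16000", "-ac", "1",
                        audio_out], check=True)
        return video_out, audio_out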
The preprocessed media stream splits into two paths: the audio feeds the "Whisper Embedding Extraction" node, which generates audio feature embeddings, while the video frames enter the core "Latent Diffusion Generation" node. This node encapsulates LatentSync's U-Net architecture, fuses the audio embeddings through cross-attention layers, and enables the TREPA module (temporal representation alignment based on large-scale self-supervised video models) to address the inter-frame inconsistency typical of diffusion models.
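For reference, this is roughly what the Whisper embedding step does. LatentSync bundles its own Whisper wrapper, so this standalone sketch using the openai-whisper package is only illustrative:

    import torch
    import whisper  # pip install openai-whisper

    model = whisper.load_model("tiny")  # Whisper acts as a frozen audio encoder

    audio = whisper.load_audio("audio_16k.wav")  # 16 kHz mono waveform
    audio = whisper.pad_or_trim(audio)           # pad/trim to Whisper's 30 s window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    with torch.no_grad():
        # Frame-level audio features, later fused into the U-Net via cross-attention.
        audio_embedding = model.embed_audio(mel.unsqueeze(0))
    print(audio_embedding.shape)  # (1, n_audio_frames, embed_dim)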
Along the way, you can add a "Parameter Adjustment" node to configure guidance_scale (1.5 is recommended for better lip-sync accuracy), the number of diffusion steps, and other parameters; the "Video Synthesis" node then outputs the result, with lip movement accurately synchronized to the audio. The whole workflow relies on ComfyUI's visual node graph, which hides LatentSync's technical complexity while retaining its ability to model complex audio-visual correlations with Stable Diffusion, making it well suited to scenarios such as virtual-human driving and film and television post-production.
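guidance_scale behaves like classifier-free guidance in Stable Diffusion: each denoising step blends an audio-conditioned and an unconditioned noise prediction. A schematic sketch (the unet call here is illustrative, not LatentSync's actual API):

    def guided_noise(unet, latents, t, audio_emb, null_emb, guidance_scale=1.5):
        # Predict noise with and without the audio condition (hypothetical unet call).
        noise_cond = unet(latents, t, encoder_hidden_states=audio_emb)
        noise_uncond = unet(latents, t, encoder_hidden_states=null_emb)
        # guidance_scale > 1 pushes the result toward the audio condition, tightening
        # lip sync; values that are too high can make motion look exaggerated.
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)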
https://github.com/bytedance/LatentSync
Recommended machine: Ultra-PRO
Workflow Overview
How to use this workflow
Step 1: Load Video
Step 2: Load Audio
Step 3: Adjust Parameters
1. For speeches or presentations that require clear lip movements, try increasing the 'lips_expression' value to 1.6–2.5. For everyday conversation, the default value of 1.5 usually works well.
2. If lip movements look unnatural or exaggerated, lower the 'lips_expression' value. (The same tuning heuristic is sketched in code after this list.)
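If you drive the workflow from a script instead of the UI, the tuning heuristic above can be written down directly. pick_lips_expression below is a hypothetical helper, not part of the LatentSync node:

    def pick_lips_expression(content_type: str) -> float:
        # Heuristic from the tips above: stronger articulation for speeches
        # and presentations, the 1.5 default for everyday conversation.
        presets = {
            "speech": 2.0,        # middle of the suggested 1.6-2.5 range
            "conversation": 1.5,  # default
        }
        return presets.get(content_type, 1.5)

    print(pick_lips_expression("speech"))  # -> 2.0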