Introduction
Integrating LatentSync into a ComfyUI workflow builds an efficient end-to-end lip-sync pipeline that takes full advantage of its audio-conditioned latent diffusion model. The workflow starts with the "Media Input" node: after importing the source video and audio, connect the "Data Preprocessing" node. This node replicates LatentSync's data-processing logic, automatically unifying the video frame rate (25 FPS), resampling the audio (16,000 Hz), running face detection and affine transformation (aligning facial landmarks to 256×256), and filtering out low-quality material (e.g., faces that are too small or whose sync confidence is too low).
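Under the hood, the frame-rate and sample-rate normalization can be reproduced outside ComfyUI with plain ffmpeg. The sketch below covers only those two steps; the face detection, affine alignment to 256×256, and quality filtering are handled by LatentSync's own preprocessing code:

    import subprocess

    def normalize_media(video_in, audio_in,
                        video_out="video_25fps.mp4", audio_out="audio_16k.wav"):
        # Unify the video frame rate at 25 FPS, matching LatentSync's training data.
        subprocess.run(["ffmpeg", "-y", "-i", video_in, "-r", "25", video_out],
                       check=True)
        # Resample the audio to 16 kHz mono, the rate Whisper expects.
        subprocess.run(["ffmpeg", "-y", "-i", audio_in, "-ar", "16000", "-ac", "1",
                        audio_out], check=True)
        return video_out, audio_out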
The preprocessed media stream splits into two paths: the audio feeds the "Whisper Embedding Extraction" node, which generates audio feature embeddings, while the video frames enter the core "Latent Diffusion Generation" node. This node encapsulates LatentSync's U-Net architecture, fuses the audio embeddings through cross-attention layers, and enables the TREPA module (temporal representation alignment based on large-scale self-supervised video models) to address the inter-frame inconsistency typical of diffusion models.
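For reference, this is roughly what the Whisper embedding step does. LatentSync bundles its own Whisper wrapper, so this standalone sketch using the openai-whisper package is only illustrative:

    import torch
    import whisper  # pip install openai-whisper

    model = whisper.load_model("tiny")  # Whisper acts as a frozen audio encoder

    audio = whisper.load_audio("audio_16k.wav")  # 16 kHz mono waveform
    audio = whisper.pad_or_trim(audio)           # pad/trim to Whisper's 30 s window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    with torch.no_grad():
        # Frame-level audio features, later fused into the U-Net via cross-attention.
        audio_embedding = model.embed_audio(mel.unsqueeze(0))
    print(audio_embedding.shape)  # (1, n_audio_frames, embed_dim)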
Along the way, you can add a "Parameter Adjustment" node to configure guidance_scale (1.5 is recommended for better lip-sync accuracy), the number of diffusion steps, and other parameters; the "Video Synthesis" node then outputs the result, with lip movement accurately synchronized to the audio. The whole workflow relies on ComfyUI's visual node graph, which hides LatentSync's technical complexity while retaining its ability to model complex audio-visual correlations with Stable Diffusion, making it well suited to scenarios such as virtual-human driving and film and television post-production.
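guidance_scale behaves like classifier-free guidance in Stable Diffusion: each denoising step blends an audio-conditioned and an unconditioned noise prediction. A schematic sketch (the unet call here is illustrative, not LatentSync's actual API):

    def guided_noise(unet, latents, t, audio_emb, null_emb, guidance_scale=1.5):
        # Predict noise with and without the audio condition (hypothetical unet call).
        noise_cond = unet(latents, t, encoder_hidden_states=audio_emb)
        noise_uncond = unet(latents, t, encoder_hidden_states=null_emb)
        # guidance_scale > 1 pushes the result toward the audio condition, tightening
        # lip sync; values that are too high can make motion look exaggerated.
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)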
https://github.com/bytedance/LatentSync
Recommended machine: Ultra-PRO
Workflow Overview
How to use this workflow
Step 1: Load Video
Step 2: Load Audio
Step 3: Adjust Parameters
1. For speeches or presentations that require clear lip movements, try increasing the 'lips_expression' value to 1.6–2.5. For everyday conversation, the default value of 1.5 usually works well.
2. If lip movements look unnatural or exaggerated, lower the 'lips_expression' value. (The same tuning heuristic is sketched in code after this list.)
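If you drive the workflow from a script instead of the UI, the tuning heuristic above can be written down directly. pick_lips_expression below is a hypothetical helper, not part of the LatentSync node:

    def pick_lips_expression(content_type: str) -> float:
        # Heuristic from the tips above: stronger articulation for speeches
        # and presentations, the 1.5 default for everyday conversation.
        presets = {
            "speech": 2.0,        # middle of the suggested 1.6-2.5 range
            "conversation": 1.5,  # default
        }
        return presets.get(content_type, 1.5)

    print(pick_lips_expression("speech"))  # -> 2.0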