The LTX 2.3 Image-to-Video with Custom Audio Workflow is a specialized ComfyUI pipeline for creating high-quality lip-synced videos from your own custom audio or music. Built on the LTX 2.3 model, this workflow delivers precise lip synchronization, smooth motion, and cinematic HD output, making it ideal for talking avatars, singing videos, and creative storytelling.
With just an image, an audio input, and a descriptive prompt, you can generate videos in which the character's mouth movements follow the speech or lyrics, bringing static visuals to life with natural expressions and motion.
How It Works:
1. Upload the image you want to animate
2. Upload your custom audio or music file
3. Set the output resolution (width & height)
4. Define the audio duration
5. Write a detailed prompt describing actions (talking, dancing, walking) and camera motion
6. Click Queue to generate your video
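Step 4 asks for the audio duration, which you may not know offhand. The following is a minimal sketch, assuming the audio is an uncompressed WAV file, that reads the duration with Python's standard `wave` module. The function name is hypothetical and not part of the workflow; other formats such as MP3 would need a different library.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        # Duration = total number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()
```

Calling `wav_duration_seconds` with the path to your own WAV file gives the value to enter as the audio duration in the workflow.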
This workflow integrates a free LTX text encoder API, which removes the heavy computational load normally associated with local text encoders. By offloading this step:
- Generation runs smoother and faster
- GPU memory usage is significantly reduced
- Overall workflow efficiency is improved
To obtain the free API key:
Get the free LTX text encoder API key here: https://console.ltx.video
After you log in through the link above, you can create a new key.
Then copy the key into the API key field in the workflow.
The workflow produces a fully lip-synced video, where the character's mouth movements align accurately with the provided audio.
Key Features:
- Accurate Lip Sync: Matches mouth movements precisely to speech or song lyrics
- Custom Audio Support: Use any voice, dialogue, or music track
- Talking & Singing Capabilities: Perfect for AI presenters, performers, and virtual influencers
- Prompt-Based Motion Control: Define actions, gestures, and camera movement through text
- HD Video Output: Clean, stable visuals with natural motion
- Flexible Duration: Generate videos based on your audio length
Performance Tips:
- ULTRA PRO GPU is recommended for best speed and quality
- Higher resolutions increase generation time, so adjust based on your needs
- Detailed prompts improve realism and motion accuracy
Ideal Use Cases:
- AI talking avatars and presenters
- Music videos and lyric-based animations
- Social media content creation
- Storytelling with character dialogue
- Virtual influencer videos
The LTX 2.3 Image-to-Video with Custom Audio Workflow delivers a seamless way to create realistic, expressive, and lip-synced videos, combining advanced AI motion, audio alignment, and creative control inside ComfyUI.
