ReadPaper Blog
Lip Forcing: Few-Step Autoregressive Diffusion for Real-Time Lip Synchronization
The paper introduces Lip Forcing, a few-step autoregressive diffusion framework for video-to-video lip synchronization that targets the gap between high-quality diffusion results and real-time deployment. It distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students that generate streaming lip-synced video in two denoising steps, making real-time performance possible while preserving reference fidelity and audio-visual alignment.
Source: Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Why lip-sync diffusion felt too slow
Lip Forcing addresses the central deployment problem in diffusion-based video-to-video lip synchronization: existing models can produce realistic talking-face edits with strong audio-visual alignment, but their inference cost is too high for live translation, virtual avatars, interactive agents, and other latency-sensitive uses. The paper identifies two main bottlenecks behind this cost: full-sequence bidirectional attention, which scales poorly with clip length, and the tens of denoising steps typically required for high-quality diffusion sampling. The authors argue that simply reducing steps is not enough, because lip synchronization entangles speaker identity, head pose, background preservation, and audio-conditioned mouth motion. Their proposed solution is an autoregressive diffusion student that generates chunks causally from past outputs rather than attending bidirectionally over the whole sequence. This reframes lip synchronization as a streaming generation problem while retaining the visual advantages of a large diffusion teacher.
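The causal, chunk-by-chunk generation described above can be pictured with a minimal Python sketch. It is not the paper’s implementation: `stream_chunks`, the `denoise` callback, the `max_context` bound, and the zero-filled starting latent are all hypothetical stand-ins, and a bounded `deque` only mimics the role a KV cache of past chunks would play.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List, Sequence

Chunk = List[float]


def stream_chunks(
    audio_chunks: Sequence[Chunk],
    denoise: Callable[[Chunk, Chunk, Sequence[Chunk], int], Chunk],
    num_steps: int = 2,      # two denoising calls per chunk, as in the summary
    max_context: int = 4,    # hypothetical bound on how many past chunks are kept
) -> Iterator[Chunk]:
    """Yield one output chunk per audio chunk, conditioning only on past outputs."""
    context: Deque[Chunk] = deque(maxlen=max_context)  # stand-in for a KV cache
    for audio in audio_chunks:
        latent = [0.0] * len(audio)  # placeholder; a real sampler would start from noise
        for step in range(num_steps):
            latent = denoise(latent, audio, tuple(context), step)
        context.append(latent)  # later chunks may look at this one, never the reverse
        yield latent            # emitted before any future audio is consumed


def toy_denoise(latent: Chunk, audio: Chunk, context: Sequence[Chunk], step: int) -> Chunk:
    """Toy model (assumption): just adds the audio features to the latent."""
    return [l + a for l, a in zip(latent, audio)]


for chunk in stream_chunks([[1.0], [2.0], [3.0]], toy_denoise):
    print(chunk)
```

The point of the sketch is the control flow: each chunk is emitted as soon as its own denoising calls finish, so output can begin before the rest of the audio exists, which full-sequence bidirectional attention does not allow.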

The hidden tradeoff inside the teacher
The paper’s key diagnostic finding is a fidelity–sync tradeoff induced by classifier-free guidance (CFG) along the 14B bidirectional teacher’s denoising trajectory. In the teacher-trajectory analysis, predictions made without CFG better preserve reference fidelity, which the paper discusses in terms of LPIPS, while CFG-guided predictions improve audio-visual synchronization, measured by Sync-C, and do so mainly within a mid-trajectory band. Audio conditioning is therefore not equally useful at every point in the rectified-flow denoising process, and a single fixed guidance scale can force an unfavorable compromise between face preservation and lip motion accuracy. The authors probe this behavior further with an Euler-step analysis that moves from no-CFG to CFG-guided prediction, which locates a two-step operating point with a mid-trajectory landing. This analysis supplies the empirical basis for the paper’s later design choices, so few-step distillation is not treated as a generic acceleration trick.

Lip Forcing’s three-part fix
Lip Forcing turns the trajectory analysis into a three-part distillation recipe: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. Sync-Window DMD modifies Distribution Matching Distillation by enabling CFG only on training timesteps inside the sync-favoring band identified in the teacher analysis, rather than applying fixed teacher guidance throughout training. The two-step inference schedule uses only two model calls and places the second step at the analysis-derived landing point, aligning the student’s short trajectory with the teacher region where synchronization benefits are strongest. The SyncNet-based reward adds explicit lip-alignment supervision to the distillation objective, addressing the fact that visual realism alone does not guarantee correct audio-mouth correspondence. Together, these components specialize Self Forcing and DMD for video-to-video lip synchronization instead of directly inheriting a general autoregressive video distillation recipe.
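The first two ingredients can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper’s code: the function names are hypothetical, the window bounds, guidance scale, and landing time are placeholder values rather than the paper’s, time runs from t = 1 (noise) to t = 0 (data), and a scale of 1.0 is taken to mean “no guidance” under the usual CFG formula.

```python
from typing import List, Tuple

# Placeholder values for illustration only; the paper derives its own from the teacher analysis.
SYNC_WINDOW: Tuple[float, float] = (0.25, 0.75)  # hypothetical sync-favoring band
CFG_SCALE: float = 4.0                           # hypothetical teacher guidance scale
T_LAND: float = 0.5                              # hypothetical landing time of step one


def guidance_scale_at(t: float, sync_window: Tuple[float, float], cfg_scale: float) -> float:
    """Enable teacher CFG only for training timesteps inside the sync window."""
    lo, hi = sync_window
    return cfg_scale if lo <= t <= hi else 1.0  # 1.0 means guidance is off


def two_step_schedule(t_land: float) -> List[Tuple[float, float]]:
    """Two model calls: noise to the landing point, then landing point to data."""
    return [(1.0, t_land), (t_land, 0.0)]


for t in (0.9, 0.5, 0.1):
    print(t, guidance_scale_at(t, SYNC_WINDOW, CFG_SCALE))
print(two_step_schedule(T_LAND))
```

Gating guidance by timestep, instead of fixing it everywhere, is what lets the student inherit the teacher’s sync gains in the band where they occur while leaving the fidelity-favoring timesteps unguided.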

What the students achieve
The experiments validate Lip Forcing at two student scales, both distilled from the same 14B audio-conditioned bidirectional video diffusion teacher and evaluated on HDTF. The 1.3B student reaches 31 FPS, which crosses the 25 FPS real-time threshold and, per the paper, is 17.6× faster than the bidirectional model at the same scale. The 14B student is reported as the largest diffusion model for video-to-video lip synchronization in the paper’s comparison, and it runs 39.8× faster than its teacher at comparable reference fidelity. The paper also reports sub-millisecond time-to-first-frame for both student scales, a critical property for streaming use, where startup delay matters as much as average throughput. According to the paper’s Pareto-frontier claim, these results place the students on a stronger throughput–FVD tradeoff than prior diffusion lip-sync methods.
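A quick back-of-the-envelope check in Python shows what the 1.3B numbers mean per frame. Only the 25 FPS threshold, the 31 FPS throughput, and the 17.6× speedup come from the paper as summarized here; the baseline throughput below is derived from them, not a reported figure, and the per-frame times are amortized averages, not latencies.

```python
REALTIME_FPS = 25.0   # real-time threshold cited in the summary
STUDENT_FPS = 31.0    # reported throughput of the 1.3B student
SPEEDUP = 17.6        # reported speedup over the same-scale bidirectional model

frame_budget_ms = 1000.0 / REALTIME_FPS       # time available per frame at 25 FPS
student_ms_per_frame = 1000.0 / STUDENT_FPS   # amortized time the student spends per frame
implied_baseline_fps = STUDENT_FPS / SPEEDUP  # derived here, not a number from the paper

print(f"budget per frame:  {frame_budget_ms:.1f} ms")
print(f"student per frame: {student_ms_per_frame:.1f} ms")
print(f"implied bidirectional throughput: {implied_baseline_fps:.2f} FPS")
```

The student fits inside the 40 ms per-frame budget, while the implied same-scale bidirectional throughput is under 2 FPS, which is why the speedup changes what the model can be used for and not just how long a batch job takes.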

Takeaway: faster without losing the face
The main implication of Lip Forcing is that real-time video-to-video lip synchronization can be achieved with causal diffusion when the distillation process is tailored to the synchronization task. The paper’s contribution is not merely compressing a 50-step teacher into a two-step student, but showing that the guidance window, trajectory landing point, and reward signal must reflect the specific conflict between reference preservation and audio-driven mouth motion. By combining autoregressive chunk-wise generation with KV-cache-friendly streaming and lip-sync-aware distillation, the method reduces the dependence on full-sequence bidirectional attention. The approach suggests a broader lesson for diffusion acceleration: task-specific trajectory analysis can reveal where conditioning helps and where it harms, enabling more precise student training. The paper positions Lip Forcing as the first autoregressive diffusion method for video-to-video lip synchronization and as evidence that diffusion-quality talking-face editing can move from offline processing toward real-time deployment.
