ReadPaper Blog
StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
StreamChar addresses real-time text-to-character audio-video generation, where a system must speak a requested transcript, preserve visual identity across streamed chunks, and generate faster than playback. The paper proposes a decoupled framework in which an LLM-based orchestrator handles long-horizon transcript planning while a joint audio-video diffusion transformer (DiT) performs short-window denoising, supported by a two-stage distillation strategy for low-latency streaming. Its experiments, run on a single H100 GPU, report a favorable trade-off among transcript fidelity, audio-visual synchronization, visual quality, and long-horizon stability.
Source: StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

A long song, but the little character must keep up
The paper frames streaming character audio-video generation as a problem where transcript fidelity, identity preservation, synchronization, and latency must all be satisfied at once. Unlike short-clip generation, long-horizon streaming requires each chunk to remain consistent with the cumulative script while also preserving speaker appearance and motion continuity across segment boundaries. StreamChar targets the harder joint text-to-audio-video setting, where the system must generate speech content and character animation together rather than receiving a finished waveform as input. The authors argue that real-time use imposes a strict playback budget, making conventional diffusion sampling too slow unless the model is aggressively distilled. The central motivation is that low-latency generation and long-horizon coherence interact: faster few-step students can amplify drift, while autoregressive drift makes efficient streaming harder to train reliably.

Why old ways wobble over time
The paper identifies a specific weakness in monolithic or purely chunk-wise approaches: one backbone is often forced to perform semantic script tracking, cross-chunk memory, and local spatiotemporal denoising simultaneously. In autoregressive streaming, local mistakes in one chunk become conditioning context for the next, so transcript omissions, repetitions, audio misalignment, and visual shifts can accumulate over time. The introduction emphasizes that multimodal DiT systems that work well for short clips may degrade when extended to minutes-long interactive generation because global planning errors propagate directly into local synthesis. The authors also connect this problem to distillation-induced mode collapse, where reducing denoising steps can narrow spatial behavior and reduce temporal quality. This diagnosis motivates StreamChar’s separation of global orchestration from local generation rather than simply scaling a single unified diffusion transformer.

StreamChar splits the job into two gentle hands
StreamChar’s main architectural contribution is a decoupled design that pairs an LLM orchestrator with a short-window joint audio-video DiT. The orchestrator reads the prompt, transcript, and historical context, then produces a compact frame-aligned audio condition that specifies what should be acoustically expressed in the active chunk. The DiT receives noisy video and audio latents, timestep information, frozen T5 prompt embeddings, the orchestrator’s audio condition, and visual conditions such as reference and motion frames. Because long-range semantic planning is assigned to the LLM pathway, the DiT can concentrate on local bidirectional denoising of the current audio-video window. The paper’s latent-flow formulation maps video and audio into VAE latent spaces and trains the DiT to predict a time-dependent velocity field that reverses the interpolation from clean latents toward Gaussian noise. Motion-frame conditioning supplies explicit temporal context from previously generated content, helping preserve continuity across chunks without making the denoiser responsible for the entire long transcript.
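
The latent-flow idea described above can be made concrete with a toy example. The sketch below is not the paper's implementation: it assumes a linear interpolation path (t=0 is the clean latent, t=1 is Gaussian noise), and the function names `interpolate`, `velocity_target`, and `euler_denoise` are hypothetical. An oracle velocity stands in for the trained DiT so that the reversal can be checked by hand.

```python
import random

def interpolate(clean, noise, t):
    """Assumed linear path: t=0 gives the clean latent, t=1 gives pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]

def velocity_target(clean, noise):
    """Time derivative of the linear path; a plausible regression target."""
    return [n - c for c, n in zip(clean, noise)]

def euler_denoise(noisy, velocity_fn, num_steps):
    """Integrate from t=1 back to t=0 with fixed-size Euler steps."""
    x = list(noisy)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
clean = [0.5, -1.0, 2.0]  # toy stand-in for a VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in clean]

# Oracle velocity (hypothetical): a trained network would only approximate this.
oracle = lambda x, t: velocity_target(clean, noise)
recovered = euler_denoise(interpolate(clean, noise, 1.0), oracle, num_steps=4)
```

With the exact velocity, a handful of Euler steps returns the clean latent up to floating-point error, because the assumed path is a straight line. A learned velocity field is only approximate, which is one reason why reducing the number of steps normally requires distillation.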

Fast enough for real-time, but trained in two steps
For real-time deployment, the paper proposes a two-stage decoupled distillation pipeline rather than optimizing speed and rollout consistency in one entangled step. Stage I applies distribution matching distillation to compress a teacher sampler into a few-step student while focusing on single-chunk generation quality. Stage II then fine-tunes the distilled student under online rollout simulation, where the model generates multiple consecutive chunks autoregressively and learns from the conditions created by its own outputs. This design directly addresses the mismatch between isolated short-window training and streaming inference, where accumulated errors alter future inputs. The authors argue that separating sampler compression from rollout training reduces gradient interference between step reduction and long-horizon stability. In the reported evaluation, this pipeline supports real-time streaming on a single H100 GPU while remaining competitive with recent joint and audio-driven baselines.
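
To illustrate why Stage II trains on the student's own outputs, here is a minimal rollout sketch. Everything in it is an assumption made for illustration: `rollout` and `toy_student` are hypothetical names, the "chunk" is a single number, and the constant bias simply mimics a small per-chunk error. It shows only how self-conditioning lets errors compound, not the paper's training procedure.

```python
def rollout(generate_chunk, first_context, num_chunks):
    """Autoregressive rollout: each chunk is conditioned on the previous output."""
    context = first_context
    trajectory = []
    for index in range(num_chunks):
        chunk = generate_chunk(context, index)
        trajectory.append({"index": index, "context": context, "chunk": chunk})
        context = chunk  # the model's own output becomes the next condition
    return trajectory

def toy_student(context, index):
    # Hypothetical stand-in for the few-step student: adds a small bias per chunk.
    return context + 0.1

steps = rollout(toy_student, first_context=0.0, num_chunks=5)
drift = [round(step["chunk"], 1) for step in steps]
```

In this toy setting the bias grows with every chunk, so later chunks see contexts that isolated single-chunk training would never produce. Collecting the `context` values along a rollout is one way to expose the student to those conditions during fine-tuning.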

Two small tricks stop the drift
The paper adds two rollout-specific mechanisms to stabilize long-horizon streaming: a progress-aware pointer and a sink-frame memory mechanism. The progress-aware pointer predicts the transcript endpoint for each generated chunk, aligning partial transcripts with generated audio during rollout training and reducing semantic drift in chunk-wise generation. This mechanism directly targets the paper’s concern that autoregressive systems can lose their place in the global script through omissions, repetitions, or mismatched speech content. The sink-frame mechanism keeps the first chunk available as a persistent long-range visual anchor attended by later chunks. By preserving an early identity and appearance reference throughout the rollout, the sink-frame design is intended to suppress progressive video drifting. Together, these mechanisms support the paper’s broader claim that long-horizon coherence requires explicit transcript progress tracking and durable visual memory in addition to a fast local generator.
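
The two mechanisms can be sketched as simple bookkeeping. This is a rough illustration under stated assumptions, not the paper's design: the pointer is treated as a predicted token index, chunks are plain labels rather than latents, and `slice_transcript` and `build_visual_context` are hypothetical helpers.

```python
def slice_transcript(tokens, start, predicted_end):
    """Return the partial transcript for one chunk and the next start index."""
    # Clamp the predicted endpoint so the pointer never moves backward
    # and never runs past the end of the transcript.
    end = max(start, min(predicted_end, len(tokens)))
    return tokens[start:end], end

def build_visual_context(chunks, num_recent):
    """Keep the first chunk as a sink anchor, plus the most recent chunks."""
    if not chunks:
        return []
    recent_start = max(1, len(chunks) - num_recent)  # never duplicate the sink
    return [chunks[0]] + chunks[recent_start:]

tokens = "the quick brown fox jumps over the lazy dog".split()
pointer_predictions = [3, 6, 9]  # hypothetical per-chunk endpoint predictions

start, spoken = 0, []
for predicted_end in pointer_predictions:
    piece, start = slice_transcript(tokens, start, predicted_end)
    spoken.extend(piece)

context = build_visual_context(["c0", "c1", "c2", "c3"], num_recent=2)
```

Because each chunk starts where the previous one ended, the concatenated pieces cover the transcript without omissions or repetitions in this example. The visual context always begins with the first chunk, however long the rollout runs.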
