Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

Author: Yoshiyuki Ootani

Categories: cs.CV, cs.LG

Published: 2026-06-04

Comments: 12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at https://github.com/otanl/dreamlite-stream (also mirrored to Hugging Face and Zenodo)

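💡 One-Sentence Takeaway

Proposes a video-rate streaming stylization pipeline that targets the text-encoder bottleneck which emerges in real-time edit diffusion once the U-Net has been aggressively distilled.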

🎯 Matched areas: Pillar 2: RL Algorithms and Architecture; Pillar 9: Embodied Foundation Models

Keywords: video streaming, stylized generation, distillation training, multimodal fusion, real-time generation

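📋 Key Points

Existing real-time text-to-image pipelines hit a per-frame bottleneck that caps throughput; once the U-Net is distilled to a 4-step or 1-step student, that bottleneck moves to the text encoder.
The paper presents a streaming pipeline that pairs a distilled U-Net with an MLLM text encoder and uses asymmetric CUDA pipelining, among other mechanisms, to amortise the encoder cost.
On a single RTX 3090 Ti at 512x512, the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8.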

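📝 Abstract (translated summary)

This paper studies a pipeline for video-rate streaming stylization that pairs a distilled U-Net with a multimodal large language model (MLLM) text encoder, targeting the per-frame bottleneck in real-time text-to-image generation. It combines three engineering mechanisms to reach video-rate throughput on consumer GPUs: asymmetric CUDA pipelining, a compile-friendly ControlNet-LLLite reformulation, and a periodic conditioning-refresh schedule. On an RTX 3090 Ti the pipeline sustains 27.4 fps at batch size B=8, and the same operating point measures 54.9 fps on an RTX 4090 and 74.1 fps on an RTX 5090.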

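🔬 Method Details

Problem definition: The paper addresses the per-frame bottleneck in real-time text-to-image generation. Once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This is most acute in vision-aware edit diffusion, where the encoder is an MLLM.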

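Core idea: Pair a distilled U-Net with an MLLM text encoder (Qwen3-VL) in a streaming pipeline designed to cut the encoder's per-frame cost, through batched text-encoder amortisation and optional static-prompt caching, so that overall throughput rises.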
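Batched amortisation and static-prompt caching can be illustrated with a toy wrapper. This is a minimal sketch, not the paper's implementation: `CachedBatchEncoder` and `toy_encode_batch` are hypothetical names, the cache is keyed on prompt text only (it ignores the image input a vision-aware encoder would also take), and the "embedding" is a placeholder list of floats.

```python
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


class CachedBatchEncoder:
    """Hypothetical wrapper: one batched encoder call per group of unseen prompts."""

    def __init__(self, encode_batch: Callable[[Sequence[str]], List[Embedding]]):
        self._encode_batch = encode_batch
        self._cache: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # counts batched encoder invocations

    def encode(self, prompts: Sequence[str]) -> List[Embedding]:
        # Deduplicate while preserving order; only unseen prompts reach the encoder.
        missing = [p for p in dict.fromkeys(prompts) if p not in self._cache]
        if missing:
            self.encoder_calls += 1
            for prompt, emb in zip(missing, self._encode_batch(missing)):
                self._cache[prompt] = emb
        return [self._cache[p] for p in prompts]


def toy_encode_batch(prompts: Sequence[str]) -> List[Embedding]:
    # Stand-in for the MLLM text encoder (assumption: any batched callable works here).
    return [[float(len(p))] for p in prompts]


encoder = CachedBatchEncoder(toy_encode_batch)
first = encoder.encode(["oil painting"] * 8)   # one batched call covers 8 frames
second = encoder.encode(["oil painting"] * 8)  # static prompt: served from the cache
```

With a static prompt, the second batch of eight frames triggers no encoder call at all, which is the effect the optional cache is meant to capture.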

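Technical framework: The pipeline is built around three mechanisms: asymmetric side-stream / main-stream CUDA pipelining, a compile-friendly ControlNet-LLLite reformulation that folds the U-Net and adapter stack into a single fused graph, and a periodic conditioning-refresh schedule. Together they keep the stream running at video rate.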
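The side-stream / main-stream split can be pictured as a two-stage producer/consumer pipeline. The sketch below is a structural analogy only: Python threads and a bounded queue stand in for CUDA streams, `run_pipeline` is a hypothetical name, and the string "conditioning" is a placeholder for encoder output. It shows the shape of the hand-off, not real GPU overlap.

```python
import queue
import threading
from typing import Iterable, List, Tuple


def run_pipeline(frames: Iterable[int], batch_size: int) -> List[Tuple[int, str]]:
    """Toy two-stage pipeline: a side thread prepares per-batch conditioning
    while the main thread consumes finished batches."""
    # Bounded queue: the side stage can run at most two batches ahead.
    handoff: "queue.Queue" = queue.Queue(maxsize=2)
    done = object()  # sentinel marking the end of the stream

    def side_stage() -> None:
        batch: List[int] = []
        for frame in frames:
            batch.append(frame)
            if len(batch) == batch_size:
                # One "encoder" call covers the whole batch (amortisation).
                handoff.put((batch, f"cond[{batch[0]}..{batch[-1]}]"))
                batch = []
        if batch:  # flush a final partial batch
            handoff.put((batch, f"cond[{batch[0]}..{batch[-1]}]"))
        handoff.put(done)

    worker = threading.Thread(target=side_stage, daemon=True)
    worker.start()

    outputs: List[Tuple[int, str]] = []
    while True:
        item = handoff.get()
        if item is done:
            break
        batch, cond = item
        # "Denoiser" stage: pair each frame with its batch's conditioning.
        outputs.extend((frame, cond) for frame in batch)
    worker.join()
    return outputs


styled = run_pipeline(range(20), batch_size=8)
```

The bounded queue is the design point worth noting: it lets the cheaper stage run ahead without unbounded buffering, at the cost of each frame waiting for its batch to fill.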

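Key innovation: The asymmetric side-stream / main-stream CUDA pipelining and the periodic conditioning-refresh schedule amortise the per-frame conditioning cost. The authors position their numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim.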
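A periodic refresh schedule can be sketched in a few lines: recompute the conditioning every K frames and reuse the cached value in between. The function name `stylise_stream`, the parameter `refresh_every`, and the toy conditioning function are assumptions for illustration; the paper's actual schedule and its hook subset are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def stylise_stream(
    frames: Sequence[int],
    compute_cond: Callable[[int], int],
    refresh_every: int,
) -> Tuple[List[Tuple[int, Optional[int]]], int]:
    """Pair each frame with conditioning refreshed every `refresh_every` frames.

    Returns the (frame, conditioning) pairs and the number of refreshes performed.
    """
    if refresh_every < 1:
        raise ValueError("refresh_every must be at least 1")
    cached: Optional[int] = None
    refreshes = 0
    paired: List[Tuple[int, Optional[int]]] = []
    for index, frame in enumerate(frames):
        if index % refresh_every == 0:
            cached = compute_cond(frame)  # the expensive step, run only periodically
            refreshes += 1
        paired.append((frame, cached))
    return paired, refreshes


# Ten frames with a refresh every four frames need three conditioning computations.
paired, refreshes = stylise_stream(list(range(10)), lambda f: f * 100, refresh_every=4)
```

The trade-off is staleness: a larger `refresh_every` lowers the average conditioning cost per frame but lets the conditioning lag further behind the current frame.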

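Key design: The pipeline uses a 0.39B-parameter distilled edit U-Net and a 2.13B-parameter MLLM text encoder, evaluated at batch sizes B=8 and B=16 at 512x512, with throughput reported on an RTX 3090 Ti, an RTX 4090, and an RTX 5090.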
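A rough back-of-the-envelope calculation shows why the encoder can dominate once the U-Net is distilled. The parameter counts come from the abstract; the fp16 weight size (2 bytes per parameter) and the use of "parameters times passes" as a cost proxy are assumptions, and real per-frame cost also depends on sequence length, resolution, and kernel efficiency.

```python
UNET_PARAMS = 0.39e9     # distilled edit U-Net (from the abstract)
ENCODER_PARAMS = 2.13e9  # MLLM text encoder (from the abstract)
BYTES_PER_PARAM = 2      # assumption: fp16 / bf16 weights


def weight_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight footprint in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


def param_passes(params: float, steps: int) -> float:
    """Crude cost proxy: parameters touched across `steps` forward passes."""
    return params * steps


size_ratio = ENCODER_PARAMS / UNET_PARAMS           # encoder is ~5.5x larger
unet_4step = param_passes(UNET_PARAMS, steps=4)     # 4-step student
unet_1step = param_passes(UNET_PARAMS, steps=1)     # 1-step student
encoder_once = param_passes(ENCODER_PARAMS, steps=1)

print(f"encoder / U-Net size ratio: {size_ratio:.2f}")
print(f"U-Net weights:   {weight_gb(UNET_PARAMS):.2f} GB")
print(f"encoder weights: {weight_gb(ENCODER_PARAMS):.2f} GB")
print(f"4-step U-Net below one encoder pass: {unet_4step < encoder_once}")
```

Under this proxy, even four U-Net passes touch fewer parameters than a single encoder pass, which is consistent with the abstract's claim that distillation moves the critical path to the text encoder.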

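📊 Experimental Highlights

On an RTX 3090 Ti at 512x512, the pipeline sustains 27.4 fps over a 480-frame run at B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 s and 1.0 s respectively. The same operating point measures 54.9 fps on an RTX 4090 and 74.1 fps on an RTX 5090. These figures describe video-rate streaming throughput, not interactive low latency.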
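The reported throughput figures can be turned into batch periods and run durations with simple arithmetic. The fps values come from the abstract; the helper names are hypothetical, and comparing the RTX 4090 and RTX 5090 figures against the 27.4 fps (B=8) number is an assumption, since the source says only "the same operating point".

```python
from typing import Dict


def batch_period_s(batch_size: int, fps: float) -> float:
    """Time between completed batches at a sustained frame rate."""
    return batch_size / fps


def run_duration_s(n_frames: int, fps: float) -> float:
    """Wall-clock time to process `n_frames` at a sustained frame rate."""
    return n_frames / fps


FPS: Dict[str, float] = {"RTX 3090 Ti": 27.4, "RTX 4090": 54.9, "RTX 5090": 74.1}
baseline = FPS["RTX 3090 Ti"]  # assumption: B=8 figure used as the reference
speedups = {gpu: fps / baseline for gpu, fps in FPS.items()}

period_b8 = batch_period_s(8, 27.4)     # about 0.29 s per batch
period_b16 = batch_period_s(16, 29.6)   # about 0.54 s per batch
duration = run_duration_s(480, 27.4)    # about 17.5 s for the 480-frame run

for gpu, ratio in speedups.items():
    print(f"{gpu}: {ratio:.2f}x the RTX 3090 Ti throughput")
```

The reported p50 latencies (about 0.5 s and 1.0 s) are each a bit under twice the corresponding batch period. One plausible reading is that a frame waits for its batch to fill and then for the batch to pass through the pipeline, which would explain why the paper frames its results as throughput rather than interactive latency.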

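🎯 Application Scenarios

Potential applications include real-time video generation, online content creation, and virtual reality. Higher stylization throughput could give creators a smoother workflow and support adoption in areas such as entertainment and education.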

📄 Abstract (original)

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.
