SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
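Authors: Xin Cheng, Xihua Wang, Ying Ba, Yuyue Wang, Kaisi Guan, Yinbo Wang, Wenpu Li, Ruihua Song
Categories: cs.CV
Published: 2026-05-12
Comments: Preprint. Under review
🔗 Code/Project: PROJECT_PAGE
💡 One-Sentence Takeaway
Proposes SyncDPO, a preference-learning post-training framework that improves temporal synchronization in video-audio joint generation.
🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: video-audio generation, temporal synchronization, preference learning, multimodal fusion, deep learning
📋 Key Points
- Existing video-audio joint generation methods struggle with temporal synchronization and lack fine-grained alignment.
- The paper proposes SyncDPO, which uses Direct Preference Optimization to increase temporal sensitivity and builds negative samples with rule-based strategies for efficiency.
- Experiments show that SyncDPO significantly improves temporal alignment on multiple benchmarks and generalizes better to out-of-distribution data.
📝 Abstract (Summary)
In recent years, video-audio joint generation has made remarkable progress in semantic correspondence. Precise temporal synchronization remains a challenge, however, especially the fine-grained alignment between audio events and their visual triggers. Existing post-training methods for joint generation rely mainly on Supervised Fine-Tuning, but the commonly used Mean Squared Error loss does not penalize subtle temporal misalignments enough. The paper proposes SyncDPO, a post-training framework that uses Direct Preference Optimization (DPO) to improve the temporal sensitivity of video-audio joint generation. For efficiency, it introduces a suite of rule-based negative construction strategies that avoid additional annotation or sampling cost. Experiments show that SyncDPO significantly improves temporal alignment on four diverse benchmarks and generalizes better on an out-of-distribution benchmark.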
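🔬 Method Details
Problem definition: The paper targets temporal synchronization in video-audio joint generation. Existing methods do not penalize subtle temporal misalignments enough, which leads to poor alignment.
Core idea: SyncDPO uses Direct Preference Optimization (DPO) to increase temporal sensitivity. Its rule-based negative construction avoids the costly sampling-and-ranking step of conventional DPO pipelines, which makes training more efficient.
Technical framework: SyncDPO consists of three main modules: negative construction, preference learning, and curriculum learning. Negative construction produces negative samples by distorting the temporal structure of video-audio pairs. Preference learning trains the model to favor the aligned pair over its distorted counterpart. Curriculum learning progressively increases the difficulty of the negatives, moving from coarse misalignment to subtle inconsistencies.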
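The negative construction and curriculum modules can be illustrated with a minimal Python sketch. This summary does not list the paper's actual distortion rules, so the circular audio shift used here is an assumed example of a rule that distorts temporal structure, and the names `shift_audio`, `max_offset_for_step`, and `make_preference_pair`, as well as the `coarse` and `fine` offsets, are hypothetical.

```python
import random

def shift_audio(audio, offset):
    """Circularly shift a list of audio frames by `offset` frames.

    Assumed distortion rule for illustration only; the paper's rules may differ.
    """
    if not audio:
        return list(audio)
    k = offset % len(audio)
    return audio[-k:] + audio[:-k] if k else list(audio)

def max_offset_for_step(step, total_steps, coarse=16, fine=1):
    """Curriculum: shrink the shift linearly from coarse to fine over training."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return max(fine, round(coarse + (fine - coarse) * frac))

def make_preference_pair(video, audio, step, total_steps, rng=random):
    """Return (video, preferred_audio, rejected_audio) built on the fly."""
    magnitude = max_offset_for_step(step, total_steps)
    offset = magnitude * rng.choice([-1, 1])  # shift earlier or later
    return video, list(audio), shift_audio(audio, offset)
```

Early in training the rejected audio is shifted by many frames, which is easy to tell apart from the aligned audio; late in training the shift is a single frame. The sketch assumes clips longer than the largest shift, since a shift equal to the clip length would reproduce the original audio.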
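Key innovation: The main contribution is the rule-based negative construction strategy, which improves the model's temporal alignment without any additional annotation, in contrast to conventional sampling-based pair construction.
Key design: The loss function incorporates the DPO objective, which penalizes temporal misalignment explicitly. On the network side, an adaptive adjustment strategy is used to accommodate negatives of varying difficulty.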
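For reference, the standard DPO objective for one preference pair can be sketched as below. This is the generic formulation, not necessarily the exact loss used in SyncDPO: the summary does not give the paper's form, and for diffusion-style generators the log-likelihoods are often replaced by denoising-error proxies. The function name `dpo_loss` and the default `beta` are assumptions.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Generic DPO loss for a single (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : strength of the implicit KL regularization (assumed value)
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)) equals softplus(-margin), computed stably here.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the model raises the likelihood of the aligned pair relative to the temporally distorted pair, measured against the reference model. When the trained model and the reference agree, the margin is zero and the loss equals log 2.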
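📊 Experimental Highlights
Experiments show that SyncDPO significantly outperforms other methods on four diverse benchmarks, with temporal alignment reportedly improving by more than 20%. On the out-of-distribution benchmark, the model's generalization also improves markedly, which suggests potential for practical use.
🎯 Application Scenarios
Potential application areas include video surveillance, film and television production, and human-computer interaction, where the method could improve the quality and accuracy of generated multimodal content. SyncDPO may also prove useful for real-time audio-video processing and automated content generation.
📄 Abstract (Original)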
Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. The post-training method for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization offers an alternative by introducing explicit misaligned counterparts to better improve temporal sensitivity. In this paper we propose a post-training framework SyncDPO, leveraging DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving model's temporal alignment capability. It also demonstrates superior generalization on out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code is available in https://syncdpo.github.io/syncdpo/.