ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions
Authors: Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li
Categories: cs.CV, cs.MM
Published: 2026-06-04
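💡 One-Sentence Takeaway
Proposes Triple-Shot Compositions to address the limited narrative power of single-crop aesthetic composition.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: multi-shot composition, visual storytelling, aesthetic cropping, pseudo-labeling strategy, deep learning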
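📋 Key Points
- Existing methods focus on producing a single aesthetically pleasing crop and ignore the narrative needs of multi-shot composition, which limits the range of creative expression.
- The paper proposes the Triple-Shot Compositions (TSC) task, which strengthens an image's storytelling by generating three different shot types from it.
- Experiments show that ShotCrop achieves an average 2.82x improvement over GPT-5 in shot localization accuracy.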
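📝 Abstract (Summary)
Existing work on aesthetic composition usually produces a single aesthetically pleasing crop and overlooks the narrative value of composing multiple shots from one scene. Multi-shot composition is critical for creative workflows: commercial posters often need several crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. The paper therefore proposes the Triple-Shot Compositions (TSC) task, which generates a set of three shots (establishing, medium, and close-up) from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, the paper introduces ShotCrop, which is trained in three stages and is finally optimized with a composite reward tailored to it.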
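🔬 Method Details
Problem definition: Existing aesthetic composition methods produce only a single crop and do not consider multi-shot narration, which limits the richness of creative expression.
Core idea: Propose the Triple-Shot Compositions (TSC) task, which strengthens visual narration by generating a combination of an establishing shot, a medium shot, and a close-up shot from a single image.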
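To make the TSC output concrete, here is a minimal sketch of how one three-shot result could be represented and sanity-checked. The `Shot` class, the `(x1, y1, x2, y2)` pixel box convention, and `validate_triple` are illustrative assumptions, not the paper's actual output format.

```python
from dataclasses import dataclass

# Shot order used in this sketch; the three shot types come from the paper.
SHOT_TYPES = ("establishing", "medium", "close-up")


@dataclass(frozen=True)
class Shot:
    shot_type: str    # one of SHOT_TYPES
    box: tuple        # (x1, y1, x2, y2) in pixels; hypothetical convention
    description: str  # brief shot description supporting the narration


def validate_triple(shots, width, height):
    """Return True if `shots` is exactly one establishing, one medium and one
    close-up crop (in that order), each a non-empty box inside the image."""
    if tuple(s.shot_type for s in shots) != SHOT_TYPES:
        return False
    for s in shots:
        x1, y1, x2, y2 = s.box
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            return False
    return True
```

A check like this would typically run before any scoring, so that malformed outputs are rejected early.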
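Technical framework: Training proceeds in three stages. First, Chain-of-Thought supervised fine-tuning establishes basic reasoning and aesthetic shot-cropping skills. Second, semi-supervised fine-tuning on high-confidence pseudo labels further improves aesthetic capability. Third, the model is optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S).
Key innovation: The most important innovation is the design of the pseudo-labeling strategy, which combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain only high-confidence training signals; this strategy improves both the efficiency and the effectiveness of learning.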
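The filtering idea can be sketched as follows. The summary says only that the three signals are combined; requiring every signal to clear its own threshold is one plausible reading, and the `Candidate` fields, score ranges, and threshold values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    mllm_score: float       # MLLM judge rating, assumed rescaled to [0, 1]
    aesthetic_score: float  # aesthetic assessment score, assumed in [0, 1]
    clip_similarity: float  # CLIP similarity between crop and its description


# Hypothetical thresholds; the paper's actual values are not given here.
THRESHOLDS = {"mllm_score": 0.8, "aesthetic_score": 0.6, "clip_similarity": 0.25}


def is_high_confidence(candidate, thresholds=THRESHOLDS):
    """A candidate passes only if every signal reaches its threshold."""
    return all(getattr(candidate, name) >= t for name, t in thresholds.items())


def filter_pseudo_labels(candidates, thresholds=THRESHOLDS):
    """Keep the candidates that would be used as pseudo labels."""
    return [c for c in candidates if is_high_confidence(c, thresholds)]
```

A conjunction of thresholds favors precision over recall, which matches the stated goal of keeping only high-confidence signals; a weighted sum of the three scores would be an equally reasonable alternative.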
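Key design: During training, a composite reward is used to optimize the model so that the generated shots are both aesthetically pleasing and narratively meaningful.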
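As a rough illustration of how a composite reward feeds group-relative optimization, the sketch below combines several reward components into one scalar and then normalizes rewards within a group of sampled outputs, as in standard GRPO. The component names and weights are hypothetical, and the specifics of the paper's GRPO-S variant are not reproduced here.

```python
import statistics

# Hypothetical reward components and weights, for illustration only.
WEIGHTS = {"format": 0.2, "localization": 0.5, "aesthetic": 0.3}


def composite_reward(components, weights=WEIGHTS):
    """Weighted sum of per-component rewards for one sampled output."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: each reward is centered on the group
    mean and divided by the group standard deviation (population std here)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed relative to the other samples for the same image, no separate value network is needed; when all samples in a group receive the same reward, every advantage is zero and that group contributes no gradient signal.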
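📊 Experimental Highlights
Experiments show that ShotCrop achieves an average 2.82x improvement over the GPT-5 baseline in shot localization accuracy, indicating that the method is effective for the multi-shot composition task.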
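The summary does not say how shot localization accuracy is measured. Intersection-over-union (IoU) between a predicted crop and an expert-annotated crop is a common choice for box localization, so the sketch below assumes it purely for illustration; the `(x1, y1, x2, y2)` box format is likewise an assumption.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Under this assumption, a prediction could be counted as correct when its IoU with the annotation exceeds a chosen threshold such as 0.5, and accuracy would be the fraction of correct predictions per shot type.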
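🎯 Application Scenarios
Potential applications include advertising design, film production, and social media content creation, where the method can help creators convey storylines and emotion more effectively. In the future it may also prove useful in areas such as automated content generation and augmented reality.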
📄 Abstract (Original)
Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set (establishing, medium, and close-up) from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.