cs.CV (2025-12-04)

📊 41 papers in total | 🔗 8 with code

🎯 Navigation by interest area

Pillar 3: Perception & Semantics (10 🔗3) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 9: Embodied Foundation Models (8 🔗1) · Pillar 1: Robot Control (6 🔗2) · Pillar 4: Generative Motion (3 🔗1) · Pillar 8: Physics-based Animation (3) · Pillar 6: Video Extraction (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Perception & Semantics (10 papers)

# | Title | One-sentence summary | Tags | 🔗
1 | Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting | Splannequin: freezes monocular Mannequin-Challenge videos using dual-detection splatting | gaussian splatting, splatting, scene reconstruction
2 | 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer | Proposes 4DLangVGGT for efficient, generalizable joint 4D language-visual geometry understanding | gaussian splatting, splatting, scene understanding
3 | RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS | Proposes RobustSplat++ to address 3D Gaussian rendering under the influence of scene dynamics and illumination | 3D gaussian splatting, 3DGS, gaussian splatting
4 | Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition | Proposes a gating-based multimodal adaptive fusion network to improve human action recognition accuracy | optical flow, multimodal
5 | The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation | Analyzes the SAM2-to-SAM3 gap: why prompt engineering fails in concept-driven image segmentation | open-vocabulary, open vocabulary, foundation model
6 | Gaussian Entropy Fields: Driving Adaptive Sparsity in 3D Gaussian Optimization | Proposes Gaussian Entropy Fields to drive adaptive sparsity in 3D Gaussian optimization | 3D gaussian splatting, 3DGS, gaussian splatting
7 | LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging | LiteVGGT: accelerates VGGT via geometry-aware cached token merging for efficient 3D reconstruction of large-scale scenes | VGGT, foundation model
8 | SAM3-I: Segment Anything with Instructions | SAM3-I: enhances SAM3 with an instruction-aware cascaded adaptation mechanism for instruction-driven image segmentation | open-vocabulary, open vocabulary, instruction following
9 | Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot | Proposes a malicious-image analysis method based on vision-language segmentation fusion that performs content detection, element identification, and localization in one shot | open-vocabulary, open vocabulary
10 | UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes | UTrice: unifies differentiable ray tracing and rasterization via triangles for rendering particle-based 3D scenes | splatting

🔬 Pillar 2: RL & Architecture (8 papers)

# | Title | One-sentence summary | Tags | 🔗
11 | Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning | Semore: VLM-guided enhanced semantic motion representations for visual reinforcement learning | reinforcement learning, motion representation, large language model
12 | ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning | Proposes ARM-Thinker to address the verification problem in multimodal reward models | reinforcement learning, multimodal, instruction following
13 | COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence | COOPER: a unified model for cooperative perception and reasoning in spatial intelligence | reinforcement learning, spatial relationship, large language model
14 | EgoLCD: Egocentric Video Generation with Long Context Diffusion | EgoLCD: an egocentric video generation framework based on long-context diffusion | world model, egocentric, embodied AI
15 | Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks | Proposes stable single-pixel contrastive learning for semantic and geometric tasks | contrastive learning, teacher-student
16 | Generative Neural Video Compression via Video Diffusion Prior | Proposes GNVC-VD, a generative neural video compression framework based on a video diffusion prior, to address temporal flickering in perceptual video compression | flow matching, foundation model
17 | ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching | ReflexFlow: alleviates exposure bias in Flow Matching by rethinking the learning objective | flow matching
18 | Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation | Proposes the Reward Forcing framework for efficient, high-quality streaming video generation, addressing initial-frame copying and insufficient motion dynamics | distillation

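🔬 Pillar 9: Embodied Foundation Models (8 papers)

# | Title | One-sentence summary | Tags | 🔗
19 | RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation | Proposes RAMEN, a resolution-adjustable multimodal encoder for Earth observation data analysis | foundation model, multimodal
20 | Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark | Proposes the visual reasoning tracing benchmark VRT-Bench for evaluating object-level reasoning in multimodal large language models | large language model, multimodal, visual grounding
21 | EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | EMMA: an efficient unified architecture for multimodal understanding, generation, and editing | multimodal
22 | SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding | Proposes SEASON, which mitigates temporal hallucination in video large language models via self-diagnostic contrastive decoding | large language model
23 | Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment | Proposes the SANTA framework, which mitigates object and action hallucinations in multimodal LLMs via self-augmented contrastive alignment | multimodal
24 | Reflection Removal through Efficient Adaptation of Diffusion Transformers | Proposes an efficient adaptation method based on diffusion Transformers for reflection removal, significantly improving image restoration | foundation model
25 | Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild | Proposes the MIND model and the ConvoInsight-DB dataset to address visual ambiguity and evaluation difficulties in in-the-wild conversational psychological analysis | visual grounding
26 | I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models | Proposes I2I-Bench, a comprehensive benchmark for image-to-image editing models | multimodal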

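🔬 Pillar 1: Robot Control (6 papers)

# | Title | One-sentence summary | Tags | 🔗
27 | X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale | X-Humanoid: robotizes human videos to generate humanoid robot videos at scale | humanoid, humanoid robot, world model
28 | FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization | FASTer: efficient autoregressive vision-language-action modeling via neural action tokenization | manipulation, cross-embodiment, vision-language-action
29 | DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation | DraCo: proposes a draft-based chain-of-thought method for text-to-image preview and rare concept generation | manipulation, classifier-free guidance, large language model
30 | Towards Cross-View Point Correspondence in Vision-Language Models | Proposes CrossPoint-Bench and the CroPond model to address cross-view point correspondence in vision-language models | manipulation, affordance, embodied AI
31 | Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints | Proposes an occluded-object reconstruction method based on generative priors and contact constraints, improving robot manipulation performance | manipulation
32 | BulletTime: Decoupled Control of Time and Camera Pose for Video Generation | BulletTime: a video generation framework that decouples time and camera pose for precise 4D control | manipulation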

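🔬 Pillar 4: Generative Motion (3 papers)

# | Title | One-sentence summary | Tags | 🔗
33 | Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model | Diffusion-based human motion generation: an in-depth analysis of how motion representation affects performance | motion diffusion model, MDM, motion diffusion
34 | Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image | Proposes MoRe4D, which jointly performs 3D geometry reconstruction and motion generation to synthesize 4D scenes from a single image | motion generation, spatiotemporal
35 | Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing | Proposes BioTUCH, which uses bioimpedance sensing to refine human pose estimation in self-contact scenarios | motion generation, contact-aware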

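🔬 Pillar 8: Physics-based Animation (3 papers)

# | Title | One-sentence summary | Tags | 🔗
36 | Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence | Proposes HeFT, which uses video diffusion priors for robust zero-shot point tracking | spatiotemporal, foundation model
37 | Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model | Proposes a method based on a recurrent fusion model for detecting alcohol intoxication from facial video | spatiotemporal
38 | WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism | Proposes an attention-based WiFi cross-domain gesture recognition method with improved generalization | spatiotemporal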

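🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-sentence summary | Tags | 🔗
39 | E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving | E3AD: proposes an emotion-aware end-to-end autonomous driving model to improve the human-machine interaction experience | egocentric, motion estimation, vision-language-action
40 | Age-Inclusive 3D Human Mesh Recovery for Action-Preserving Data Anonymization | Proposes the AionHMR framework for age-inclusive 3D human mesh recovery, used for privacy-preserving data anonymization | human mesh recovery, SMPL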

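🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags | 🔗
41 | PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement | Proposes PhyVLLM, a physics-guided video language model with motion-appearance disentanglement that improves physical reasoning | motion representation, large language model, multimodal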
