cs.CV (2025-12-04)

📊 42 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (10 🔗3) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 9: Embodied Foundation Models (8 🔗1) · Pillar 1: Robot Control (6 🔗2) · Pillar 4: Generative Motion (4 🔗1) · Pillar 8: Physics-based Animation (3) · Pillar 6: Video Extraction (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Perception & Semantics (10 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting	Splannequin: freezes monocular Mannequin Challenge footage using dual-detection splatting	gaussian splatting splatting scene reconstruction	✅
2	4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer	Proposes 4DLangVGGT for efficient, generalizable joint 4D language-visual geometry understanding	gaussian splatting splatting scene understanding	✅
3	RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS	Proposes RobustSplat++ to address 3D Gaussian rendering under dynamic content and illumination changes	3D gaussian splatting 3DGS gaussian splatting
4	Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition	Proposes a gating-based adaptive multimodal fusion network to improve human action recognition accuracy.	optical flow multimodal
5	The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation	Analyzes the SAM2-to-SAM3 gap: investigates why prompt engineering fails in concept-driven image segmentation	open-vocabulary open vocabulary foundation model
6	Gaussian Entropy Fields: Driving Adaptive Sparsity in 3D Gaussian Optimization	Proposes Gaussian Entropy Fields to drive adaptive sparsity in 3D Gaussian optimization	3D gaussian splatting 3DGS gaussian splatting
7	LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging	LiteVGGT: accelerates VGGT via geometry-aware cached token merging, enabling efficient 3D reconstruction of large-scale scenes.	VGGT foundation model	✅
8	SAM3-I: Segment Anything with Instructions	SAM3-I: enhances SAM3 with an instruction-aware cascaded adaptation mechanism for instruction-driven image segmentation	open-vocabulary open vocabulary instruction following
9	Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot	Proposes a malicious image analysis method based on vision-language segmentation fusion, performing content detection, element identification, and localization in a single pass.	open-vocabulary open vocabulary
10	UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes	UTrice: unifies differentiable ray tracing and rasterization via triangles for rendering particle-based 3D scenes	splatting

🔬 Pillar 2: RL & Architecture (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
11	Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning	Semore: VLM-guided enhanced semantic motion representations for visual reinforcement learning	reinforcement learning motion representation large language model
12	ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning	Proposes ARM-Thinker to address the verification problem in multimodal reward models	reinforcement learning multimodal instruction following
13	COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence	COOPER: a unified model for cooperative perception and reasoning in spatial intelligence	reinforcement learning spatial relationship large language model
14	EgoLCD: Egocentric Video Generation with Long Context Diffusion	EgoLCD: an egocentric video generation framework based on long-context diffusion	world model egocentric embodied AI	✅
15	Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks	Proposes stable single-pixel contrastive learning for semantic and geometric tasks	contrastive learning teacher-student
16	Generative Neural Video Compression via Video Diffusion Prior	Proposes GNVC-VD, a generative neural video compression framework built on a video diffusion prior, to address temporal flickering in perceptual video compression.	flow matching foundation model
17	ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching	ReflexFlow: mitigates exposure bias in flow matching by rethinking the learning objective	flow matching
18	Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation	Proposes the Reward Forcing framework for efficient, high-quality streaming video generation, addressing initial-frame copying and insufficient motion dynamics.	distillation

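🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
19	RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation	Proposes RAMEN, a resolution-adjustable multimodal encoder for Earth observation data analysis.	foundation model multimodal	✅
20	Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark	Proposes VRT-Bench, a Visual Reasoning Tracer benchmark for evaluating the object-level reasoning ability of multimodal large language models.	large language model multimodal visual grounding
21	EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture	EMMA: an efficient unified architecture for multimodal understanding, generation, and editing	multimodal
22	SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding	Proposes SEASON, which mitigates temporal hallucination in video large language models via self-diagnostic contrastive decoding	large language model
23	Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment	Proposes the SANTA framework, which mitigates object and action hallucinations in multimodal LLMs via self-augmented contrastive alignment	multimodal
24	Reflection Removal through Efficient Adaptation of Diffusion Transformers	Proposes a reflection removal method based on efficient adaptation of diffusion Transformers, significantly improving image restoration quality.	foundation model
25	Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild	Proposes the MIND model and the ConvoInsight-DB dataset to address visual ambiguity and evaluation challenges in psychological analysis of in-the-wild conversations.	visual grounding
26	I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models	Proposes I2I-Bench, a comprehensive benchmark for image-to-image editing models.	multimodal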

🔬 Pillar 1: Robot Control (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
27	X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale	X-Humanoid: generates humanoid robot videos at scale by robotizing human videos	humanoid humanoid robot world model
28	FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization	FASTer: efficient autoregressive vision-language-action modeling via neural action tokenization	manipulation cross-embodiment vision-language-action
29	DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation	DraCo: a draft-based chain-of-thought method for text-to-image preview and rare concept generation	manipulation classifier-free guidance large language model
30	Towards Cross-View Point Correspondence in Vision-Language Models	Proposes CrossPoint-Bench and the CroPond model to tackle cross-view point correspondence in vision-language models	manipulation affordance embodied AI	✅
31	Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints	Proposes an occluded-object reconstruction method based on generative priors and contact constraints, improving robot manipulation performance.	manipulation
32	BulletTime: Decoupled Control of Time and Camera Pose for Video Generation	BulletTime: a video generation framework that decouples time and camera pose for precise 4D control.	manipulation	✅

🔬 Pillar 4: Generative Motion (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
33	Controllable Long-term Motion Generation with Extended Joint Targets	COMET: a Transformer-based framework for real-time, controllable long-term human motion generation	motion generation long-term motion generation character control
34	Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model	Human motion generation with diffusion models: an in-depth analysis of how motion representation affects performance	motion diffusion model MDM motion diffusion
35	Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image	Proposes MoRe4D, which jointly performs 3D geometry reconstruction and motion generation to synthesize 4D scenes from a single image.	motion generation spatiotemporal	✅
36	Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing	Proposes BioTUCH, which uses bioimpedance sensing to refine human pose estimation in challenging self-contact scenarios.	motion generation contact-aware

🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
37	Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence	Proposes HeFT, which leverages video diffusion priors for robust zero-shot point tracking	spatiotemporal foundation model
38	Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model	Proposes a recurrent fusion model for detecting intoxication from facial video	spatiotemporal
39	WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism	Proposes an attention-based method for WiFi cross-domain gesture recognition that improves generalization.	spatiotemporal

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
40	E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving	E3AD: an emotion-aware end-to-end autonomous driving model that improves the human-machine interaction experience	egocentric motion estimation vision-language-action
41	Age-Inclusive 3D Human Mesh Recovery for Action-Preserving Data Anonymization	Proposes the AionHMR framework for age-inclusive 3D human mesh recovery, used for privacy-preserving data anonymization.	human mesh recovery SMPL

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
42	PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement	Proposes PhyVLLM, a physics-guided video language model with motion-appearance disentanglement that improves physical reasoning.	motion representation large language model multimodal
