cs.CV (2026-04-15)

📊 34 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗1) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 1: Robot Control (3 🔗1) · Pillar 5: Interaction & Reaction (3 🔗2) · Pillar 2: RL & Architecture (3) · Pillar 6: Video Extraction (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1) · Other (1)

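🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models	Proposes Delta-LLaVA, a multimodal large language model framework that unifies remote sensing change detection and understanding.	large language model multimodal
2	Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding	Proposes the UniRect-CoT framework, which uses the inherent understanding ability of unified multimodal models to improve generation quality.	multimodal chain-of-thought
3	Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning	Proposes the FiMR framework, which enhances text-to-image generation through fine-grained multimodal reasoning.	large language model multimodal
4	Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks	Reveals why multimodal in-context learning lags behind and analyzes its inner mechanisms and bottlenecks.	large language model multimodal	✅
5	POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch	Proposes POINTS-Seeker, a multimodal agentic search model trained from scratch, to tackle long-horizon, knowledge-intensive visual reasoning.	multimodal
6	A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy	Proposes a coarse-to-fine registration framework that fuses multimodal clinical information for longitudinal CT registration in proton therapy.	multimodal
7	ROSE: Retrieval-Oriented Segmentation Enhancement	Proposes the ROSE framework, which uses retrieval augmentation to address the knowledge gaps of multimodal large language models when segmenting emerging entities.	large language model multimodal
8	Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios	Proposes the DailyClue benchmark to evaluate MLLMs' visual-clue-based reasoning in daily scenarios.	large language model multimodal
9	SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs	Proposes SLQ, which bridges modalities through shared latent queries to enable retrieval with frozen MLLMs.	large language model multimodal
10	One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding	Proposes LP-Comp and QC-Comp for extreme compression in long video understanding, improving VLM performance.	large language model
11	Training-Free Semantic Multi-Object Tracking with Vision-Language Models	Proposes TF-SMOT, a training-free semantic multi-object tracking framework that improves video understanding.	foundation model
12	Context Sensitivity Improves Human-Machine Visual Alignment	Proposes a context-sensitive similarity computation method that improves human-machine visual alignment.	foundation model
13	Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning	Proposes dynamic token selection and fine-tuning for efficient multi-view 3D object detection.	foundation model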

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis	Proposes Dehaze-then-Splat for smoke removal and novel view synthesis.	3D gaussian splatting 3DGS gaussian splatting
15	ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction	ClipGStream: proposes a Clip-Stream Gaussian splatting method for multi-view dynamic scene reconstruction of any length and any motion.	gaussian splatting splatting scene reconstruction	✅
16	VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation	VGGT-Segmentor: proposes a geometry-enhanced cross-view segmentation framework that addresses instance segmentation under viewpoint differences.	VGGT egocentric embodied AI
17	Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself	Proposes Free Geometry, which improves monocular 3D reconstruction accuracy through self-supervised fine-tuning.	Depth Anything VGGT foundation model	✅
18	PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction	PartNerFace: part-based neural radiance fields for animatable facial avatar reconstruction.	neural radiance field
19	Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation	Proposes a 3D wireframe reconstruction method based on generative depth estimation, converting a single sketch into a 3D model.	depth estimation
20	DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis	Proposes DF3DV-1K, a large-scale dataset for distractor-free novel view synthesis, to promote research on related methods.	3D gaussian splatting gaussian splatting splatting
21	Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens	Proposes IR4Net, which uses physically guided optical inversion to mount a non-contact side-channel attack on isolated screens.	semantic mapping semantic map

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System	HiVLA: a vision-centric hierarchical embodied manipulation system that decouples planning from control.	manipulation flow matching vision-language-action
23	ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation	ESCAPE: combines episodic spatial memory with an adaptive policy to solve long-horizon mobile manipulation tasks.	manipulation mobile manipulation embodied AI
24	Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation	Early visual cortex alignment makes vision-language models more resistant to adversarial inducement.	manipulation	✅

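🔬 Pillar 5: Interaction & Reaction (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
25	Towards Unconstrained Human-Object Interaction	Proposes the U-HOI task, using multimodal large language models to address unconstrained human-object interaction detection.	human-object interaction HOI large language model	✅
26	A Study of Failure Modes in Two-Stage Human-Object Interaction Detection	Studies the failure modes of two-stage HOI detection models in complex scenes and rare interactions.	human-object interaction HOI multi-person interaction
27	OneHOI: Unifying Human-Object Interaction Generation and Editing	OneHOI unifies human-object interaction generation and editing, enabling scene synthesis and interaction modification under mixed conditions.	human-object interaction HOI	✅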

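🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
28	Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking	MambaTrack: proposes an RGB-Event object tracking framework with event-adaptive state transition and gated fusion.	Mamba state space model multimodal
29	Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models	Proposes Audio-Contrastive Preference Optimization (ACPO) to address video-driven audio hallucination in audio-visual language models.	preference learning multimodal
30	Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective	Proposes a problem-driven perspective on feed-forward 3D scene modeling for efficient, general 3D reconstruction.	world model world models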

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
31	SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation	Proposes SceneGlue, which uses a scene-aware Transformer for feature matching without scene-level annotation.	feature matching	✅

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
32	SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance	SocialMirror: reconstructs 3D human interaction behaviors from monocular videos using semantic and geometric guidance.	spatial relationship

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
33	UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization	Proposes UniBlendNet for unified global, multi-scale, and region-adaptive modeling of ambient lighting normalization.	UniCon

📄 Other

#	Title	One-line summary	Tags	🔗	⭐
34	Geometric Context Transformer for Streaming 3D Reconstruction	Proposes LingBot-Map, based on a geometric context Transformer, for efficient and stable streaming 3D reconstruction.
