cs.CV (2026-04-15)

📊 34 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗1) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 1: Robot Control (3 🔗1) · Pillar 5: Interaction & Reaction (3 🔗2) · Pillar 2: RL & Architecture (3) · Pillar 6: Video Extraction (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1) · Other (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

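# | Title | One-line summary | Tags | 🔗
1 | Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models | Proposes Delta-LLaVA, a multimodal large language model framework that unifies remote sensing change detection and understanding. | large language model, multimodal
2 | Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding | Proposes the UniRect-CoT framework, which uses the inherent understanding ability of unified multimodal models to improve generation quality. | multimodal, chain-of-thought
3 | Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning | Proposes the FiMR framework, which enhances text-to-image generation through fine-grained multimodal reasoning. | large language model, multimodal
4 | Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks | Reveals why multimodal in-context learning lags behind and analyzes its inner mechanisms and bottlenecks. | large language model, multimodal
5 | POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch | Proposes POINTS-Seeker, a multimodal agentic search model trained from scratch, to tackle long-horizon, knowledge-intensive visual reasoning. | multimodal
6 | A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy | Proposes a coarse-to-fine registration framework that fuses multimodal clinical information for longitudinal CT registration in proton therapy. | multimodal
7 | ROSE: Retrieval-Oriented Segmentation Enhancement | Proposes the ROSE framework, which uses retrieval augmentation to address the knowledge gaps of multimodal large language models when segmenting emerging entities. | large language model, multimodal
8 | Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios | Proposes the DailyClue benchmark to evaluate the visual clue-driven reasoning of MLLMs in daily scenarios. | large language model, multimodal
9 | SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs | Proposes SLQ, which bridges modalities through shared latent queries to enable retrieval with frozen MLLMs. | large language model, multimodal
10 | One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding | Proposes LP-Comp and QC-Comp for extreme compression in long video understanding, improving VLM performance. | large language model
11 | Training-Free Semantic Multi-Object Tracking with Vision-Language Models | Proposes TF-SMOT, a training-free semantic multi-object tracking framework that improves video understanding. | foundation model
12 | Context Sensitivity Improves Human-Machine Visual Alignment | Proposes a context-sensitive similarity computation method that improves human-machine visual alignment. | foundation model
13 | Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning | Proposes dynamic token selection and fine-tuning for efficient multi-view 3D object detection. | foundation model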

🔬 Pillar 3: Perception & Semantics (8 papers)

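# | Title | One-line summary | Tags | 🔗
14 | Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis | Proposes Dehaze-then-Splat for smoke removal and novel view synthesis. | 3D gaussian splatting, 3DGS, gaussian splatting
15 | ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction | ClipGStream: a Clip-Stream Gaussian splatting method for multi-view dynamic scene reconstruction of any length and any motion. | gaussian splatting, splatting, scene reconstruction
16 | VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation | VGGT-Segmentor: a geometry-enhanced cross-view segmentation framework that addresses instance segmentation under viewpoint differences. | VGGT, egocentric, embodied AI
17 | Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself | Proposes Free Geometry, which improves monocular 3D reconstruction accuracy through self-supervised fine-tuning. | Depth Anything, VGGT, foundation model
18 | PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction | PartNerFace: part-based neural radiance fields for animatable facial avatar reconstruction. | neural radiance field
19 | Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation | Proposes a 3D wireframe reconstruction method based on generative depth estimation, converting a single sketch into a 3D model. | depth estimation
20 | DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis | Proposes DF3DV-1K, a large-scale dataset for distractor-free novel view synthesis, to support research on related methods. | 3D gaussian splatting, gaussian splatting, splatting
21 | Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens | Proposes IR4Net, which uses physically guided optical inversion to mount a non-contact side-channel attack on isolated screens. | semantic mapping, semantic map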

🔬 Pillar 1: Robot Control (3 papers)

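# | Title | One-line summary | Tags | 🔗
22 | HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System | HiVLA: a vision-centric hierarchical embodied manipulation system that decouples planning from control. | manipulation, flow matching, vision-language-action
23 | ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation | ESCAPE: combines episodic spatial memory with an adaptive policy to solve long-horizon mobile manipulation tasks. | manipulation, mobile manipulation, embodied AI
24 | Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation | Early visual cortex alignment improves the resistance of vision-language models to manipulation. | manipulation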

🔬 Pillar 5: Interaction & Reaction (3 papers)

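# | Title | One-line summary | Tags | 🔗
25 | Towards Unconstrained Human-Object Interaction | Proposes the U-HOI task and uses multimodal large language models to address unconstrained human-object interaction detection. | human-object interaction, HOI, large language model
26 | A Study of Failure Modes in Two-Stage Human-Object Interaction Detection | Studies the failure modes of two-stage HOI detection models in complex scenes and rare interactions. | human-object interaction, HOI, multi-person interaction
27 | OneHOI: Unifying Human-Object Interaction Generation and Editing | OneHOI unifies human-object interaction generation and editing, enabling scene synthesis and interaction modification under mixed conditions. | human-object interaction, HOI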

🔬 Pillar 2: RL & Architecture (3 papers)

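# | Title | One-line summary | Tags | 🔗
28 | Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking | MambaTrack: an RGB-Event object tracking framework with event-adaptive state transition and gated fusion. | Mamba, state space model, multimodal
29 | Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models | Proposes Audio-Contrastive Preference Optimization (ACPO) to address video-driven audio hallucination in audio-visual language models. | preference learning, multimodal
30 | Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective | Proposes a problem-driven perspective on feed-forward 3D scene modeling, aimed at efficient and general 3D reconstruction. | world model, world models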

🔬 Pillar 6: Video Extraction (1 paper)

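# | Title | One-line summary | Tags | 🔗
31 | SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation | Proposes SceneGlue, which uses a scene-aware Transformer for feature matching without scene-level annotation. | feature matching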

🔬 Pillar 7: Motion Retargeting (1 paper)

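# | Title | One-line summary | Tags | 🔗
32 | SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance | SocialMirror: reconstructs 3D human interaction behaviors from monocular videos using semantic and geometric guidance. | spatial relationship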

🔬 Pillar 8: Physics-based Animation (1 paper)

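# | Title | One-line summary | Tags | 🔗
33 | UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization | Proposes UniBlendNet for unified global, multi-scale, and region-adaptive modeling of ambient lighting normalization. | UniCon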

📄 Other

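# | Title | One-line summary | Tags | 🔗
34 | Geometric Context Transformer for Streaming 3D Reconstruction | Proposes LingBot-Map, based on a geometric context Transformer, for efficient and stable streaming 3D reconstruction. |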
