cs.CV (2026-02-27)

📊 42 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗5) · Pillar 2: RL & Architecture (10 🔗2) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 6: Video Extraction (3) · Pillar 4: Generative Motion (1) · Pillar 1: Robot Control (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1 🔗1)

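🔬 Pillar 9: Embodied Foundation Models (16 papers)

# | Title | Key point | Tags | 🔗
1 | Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation | Proposes AIR, an adaptive visual reinforcement framework that mitigates hallucination in multimodal large language models | large language model, multimodal
2 | GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models | GuardAlign: a safety defense framework for multimodal large language models based on test-time alignment | large language model, multimodal
3 | Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping | Venus: improves the aesthetic guidance and cropping abilities of multimodal large language models | large language model, multimodal
4 | Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics | Proposes TASOT, which uses multimodal optimal transport for unsupervised temporal segmentation of surgical robotics videos | multimodal, zero-shot transfer
5 | A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification | Proposes a multimodal slice discovery framework for detecting and explaining systematic errors in medical image classification | multimodal
6 | PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning | PointCoT: proposes a multimodal benchmark for 3D geometric reasoning, addressing geometric hallucination in MLLM point cloud understanding | large language model, multimodal, chain-of-thought
7 | Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models | Proposes a robust few-shot 3D vessel segmentation method built on pretrained models that handles data scarcity and domain shift | foundation model
8 | Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand | rs-embed: on-demand remote sensing foundation model embeddings, addressing the heterogeneity problem | foundation model
9 | Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering | Proposes SDLS, which suppresses prior-comparison hallucinations in radiology report generation through semantically decoupled latent steering | large language model, foundation model, zero-shot transfer
10 | Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? | VGUBench reveals shortcomings of unified multimodal large models in cross-modal semantic alignment | large language model, multimodal
11 | HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit | HiDrop: improves the efficiency of multimodal large language models through hierarchical vision token reduction | large language model, multimodal
12 | Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection | Proposes SteerVAD, which steers and rectifies latent representation manifolds in frozen multimodal LLMs for video anomaly detection | large language model
13 | Interpretable Debiasing of Vision-Language Models for Social Fairness | Proposes DeBiasLens, which removes social bias from vision-language models in an interpretable way | multimodal
14 | Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks | Proposes the Ref-Adv benchmark, revealing the limits of MLLM visual reasoning in referring expression comprehension | multimodal
15 | UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking | UTPTrack: proposes a simple, unified token pruning framework to improve visual tracking efficiency | multimodal
16 | DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model | Proposes DLEBench for evaluating how well instruction-based image editing models edit small objects | instruction following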

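🔬 Pillar 2: RL & Architecture (10 papers)

# | Title | Key point | Tags | 🔗
17 | Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought | Proposes Numerical Visual Chain-of-Thought (NV-CoT), enabling image reasoning over continuous coordinates in multimodal large language models | reinforcement learning, large language model, multimodal
18 | Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning | Proposes the HART framework, which uses reinforcement learning to achieve annotation-free visual reasoning in high-resolution large models | reinforcement learning, multimodal
19 | Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models | Proposes a pseudo contrastive learning method that improves multimodal models' performance on diagram comprehension | contrastive learning, multimodal
20 | VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact free video | VideoPulse: contact-free estimation of neonatal heart rate and blood oxygen saturation from facial video | MAE, PULSE
21 | DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution | Proposes DACESR, which uses degradation-aware conditional embedding to enhance real-world image super-resolution | Mamba, contrastive learning, multimodal
22 | A Mixed Diet Makes DINO An Omnivorous Vision Encoder | Proposes the Omnivorous Vision Encoder to address cross-modal feature alignment in DINOv2 | distillation, foundation model
23 | Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation | Proposes the RETA framework, which improves the generalization of dataset distillation through dynamic retrieval and topological alignment | distillation
24 | MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation | MSVBench: an evaluation benchmark for multi-shot video generation that targets human-level assessment | world model, multimodal
25 | See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent | Proposes Sea², which achieves unsupervised cross-domain visual adaptation through a personalized VLM-guided agent | reinforcement learning, visual grounding
26 | AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors | AHAP: proposes a framework for 3D human reconstruction from arbitrary viewpoints that requires no camera calibration | contrastive learning, SMPL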

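🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | Key point | Tags | 🔗
27 | Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives | Proposes a compact 3D Gaussian splatting method with adaptive pruning and difference-of-Gaussian primitives, improving rendering quality while significantly reducing model size | 3D gaussian splatting, 3DGS, gaussian splatting
28 | SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting | Proposes SR3R, which performs super-resolution 3D reconstruction via feed-forward Gaussian splatting, improving generalization and real-time performance | 3D gaussian splatting, 3DGS, gaussian splatting
29 | Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition | Proposes ReSeg-CLIP for open-vocabulary semantic segmentation of remote sensing data | open-vocabulary, open vocabulary
30 | Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning | Proposes HDFLIM, which aligns frozen language and image models via hyperdimensional computing for efficient image captioning | semantic mapping, semantic map, foundation model
31 | Evidential Neural Radiance Fields | Proposes Evidential NeRF, achieving high-quality scene reconstruction and uncertainty quantification in NeRF | NeRF, neural radiance field, scene reconstruction
32 | DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer | DiffusionHarmonizer: bridges neural reconstruction and photorealistic simulation with an online diffusion enhancer | 3D gaussian splatting, gaussian splatting, splatting
33 | No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency | Proposes a cross-modal view synthesis method that needs neither calibration nor depth information and achieves 3D consistency | 3D gaussian splatting, 3DGS, gaussian splatting
34 | Altitude-Aware Visual Place Recognition in Top-Down View | Proposes an altitude-adaptive visual place recognition method to handle altitude variation | depth estimation, metric depth
35 | Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion | Proposes a dual-branch feature extraction and fusion network that improves micro-expression recognition accuracy | optical flow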

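🔬 Pillar 6: Video Extraction (3 papers)

# | Title | Key point | Tags | 🔗
36 | AoE: Always-on Egocentric Human Video Collection for Embodied AI | Proposes the AoE system, which uses smartphones and neck-worn mounts to collect egocentric human interaction video at low cost and high efficiency | egocentric, embodied AI, foundation model
37 | EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding | Proposes EgoGraph for understanding ultra-long egocentric video, addressing the weakness of existing methods in modeling long-term dependencies | egocentric
38 | Egocentric Visibility-Aware Human Pose Estimation | Proposes EvaPose, which addresses keypoint invisibility in egocentric human pose estimation | egocentric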

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | Key point | Tags | 🔗
39 | U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation | U-Mind: a unified framework for real-time multimodal interaction that supports audiovisual generation | motion generation, multimodal, instruction following

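🔬 Pillar 1: Robot Control (1 paper)

# | Title | Key point | Tags | 🔗
40 | Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation | Proposes an action-geometry prediction method based on a 3D geometric prior for bimanual manipulation tasks | manipulation, bi-manual, bimanual manipulation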

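🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point | Tags | 🔗
41 | FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking | Proposes FocusTrack, which achieves high-performance 3D point cloud object tracking with a one-stage focus-and-suppress framework | motion estimation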

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | Key point | Tags | 🔗
42 | SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking | SpikeTrack: a spike-driven framework for efficient visual tracking | spatiotemporal
