cs.CV(2026-03-19)

📊 61 papers in total | 🔗 10 with code

🎯 Areas of Interest

Pillar 9: Embodied Foundation Models (20 🔗3) · Pillar 2: RL & Architecture (14 🔗2) · Pillar 3: Perception & Semantics (11 🔗2) · Pillar 1: Robot Control (6 🔗1) · Pillar 4: Generative Motion (5) · Pillar 6: Video Extraction (3 🔗1) · Pillar 7: Motion Retargeting (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (20 papers)

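# | Title | One-line summary | Tags | 🔗
1 | Towards Interpretable Foundation Models for Retinal Fundus Images | Proposes Dual-IFM, an interpretable foundation model for retinal fundus images | foundation model
2 | LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs | LVOmniBench: the first benchmark for evaluating long audio-video understanding in omnimodal LLMs | large language model, multimodal
3 | CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models | Proposes the CoDA framework to evaluate and improve the robustness of medical vision-language models in clinical workflows | large language model, multimodal
4 | To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs | Proposes a three-layer diagnostic framework that reveals visual sycophancy and split beliefs in vision-language models | instruction following, visual grounding
5 | Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens | Proposes CubiD, the first diffusion model over high-dimensional discrete representations, for visual generation | multimodal
6 | Tinted Frames: Question Framing Blinds Vision-Language Models | Reveals the sensitivity of vision-language models to question framing and proposes a prompt-tuning method to improve visual attention | visual grounding
7 | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues | SAVeS: steers safety judgments in vision-language models via semantic cues | multimodal
8 | SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation | SignAgent: uses agentic LLMs for linguistically grounded sign language annotation and dataset curation | large language model
9 | SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation | SwiftTailor: efficient 3D garment generation using a geometry image representation | multimodal
10 | SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models | Proposes Sparse Embedding Modulation (SEM) for post-hoc debiasing of vision-language models | multimodal
11 | Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token | Proposes SELF1E, which performs decoder-free image segmentation in multimodal large language models (MLLMs) using a single segmentation token | large language model
12 | Motion-o: Trajectory-Grounded Video Reasoning | Proposes Motion-o, which strengthens spatiotemporal reasoning in video understanding through explicit trajectory reasoning | chain-of-thought
13 | Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning | Proposes a dual model based on speaker emotional expressiveness to predict affective engagement and vocal attractiveness in video learning | multimodal
14 | Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA | Proposes Click-to-Ask, an AI assistant for live-streaming e-commerce that provides offline copywriting and online interactive QA | multimodal
15 | T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World | Proposes the T-QPM framework to strengthen OOD detection and domain generalization of vision-language models in dynamic open-world settings | multimodal
16 | Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis | Gastric-X: a multimodal, multi-phase benchmark dataset for gastric cancer analysis that advances vision-language models | multimodal
17 | Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following | Proposes an instruction-free tuning method that improves medical vision-language models on instruction-following tasks | instruction following
18 | ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding | Proposes ReXInTheWild, a unified benchmark for evaluating how well vision-language models understand medical photographs | large language model, multimodal
19 | Narrative Aligned Long Form Video Question Answering | Proposes the NA-VQA benchmark and the Video-NaRA framework to address narrative reasoning in long videos | large language model, multimodal
20 | Tinted Frames: Question Framing Blinds Vision-Language Models | Reveals the sensitivity of vision-language models to question framing and proposes a prompt-tuning method to improve visual grounding | visual grounding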

🔬 Pillar 2: RL & Architecture (14 papers)

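# | Title | One-line summary | Tags | 🔗
21 | Multimodal Model for Computational Pathology: Representation Learning and Image Compression | Multimodal models for computational pathology: a survey of representation learning and image compression for WSIs | representation learning, foundation model, multimodal
22 | Few-shot Acoustic Synthesis with Multimodal Flow Matching | Proposes FLAC, which uses multimodal flow matching for few-shot acoustic synthesis | flow matching, PULSE, multimodal
23 | DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection | Proposes DA-Mamba, which uses a domain-aware state space model for global-local alignment in domain-adaptive object detection | Mamba, SSM, state space model
24 | From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions | AI-driven protein science: from static structure prediction to generative dynamics and multimodal interactions | flow matching, multimodal
25 | Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders | Explores new options for visual encoding in VLMs by evaluating state space models as an alternative to Vision Transformers | SSM, state space model, large language model
26 | DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding | DriveTok: proposes a 3D driving-scene tokenization method for unified multi-view reconstruction and understanding | world model, vision-language-action, foundation model
27 | EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation | EdgeCrafter: compact ViTs for dense prediction on edge devices, obtained via task-specialized distillation | representation learning, distillation
28 | Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation | Proposes the Switch framework, which uses multiscale and frequency-domain information to improve semi-supervised and contrastive learning for medical ultrasound image segmentation | contrastive learning, teacher-student
29 | Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching | Proposes an unsupervised, contrastive-learning-based method for 3D spectral shape matching that improves efficiency and robustness | contrastive learning
30 | Foundations and Architectures of Artificial Intelligence for Motor Insurance | Proposes a vertically integrated AI framework for motor insurance that automates vehicle risk assessment and claims processing | representation learning, multimodal
31 | CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think | CRAFT: a new, efficient fine-tuning method for aligning diffusion models with substantially lower data requirements | reinforcement learning, DPO
32 | ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models | ProactiveBench: a benchmark for evaluating proactiveness in multimodal large language models | reinforcement learning, large language model, multimodal
33 | Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity | Proposes SemanticFL, which uses the semantic consistency of diffusion models to address multimodal heterogeneity in federated learning | contrastive learning, multimodal
34 | LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray | Proposes LoFi, which uses location-aware fine-grained representation learning to improve chest X-ray retrieval and phrase grounding | representation learning, large language model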

🔬 Pillar 3: Perception & Semantics (11 papers)

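# | Title | One-line summary | Tags | 🔗
35 | OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting | Proposes OnlinePG for online open-vocabulary panoptic mapping based on 3D Gaussian Splatting | 3D gaussian splatting, gaussian splatting, splatting
36 | Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting | Proposes Splat2BEV, which learns geometry-aligned BEV representations through 3D Gaussian Splatting to improve autonomous-driving perception | 3D gaussian splatting, gaussian splatting, splatting
37 | GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning | GSMem: uses 3D Gaussian Splatting as persistent spatial memory for zero-shot embodied exploration and reasoning | 3D gaussian splatting, 3DGS, gaussian splatting
38 | Matryoshka Gaussian Splatting | Proposes Matryoshka Gaussian Splatting, which provides continuous LoD for 3DGS without sacrificing full-capacity rendering quality | 3D gaussian splatting, 3DGS, gaussian splatting
39 | VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation | VGGT-360: proposes a geometry-consistent framework for zero-shot panoramic depth estimation | depth estimation, VGGT, foundation model
40 | GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting | GHOST: fast, category-agnostic reconstruction of hand-object interaction from RGB videos using Gaussian Splatting | gaussian splatting, splatting, embodied AI
41 | HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering | HiMu: a hierarchical multimodal frame selection framework for long video question answering that improves efficiency and accuracy | open-vocabulary, open vocabulary, multimodal
42 | DROID-SLAM in the Wild | Proposes a DROID-SLAM based on differentiable-uncertainty bundle adjustment to address robust SLAM in dynamic environments | DROID-SLAM
43 | SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction | SEAR: a simple and efficient adaptation method for visual geometric Transformers, for RGB+thermal 3D reconstruction | scene reconstruction, multimodal
44 | Matryoshka Gaussian Splatting | Proposes Matryoshka Gaussian Splatting, which enables continuous level-of-detail adjustment for 3DGS without losing full-capacity performance | 3D gaussian splatting, 3DGS, gaussian splatting
45 | dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3 | dinov3.seg: uses DINOv3 for open-vocabulary semantic segmentation, improving segmentation accuracy and robustness in complex scenes | open-vocabulary, open vocabulary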

🔬 Pillar 1: Robot Control (6 papers)

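# | Title | One-line summary | Tags | 🔗
46 | Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding | Uses the implicit 3D priors of video generation models to improve scene understanding | manipulation, scene understanding, spatiotemporal
47 | Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning | Inst4DGS: instance-decomposed 4D Gaussian Splatting based on multi-video label permutation learning | trajectory optimization, gaussian splatting, splatting
48 | MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model | Proposes the MultihopSpatial benchmark for evaluating vision-language models on multi-hop compositional spatial reasoning, and applies it to vision-language-action agents | manipulation, reinforcement learning, vision-language-action
49 | Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer | Proposes the Ontology-Guided Diffusion (OGD) framework for zero-shot visual Sim2Real transfer | sim2real
50 | MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction | MonoArt: a progressive structural reasoning method for reconstructing articulated 3D objects from monocular images | manipulation, scene reconstruction
51 | Recognising BSL Fingerspelling in Continuous Signing Sequences | Proposes FS23K, a large-scale BSL fingerspelling dataset, and designs a recognition model that fuses bi-manual interaction and mouthing cues | bi-manual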

🔬 Pillar 4: Generative Motion (5 papers)

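# | Title | One-line summary | Tags | 🔗
52 | OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data | OpenT2M: an open-source, large-scale, high-quality text-to-motion dataset together with the MonoFrill model | text-to-motion, motion generation, motion tokenizer
53 | Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer | Proposes MoTok, a diffusion-based discrete motion tokenizer that bridges semantic and kinematic conditions in motion generation | motion synthesis, motion generation, motion tokenizer
54 | PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance | PhysVideo: generates physically plausible videos using cross-view geometry guidance | physically plausible
55 | Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation | Perceptio: enhances the perception of vision-language models through spatial token generation | VQ-VAE, chain-of-thought
56 | Improving Joint Audio-Video Generation with Cross-Modal Context Learning | Proposes Cross-Modal Context Learning (CCL) to improve the quality and training efficiency of joint audio-video generation | classifier-free guidance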

🔬 Pillar 6: Video Extraction (3 papers)

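# | Title | One-line summary | Tags | 🔗
57 | NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data | NymeriaPlus: extends a large-scale egocentric activity dataset with additional annotations and data | SMPL, egocentric, human motion
58 | SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation | Proposes SR-Nav, which uses spatial relationships to strengthen perception and planning in zero-shot object goal navigation | egocentric, spatial relationship, foundation model
59 | SurfaceXR: Fusing Smartwatch IMUs and Egocentric Hand Pose for Seamless Surface Interactions | SurfaceXR: fuses smartwatch IMUs with egocentric hand pose for seamless surface interaction | egocentric, egocentric vision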

🔬 Pillar 7: Motion Retargeting (2 papers)

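# | Title | One-line summary | Tags | 🔗
60 | Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos | Proposes the SGC metric for evaluating 3D spatial geometric consistency in dynamic generated videos | geometric consistency
61 | TexEditor: Structure-Preserving Text-Driven Texture Editing | TexEditor: proposes a structure-preserving, text-driven texture editing method | structure preservation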
