cs.CV (2026-03-16)

📊 61 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

- Pillar 9: Embodied Foundation Models (17, 🔗 2)
- Pillar 3: Perception & Semantics (14, 🔗 2)
- Pillar 2: RL & Architecture (12, 🔗 1)
- Pillar 1: Robot Control (11, 🔗 3)
- Pillar 8: Physics-based Animation (3, 🔗 1)
- Pillar 4: Generative Motion (2)
- Pillar 5: Interaction & Reaction (1)
- Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 1 | Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC | Proposes a multimodal deep learning framework for predicting pathological response after neoadjuvant therapy in non-small cell lung cancer. | foundation model, multimodal |
| 2 | MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal | Introduces MER-Bench, a comprehensive benchmark for multimodal meme reappraisal. | large language model, multimodal |
| 3 | DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery | DamageArbiter: a CLIP-enhanced multimodal arbitration framework for hurricane damage assessment from street-view imagery. | large language model, multimodal |
| 4 | Multimodal Connectome Fusion via Cross-Attention for Autism Spectrum Disorder Classification Using Graph Learning | Proposes a cross-attention-based multimodal graph learning framework for autism spectrum disorder classification. | multimodal |
| 5 | VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents | VAREX: a benchmark for evaluating multimodal structured information extraction from documents. | foundation model, multimodal, instruction following |
| 6 | Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation | Proposes the TAEMI framework, using text anchoring and cross-modal attention to improve the robustness of emotional mimicry intensity estimation under noise. | multimodal |
| 7 | Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task | Introduces the EscapeCraft-4D environment to evaluate large models' temporal awareness and cross-modal active perception. | large language model, multimodal |
| 8 | GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents | Introduces GUI-CEval to address the lack of evaluation for Chinese mobile GUI agents. | large language model, multimodal |
| 9 | A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding | Proposes the SAMA framework and the MVX-Bench benchmark to improve cross-video reasoning in multi-video understanding. | large language model, multimodal |
| 10 | Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments | Addresses severe domain shift in skeleton-based action recognition with a calibration method based on a fine-tuned gating mechanism. | zero-shot transfer |
| 11 | HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning | HalDec-Bench: a comprehensive benchmark for hallucination detection in image captioning. | multimodal |
| 12 | HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization | Proposes HYDRA, unifying multimodal generation and understanding via representation-harmonized tokenization. | multimodal |
| 13 | Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding | Proposes the QViC-MF framework, using memory feedback to improve the modeling of temporal events in long-video understanding. | multimodal |
| 14 | MMSpec: Benchmarking Speculative Decoding for Vision-Language Models | MMSpec: a benchmark for speculative decoding in vision-language models, together with the ViSkip acceleration method. | multimodal |
| 15 | GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM | Proposes GT-PCQA, using an MLLM to address insufficient sensitivity to geometric structure in point cloud quality assessment. | large language model |
| 16 | Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs | Proposes PromPrune, achieving adaptive visual-token compression in VLMs via semantic-prominence-aware budget allocation. | multimodal |
| 17 | Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection | Proposes Two Birds, One Projection, harmonizing safety and utility in LVLMs via inference-time feature projection. | large language model |

🔬 Pillar 3: Perception & Semantics (14 papers)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 18 | E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction | E2EGS: event-to-edge Gaussian splatting for pose-free 3D reconstruction. | depth estimation, 3D gaussian splatting, 3DGS |
| 19 | Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation | SpecDepth: parameter-efficient adaptation of foundation models for colonoscopy depth estimation via spectral rectification. | depth estimation, monocular depth, foundation model |
| 20 | Panoramic Affordance Prediction | Proposes the PAP framework for affordance prediction on panoramic images, and builds the large-scale PAP-12K dataset. | scene understanding, affordance, spatial relationship |
| 21 | AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving | AutoMoT: a unified vision-language-action model with asynchronous mixture-of-Transformers for end-to-end autonomous driving. | scene understanding, vision-language-action, VLA |
| 22 | IRIS: Intersection-aware Ray-based Implicit Editable Scenes | IRIS: intersection-aware ray-based implicit editable scenes, enabling efficient interactive editing. | 3D gaussian splatting, gaussian splatting, splatting |
| 23 | RieMind: Geometry-Grounded Spatial Agent for Scene Understanding | RieMind: a geometry-grounded spatial agent for scene understanding. | scene understanding, spatial relationship |
| 24 | Fractal Autoregressive Depth Estimation with Continuous Token Diffusion | Proposes a fractal autoregressive diffusion framework for monocular depth estimation, addressing the RGB-D modality gap and generation efficiency. | depth estimation, monocular depth |
| 25 | Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3 | Proposes a recurrent-network-based depth estimation method for monocular thermal-image ORB-SLAM3 localization in low-light environments. | depth estimation |
| 26 | Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image | DynaAvatar: zero-shot reconstruction of animatable 3D human avatars with cloth dynamics from a single image. | optical flow, SMPL, SMPL-X |
| 27 | Pointing-Based Object Recognition | Proposes a pointing-gesture-based object recognition pipeline to improve target recognition accuracy in human-robot interaction. | depth estimation, monocular depth |
| 28 | Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation | Proposes the EDA-PSeg framework, addressing geometric distortion and semantic inconsistency in cross-domain panoramic semantic segmentation. | scene understanding |
| 29 | Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context | Proposes an adaptive residual context network for detecting autonomous shuttles in urban traffic images. | scene understanding |
| 30 | Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization | Proposes FreeOmniMVS, achieving reference-free omnidirectional stereo matching via multi-view consistency maximization. | depth estimation |
| 31 | F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling | Proposes F²HDR, achieving high-quality HDR video reconstruction via a flow adapter and physical motion modeling. | optical flow |

🔬 Pillar 2: RL & Architecture (12 papers)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 32 | Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching | Proposes Riemannian Motion Generation (RMG) to address non-Euclidean geometric modeling in human motion generation. | flow matching, motion generation, human motion |
| 33 | TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective | Proposes TrajMamba, an ego-motion-guided Mamba model for pedestrian trajectory prediction from an egocentric perspective. | Mamba, egocentric, TAMP |
| 34 | Self-Distillation of Hidden Layers for Self-Supervised Representation Learning | Proposes Bootleg, improving self-supervised representation learning via self-distillation of multiple hidden layers. | representation learning, MAE, distillation |
| 35 | CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models | CyCLeGen: cycle-consistent layout prediction and image generation in vision foundation models. | reinforcement learning, foundation model |
| 36 | SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space | SemanticFace: semantic facial action estimation via semantic distillation in an interpretable space. | distillation, large language model, multimodal |
| 37 | AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild | Introduces the AURORA-KITTI dataset and builds DDCD, a distillation-based depth completion and denoising baseline, improving robustness in adverse weather. | distillation, metric depth, scene understanding |
| 38 | GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering | Proposes GlyphPrinter, achieving glyph-accurate visual text rendering via region-grouped direct preference optimization. | reinforcement learning, DPO, direct preference optimization |
| 39 | DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer | Proposes DAIT, distilling vision-language-model knowledge into lightweight classifiers via adaptive intermediate teacher transfer. | distillation, multimodal |
| 40 | EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing | Introduces EditHF-1M, a million-scale human preference feedback dataset and reward model for image editing. | reinforcement learning, large language model, multimodal |
| 41 | Trajectory-Diversity-Driven Robust Vision-and-Language Navigation | Proposes NavGRPO, improving the robustness of vision-and-language navigation via trajectory-diversity-driven reinforcement learning. | reinforcement learning, imitation learning, VLN |
| 42 | Real-Time Human Frontal View Synthesis from a Single Image | PrismMirror: a geometry-guided framework for real-time frontal-view synthesis of humans from a single monocular image. | linear attention, SMPL, SMPL-X |
| 43 | Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization | Proposes FSENet, using facial features to enhance boundary identification in weakly-supervised temporal sentiment localization. | contrastive learning, multimodal |

🔬 Pillar 1: Robot Control (11 papers)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 44 | Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery | Fast SAM 3D Body: accelerates SAM 3D Body for real-time full-body human mesh recovery. | humanoid, humanoid control, manipulation |
| 45 | HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions | HSImul3R: a physics-engine-in-the-loop method for 3D reconstruction of simulation-ready human-scene interactions. | humanoid, humanoid robot, reinforcement learning |
| 46 | Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation | WorldDrive: unifies vision and motion representations to bridge scene generation and planning with a world model. | motion planning, world model, motion representation |
| 47 | MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model | MVHOI: bridges multi-view conditions via a 3D foundation model for complex human-object interaction video reenactment. | manipulation, human-object interaction, HOI |
| 48 | Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models | DeepVision-VLA: enhances vision foundation representations to improve the manipulation performance of vision-language-action models. | manipulation, vision-language-action, VLA |
| 49 | Towards Generalizable Robotic Manipulation in Dynamic Environments | Proposes the PUMA model and the DOMINO dataset to improve the generalization of robotic manipulation in dynamic environments. | manipulation, optical flow, spatiotemporal |
| 50 | Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning | Edit2Interp: adapts image-editing foundation models to video frame interpolation with few-shot learning. | manipulation, motion estimation, foundation model |
| 51 | RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation | RealVLG-R1: a large-scale real-world visual-language grounding benchmark for robotic perception and manipulation. | manipulation, policy learning, multimodal |
| 52 | Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion | Tri-Prompting: a unified framework for joint control over scene, subject, and motion in video diffusion models. | manipulation |
| 53 | AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer | AC-Foley: reference-audio-guided video-to-audio synthesis with fine-grained acoustic transfer. | manipulation |
| 54 | Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents | Proposes a dual-channel contrastive classification method to defend against visual confused-deputy attacks on computer-using agents. | manipulation |

🔬 Pillar 8: Physics-based Animation (3 papers)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 55 | HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System | HiMemVLN: enhances the reliability of open-source zero-shot vision-and-language navigation with a hierarchical memory system. | spatiotemporal, VLN, multimodal |
| 56 | AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation | AnyCrowd: proposes instance-isolated identity-pose binding for arbitrary multi-character animation. | character animation |
| 57 | Efficient Event Camera Volume System | Proposes EECVS, an efficient event camera volume system using adaptive compression to improve downstream task performance. | PULSE, TAMP |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 58 | Kimodo: Scaling Controllable Human Motion Generation | Kimodo: a controllable human motion generation model built on large-scale motion-capture data. | motion diffusion model, motion diffusion, motion synthesis |
| 59 | ReactMotion: Generating Reactive Listener Motions from Speaker Utterance | Proposes ReactMotion for generating reactive listener motions from speaker utterances. | motion generation |

🔬 Pillar 5: Interaction & Reaction (1 paper)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 60 | Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling | Introduces the F2F-JF multi-person interaction video dataset for modeling reactive timing in interpersonal conversation. | multi-person interaction |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line Summary | Tags |
|---|---|---|---|
| 61 | GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis | GeoNVS: a geometry-grounded video diffusion model for high-quality novel view synthesis. | geometric consistency |
