cs.CV (2026-05-21)

📊 61 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (29 🔗8) · Pillar 2: RL & Architecture (12 🔗5) · Pillar 3: Perception & Semantics (12 🔗1) · Pillar 7: Motion Retargeting (4) · Pillar 1: Robot Control (2) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (29 papers)

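# | Title | One-line summary | Tags | 🔗
1 | Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis | Proposes an MLLM enhancement scheme that fuses multimodal information for safety-critical driving video analysis | large language model, multimodal
2 | PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought | PointLLM-R: enhances 3D point cloud reasoning via chain-of-thought | multimodal, instruction following, chain-of-thought
3 | AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture | AgroTools: a benchmark for tool-augmented multimodal agents in agriculture | large language model, multimodal
4 | Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding | Proposes the Seizure-Semiology-Suite dataset and benchmark for evaluating and improving multimodal large models' understanding of seizure semiology | large language model, multimodal
5 | Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure | Noise-robust training for frozen vision foundation models: a cross-dataset benchmark and a case study of small-loss failure | foundation model
6 | MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding | Proposes the MOTOR dataset for two-wheeler rider behavior understanding | multimodal
7 | Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement | Proposes a case-aware medical image classification framework based on multimodal knowledge graphs and reliability-guided refinement | multimodal
8 | Bernini: Latent Semantic Planning for Video Diffusion | Bernini: proposes a video diffusion model based on latent semantic planning for high-quality video generation and editing | large language model, multimodal, chain-of-thought
9 | Accelerating Vision Foundation Models with Drop-in Depthwise Convolution | Proposes a depthwise-convolution-based drop-in replacement to accelerate vision foundation models | foundation model
10 | VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results | VISTA: integrates spatial and temporal foundation models with anatomical decoding for rare-pathology VCE event detection | foundation model
11 | AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding | AgroVG: a large-scale multi-source benchmark dataset for agricultural visual grounding | visual grounding
12 | Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction | Proposes a two-stage multimodal fusion framework for emotion mimicry intensity prediction, placing third in the Hume-ABAW10 challenge | multimodal
13 | Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models | Proposes the Director-Experts (DEX) model to address representation collapse caused by non-IID features in multi-modality medical imaging | foundation model
14 | GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT | Proposes the GLeVE framework, which localizes lesions precisely in 3D CT images through graph guidance and proposal verification | foundation model, multimodal
15 | VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis | Proposes VGenST-Bench, which evaluates spatio-temporal reasoning in multimodal large language models via active video synthesis | large language model, multimodal
16 | GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning | GeoWeaver: proposes a pre-reasoning geometric grounding framework that improves spatio-temporal reasoning in vision-language models | large language model, multimodal
17 | FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning | Proposes FashionLens for versatile fashion image retrieval | large language model, multimodal
18 | EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning | Proposes EvoIR-Agent, a self-evolving image restoration agentic system built on experience-driven learning | large language model, multimodal
19 | Zero-Shot Temporal Action Localization Through Textual Guidance | Proposes TEGU, which uses textual guidance for zero-shot temporal action localization without training data | large language model, zero-shot transfer
20 | MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues | Reveals and recovers temporal grounding via attention cues, improving MLLM performance on video temporal grounding tasks | large language model, multimodal
21 | Cambrian-P: Pose-Grounded Video Understanding | Cambrian-P: proposes a camera-pose-grounded multimodal video understanding model that improves spatial reasoning | multimodal
22 | DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders | DecQ: enhances reconstruction and generation in representation autoencoders via detail-condensing queries | foundation model
23 | Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models | Proposes CEDAR, which improves the interpretability of vision-language model embeddings through a sparse disentangling transformation without increasing dimensionality | multimodal
24 | SceneAligner: 3D-Grounded Floorplan Localization in the Wild | SceneAligner: an in-the-wild floorplan localization method based on 3D scene reconstruction | foundation model
25 | Translating Signals to Languages for sEMG-Based Activity Recognition | Proposes the LLM-sEMG framework, which uses large language models for high-accuracy sEMG-based activity recognition | large language model
26 | Direct content-based retrieval from music scores images | Proposes a method for direct content-based retrieval from music score images, improving music information retrieval efficiency | large language model
27 | EventGait: Towards Robust Gait Recognition with Event Streams | EventGait: uses event streams for robust gait recognition, performing especially well in low-light environments | foundation model
28 | GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery | GenHAR: a generalization framework for cross-domain human activity recognition in last-mile delivery | foundation model
29 | Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception | Thermo-VL: extends vision-language models to thermal infrared perception, improving low-light scene understanding | visual grounding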

🔬 Pillar 2: RL & Architecture (12 papers)

# | Title | One-line summary | Tags | 🔗
30 | Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts | Pre-VLA: proposes a preemptive runtime verification architecture for reliable VLA models and world models | world model, world models, vision-language-action
31 | LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model | LVDrive: a vision-language-action autonomous driving model enhanced with latent visual representations | world model, world models, representation learning
32 | CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models | CrossVLA: cross-paradigm post-training and inference optimization for VLA models | flow matching, DPO, vision-language-action
33 | Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution | Proposes FlowGS for continuous-scale super-resolution of remote sensing images, improving inference efficiency | flow matching, gaussian splatting, splatting
34 | Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning | Proposes CRPO to improve the spatiotemporal sensitivity of video LLMs | reinforcement learning, spatiotemporal, large language model
35 | EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models | EvoVid: a temporal-centric self-evolution framework for video large language models | reinforcement learning, large language model
36 | SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation | SegCompass: uses sparse autoencoders for interpretable alignment, improving reasoning segmentation performance | reinforcement learning, large language model, chain-of-thought
37 | From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding | Proposes the ReceiptBench benchmark and uses metric-aware reinforcement learning to improve MLLM performance on real-world receipt understanding | reinforcement learning, large language model, multimodal
38 | RiT: Vanilla Diffusion Transformers Suffice in Representation Space | RiT: achieves efficient image generation in representation space using only a vanilla Diffusion Transformer | flow matching, representation learning, distillation
39 | Ultra-High-Definition Image Quality Assessment via Graph Representation Learning | Proposes UHD-GCN-BIQA, a model based on graph representation learning that improves ultra-high-definition image quality assessment | representation learning
40 | TextTeacher: What Can Language Teach About Images? | TextTeacher: uses language model knowledge to improve image classification models | distillation, multimodal
41 | Visual-Advantage On-Policy Distillation for Vision-Language Models | Proposes Visual-Advantage On-Policy Distillation to increase vision-language models' reliance on visual input | distillation

🔬 Pillar 3: Perception & Semantics (12 papers)

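# | Title | One-line summary | Tags | 🔗
42 | GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation | Proposes GA-VLN, which uses a geometry-aware BEV representation to improve the efficiency and performance of vision-language navigation | 3D reconstruction, geometric consistency, VLN
43 | ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting | ForeSplat: optimization-aware foresight training that accelerates 3D Gaussian Splatting reconstruction | 3D gaussian splatting, 3DGS, 3D reconstruction
44 | TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting | TWINGS: a thin-plate-spline-based initialization method for sparse-view Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting
45 | SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation | Proposes SpaceDG, the first large-scale benchmark dataset for evaluating the spatial intelligence of multimodal large models under visual degradation | 3D gaussian splatting, 3DGS, gaussian splatting
46 | 4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting | Proposes 4D-GSW to address spatio-temporally consistent watermark embedding in 4D Gaussian Splatting | gaussian splatting, splatting, spatiotemporal
47 | Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving | Proposes Sensor2Sensor, which converts dashcam video into the multimodal sensor data required for autonomous driving | gaussian splatting, splatting, cross-embodiment
48 | H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning | Proposes H-Flow, which estimates human scene flow through physics-inspired self-supervised multimodal learning | scene flow, human motion
49 | Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following | Proposes head-conditioned local LoRA and an out-of-frustum penalty to enhance gaze reasoning in vision foundation models | scene understanding, foundation model
50 | No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos | NoPo4D: the first system for feed-forward dynamic Gaussian modeling from unposed multi-view videos | 3D gaussian splatting, gaussian splatting, splatting
51 | Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction | Proposes GenRe, a diffusion-guided generalizable enhancer that improves urban scene reconstruction quality at unseen viewpoints | scene reconstruction
52 | COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition | Proposes the COCOTree dataset and benchmark for open tree-structured visual decomposition | open-vocabulary, open vocabulary
53 | GazePrior: Zero-Shot AR/VR Eye Tracking via Learned 3D Gaze Reconstruction | GazePrior: zero-shot AR/VR eye tracking via learned 3D gaze reconstruction | 3D reconstruction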

🔬 Pillar 7: Motion Retargeting (4 papers)

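# | Title | One-line summary | Tags | 🔗
54 | Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction | Proposes a context-guided diffusion model with energy-based joint refinement to reconcile diversity and consistency in multi-agent motion prediction | human motion, human motion prediction, motion prediction
55 | AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild | AnyMo: geometry-aware, setup-agnostic human motion modeling for wearable devices | human motion, motion representation
56 | AtomicMotion: Learning Human Motion From Different Human Parts | AtomicMotion: learns human motion by decoupling human body parts, improving immersive AR/VR experiences | human motion
57 | SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data | SADGE: estimates the performance gap between synthetic and real data via structure and appearance domain differences | geometric consistency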

🔬 Pillar 1: Robot Control (2 papers)

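# | Title | One-line summary | Tags | 🔗
58 | From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model | Proposes BehaviorVLA, which learns temporally coherent behavioral representations to improve VLA model generalization under distribution shift | manipulation, sim-to-real, Mamba
59 | Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion | Proposes RTS, which improves the test-time performance of diffusion models through reward-guided sparse scaling | trajectory optimization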

🔬 Pillar 4: Generative Motion (1 paper)

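# | Title | One-line summary | Tags | 🔗
60 | Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates | Proposes MDIC, a framework for distributed image compression at extremely low bitrates using multimodal side information | VQ-VAE, multimodal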

🔬 Pillar 8: Physics-based Animation (1 paper)

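# | Title | One-line summary | Tags | 🔗
61 | Time-varying rPPG signal separation via block-sparse signal model | Proposes a time-varying rPPG signal separation method based on a block-sparse signal model, addressing signal extraction under illumination changes | PULSE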
