cs.CV (2026-05-21)

📊 61 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (29 🔗8) | Pillar 2: RL & Architecture (12 🔗5) | Pillar 3: Perception & Semantics (12 🔗1) | Pillar 7: Motion Retargeting (4) | Pillar 1: Robot Control (2) | Pillar 4: Generative Motion (1) | Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (29 papers)

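#	Title	One-line summary	Tags	🔗
1	Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis	Proposes an MLLM enhancement scheme that fuses multimodal information for safety-critical driving video analysis	large language model, multimodal
2	PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought	PointLLM-R: enhances 3D point cloud reasoning via chain-of-thought	multimodal, instruction following, chain-of-thought
3	AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture	AgroTools: a benchmark for tool-augmented multimodal agents in agriculture	large language model, multimodal	✅
4	Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding	Proposes the Seizure-Semiology-Suite dataset and benchmark for evaluating and improving multimodal large models' understanding of seizure semiology	large language model, multimodal
5	Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure	Noise-robust training for frozen vision foundation models: a cross-dataset benchmark and a case study of small-loss failure	foundation model
6	MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding	Proposes the MOTOR dataset for understanding two-wheeler rider behavior	multimodal
7	Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement	Proposes a case-aware medical image classification framework based on multimodal knowledge graphs and reliability-guided refinement	multimodal
8	Bernini: Latent Semantic Planning for Video Diffusion	Bernini: proposes a video diffusion model based on latent semantic planning for high-quality video generation and editing	large language model, multimodal, chain-of-thought
9	Accelerating Vision Foundation Models with Drop-in Depthwise Convolution	Proposes a drop-in replacement based on depthwise convolution to accelerate vision foundation models	foundation model
10	VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results	VISTA: integrates spatial and temporal foundation models with anatomical decoding for rare-pathology VCE event detection	foundation model
11	AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding	AgroVG: a large-scale multi-source benchmark dataset for agricultural visual grounding	visual grounding
12	Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction	Proposes a two-stage multimodal fusion framework for emotion mimicry intensity prediction; placed third in the Hume-ABAW10 challenge	multimodal
13	Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models	Proposes the Director-Experts (DEX) model to address representation collapse caused by non-IID features in multi-modality medical imaging	foundation model	✅
14	GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT	Proposes the GLeVE framework, which uses graph guidance and proposal verification for precise lesion grounding in 3D CT images	foundation model, multimodal
15	VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis	Proposes VGenST-Bench, which evaluates spatio-temporal reasoning in multimodal large language models via active video synthesis	large language model, multimodal
16	GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning	GeoWeaver: proposes a pre-reasoning geometric grounding framework that improves spatio-temporal reasoning in vision-language models	large language model, multimodal	✅
17	FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning	Proposes FashionLens for versatile fashion image retrieval	large language model, multimodal	✅
18	EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning	Proposes EvoIR-Agent, a self-evolving image restoration agentic system built on experience-driven learning	large language model, multimodal
19	Zero-Shot Temporal Action Localization Through Textual Guidance	Proposes TEGU, which uses textual guidance for zero-shot temporal action localization without training data	large language model, zero-shot transfer
20	MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues	Reveals and recovers temporal grounding via attention cues, improving MLLM performance on video temporal grounding	large language model, multimodal	✅
21	Cambrian-P: Pose-Grounded Video Understanding	Cambrian-P: proposes a multimodal video understanding model grounded in camera pose that improves spatial reasoning	multimodal
22	DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders	DecQ: uses detail-condensing queries to enhance reconstruction and generation in representation autoencoders	foundation model
23	Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models	Proposes CEDAR, a sparse disentangling transformation that improves the interpretability of vision-language model embeddings without increasing dimensionality	multimodal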
24	SceneAligner: 3D-Grounded Floorplan Localization in the Wild	SceneAligner: floorplan localization in the wild based on 3D scene reconstruction	foundation model
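25	Translating Signals to Languages for sEMG-Based Activity Recognition	Proposes the LLM-sEMG framework, which uses large language models for high-accuracy sEMG-based activity recognition	large language model
26	Direct content-based retrieval from music scores images	Proposes a method for direct content-based retrieval from music score images to improve the efficiency of music information retrieval	large language model
27	EventGait: Towards Robust Gait Recognition with Event Streams	EventGait: robust gait recognition from event streams, performing especially well in low-light conditions	foundation model	✅
28	GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery	GenHAR: a generalization framework for cross-domain human activity recognition in last-mile delivery	foundation model	✅
29	Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception	Thermo-VL: extends vision-language models to thermal infrared perception, improving scene understanding in low-light conditions	visual grounding	✅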

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-line summary	Tags	🔗
30	Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts	Pre-VLA: proposes a preemptive runtime verification architecture for reliable VLA model and world-model rollouts	world model, world models, vision-language-action
31	LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model	LVDrive: a vision-language-action autonomous driving model enhanced with latent visual representations	world model, world models, representation learning
32	CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models	CrossVLA: cross-paradigm post-training and inference optimization for VLA models	flow matching, DPO, vision-language-action	✅
33	Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution	Proposes FlowGS for continuous-scale super-resolution of remote sensing images, improving inference efficiency	flow matching, gaussian splatting, splatting
34	Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning	Proposes CRPO to improve the spatiotemporal sensitivity of video LLMs	reinforcement learning, spatiotemporal, large language model	✅
35	EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models	EvoVid: a temporal-centric self-evolution framework for video large language models	reinforcement learning, large language model
36	SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation	SegCompass: uses sparse autoencoders for interpretable alignment, improving reasoning segmentation performance	reinforcement learning, large language model, chain-of-thought	✅
37	From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding	Proposes the ReceiptBench benchmark and uses metric-aware reinforcement learning to improve MLLM performance on real-world receipt understanding	reinforcement learning, large language model, multimodal	✅
38	RiT: Vanilla Diffusion Transformers Suffice in Representation Space	RiT: vanilla Diffusion Transformers alone achieve efficient image generation in representation space	flow matching, representation learning, distillation	✅
39	Ultra-High-Definition Image Quality Assessment via Graph Representation Learning	Proposes UHD-GCN-BIQA, a model based on graph representation learning that improves ultra-high-definition image quality assessment	representation learning
40	TextTeacher: What Can Language Teach About Images?	TextTeacher: uses language model knowledge to improve image classification models	distillation, multimodal
41	Visual-Advantage On-Policy Distillation for Vision-Language Models	Proposes Visual-Advantage On-Policy Distillation to increase vision-language models' reliance on visual input	distillation

🔬 Pillar 3: Perception & Semantics (12 papers)

#	Title	One-line summary	Tags	🔗
42	GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation	Proposes GA-VLN, which uses a geometry-aware BEV representation to improve the efficiency and performance of vision-language navigation	3D reconstruction, geometric consistency, VLN
43	ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting	ForeSplat: optimization-aware foresight training that accelerates 3D Gaussian splatting reconstruction	3D gaussian splatting, 3DGS, 3D reconstruction
44	TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting	TWINGS: an initialization method for sparse-view Gaussian splatting based on thin plate splines	3D gaussian splatting, 3DGS, gaussian splatting
45	SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation	Proposes SpaceDG, the first large-scale benchmark dataset for evaluating the spatial intelligence of multimodal large models under visual degradation	3D gaussian splatting, 3DGS, gaussian splatting
46	4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting	Proposes 4D-GSW for spatio-temporally consistent watermark embedding in 4D Gaussian splatting	gaussian splatting, splatting, spatiotemporal
47	Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving	Proposes Sensor2Sensor, which converts dashcam video into the multimodal sensor data needed for autonomous driving	gaussian splatting, splatting, cross-embodiment
48	H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning	Proposes H-Flow, which estimates human scene flow through physics-inspired self-supervised multi-modal learning	scene flow, human motion
49	Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following	Proposes head-conditioned local LoRA and an out-of-frustum penalty to enhance gaze reasoning in vision foundation models	scene understanding, foundation model
50	No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos	NoPo4D: the first system for feed-forward dynamic Gaussian modeling from unposed multi-view videos	3D gaussian splatting, gaussian splatting, splatting
51	Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction	Proposes GenRe, a diffusion-guided generalizable enhancer that improves urban scene reconstruction quality at unseen viewpoints	scene reconstruction
52	COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition	Proposes the COCOTree dataset and benchmark for open tree-structured visual decomposition	open-vocabulary, open vocabulary	✅
53	GazePrior: Zero-Shot AR/VR Eye Tracking via Learned 3D Gaze Reconstruction	GazePrior: zero-shot AR/VR eye tracking via learned 3D gaze reconstruction	3D reconstruction

🔬 Pillar 7: Motion Retargeting (4 papers)

#	Title	One-line summary	Tags
54	Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction	Proposes a context-guided diffusion model with energy-based joint refinement to balance diversity and consistency in multi-agent motion prediction	human motion, human motion prediction, motion prediction
55	AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild	AnyMo: geometry-aware, setup-agnostic human motion modeling for wearable devices	human motion, motion representation
56	AtomicMotion: Learning Human Motion From Different Human Parts	AtomicMotion: learns human motion by decoupling body parts, improving immersive AR/VR experiences	human motion
57	SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data	SADGE: estimates the performance gap between synthetic and real data from structure and appearance domain gaps	geometric consistency

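🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
58	From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model	Proposes BehaviorVLA, which learns temporally coherent behavioral representations to improve VLA model generalization under distribution shift	manipulation, sim-to-real, Mamba
59	Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion	Proposes RTS, which improves the test-time performance of diffusion models through reward-guided sparse scaling	trajectory optimization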

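🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
60	Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates	Proposes MDIC, a framework for distributed image compression at extremely low bitrates using multimodal side information	VQ-VAE, multimodal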

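🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
61	Time-varying rPPG signal separation via block-sparse signal model	Proposes a time-varying rPPG signal separation method based on a block-sparse signal model, addressing signal extraction under illumination changes	PULSE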
