cs.CV (2026-05-28)

📊 81 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (26 🔗7) | Pillar 3: Perception & Semantics (23 🔗4) | Pillar 2: RL & Architecture (16 🔗4) | Pillar 1: Robot Control (8) | Pillar 4: Generative Motion (3 🔗1) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 6: Video Extraction (2) | Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (26 papers)

#	Title	One-line summary	Tags	🔗
1	VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies	Proposes VisualThink-VLA, which uses visual intermediate reasoning for efficient, low-latency vision-language-action policies.	vision-language-action VLA chain-of-thought
2	Archon: A Unified Multimodal Model for Holistic Digital Human Generation	Archon: a unified multimodal model for holistic digital human generation.	multimodal	✅
3	DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark	DocRetriever: a plug-and-play framework for multimodal document retrieval, with a comprehensive benchmark.	multimodal
4	VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents	VideoFDB: the first full-duplex audio-visual conversation benchmark, evaluating the non-verbal interaction abilities of conversational agents.	multimodal visual grounding
5	AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection	Proposes AnomalyAgent, a training-free agentic model for zero-/few-shot anomaly detection.	large language model multimodal
6	Genetically Aligned Patient Representations Improve Hematological Diagnosis	Proposes genetically aligned patient representations that improve hematological diagnosis.	foundation model multimodal	✅
7	Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering	Proposes BRACS, which mitigates hallucination in vision-language models through adaptive closed-form steering.	multimodal visual grounding
8	SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation	SuperVoxelGPT: adaptive and ordered 3D tokenization for autoregressive shape generation.	large language model multimodal
9	CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning	CogniVerse: a multimodal retrieval-augmented generation framework combining cognitive reflection and geometric reasoning.	large language model multimodal
10	Grounded 3D-Aware Spatial Vision-Language Modeling	Proposes GR3D, a spatial vision-language model with explicit and implicit 2D grounding and monocular 3D grounding.	chain-of-thought
11	LoMo: Local Modality Substitution for Deeper Vision-Language Fusion	Proposes LoMo, a local modality substitution method that improves the robustness of cross-modal fusion in vision-language models.	multimodal
12	Unveiling the Visual Counting Bottleneck in Vision-Language Models	Reveals the visual counting bottleneck in vision-language models: failed symbol mapping leads to poor extrapolation.	foundation model
13	EarlyTom: Early Token Compression Completes Fast Video Understanding	EarlyTom: early token compression speeds up video understanding and significantly reduces latency.	large language model
14	Masked Diffusion Vision-Language Models for Temporal Action Localization	Proposes MDVLM-TAL, which uses masked diffusion models to address hard-to-correct temporal boundaries in temporal action localization.	language conditioned
15	Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models	Pocket-Dentist: on-device dental image understanding via efficient multimodal large language models.	large language model multimodal
16	ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation	ReactBench: a cause-driven benchmark for multimodal hallucination that systematically evaluates vision-language models.	large language model multimodal chain-of-thought	✅
17	WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction	Proposes WorldMemArena to evaluate multimodal agent memory in action-world interaction.	large language model multimodal
18	DMC-CF: Dynamic Multimodal CounterFactual QA benchmark for Causal Reasoning	Proposes DMC-CF, a dynamic multimodal counterfactual QA benchmark for causal reasoning.	large language model multimodal
19	PInVerify: An Offline Embodied Benchmark for Active Instance Verification	Proposes PInVerify, an offline embodied benchmark for active instance verification.	embodied AI large language model multimodal	✅
20	Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset	Proposes CFMME, a comprehensive Chinese financial multimodal evaluation dataset for benchmarking large vision-language models.	multimodal
21	SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation	SuperVoxelGPT: adaptive and ordered 3D tokenization for autoregressive shape generation.	large language model multimodal
22	ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models	Proposes ReGuLaR, a framework that strengthens latent reasoning in large vision-language models through relation-graph reasoning.	chain-of-thought
23	On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection	Proposes a GDPR-compliant on-device generative AI visual monitoring system that performs local object detection and generates natural language alerts.	large language model
24	GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection	Proposes GiPL, which tackles cross-domain few-shot object detection via generative augmented iterative pseudo-labeling.	foundation model	✅
25	FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation	FlowSeg: a dynamic semantic guidance mechanism that improves LLM-conditioned image segmentation.	large language model	✅
26	FedSmoothLoRA: Toward Smoother and Faster Convergence in Federated Low-Rank Adaptation	FedSmoothLoRA: a method for smoother and faster convergence in federated low-rank adaptation.	foundation model	✅

🔬 Pillar 3: Perception & Semantics (23 papers)

#	Title	One-line summary	Tags	🔗
27	Supercharging Thermal Gaussian Splatting with Depth Estimation	Proposes TDg, a method based on thermal infrared images and depth estimation that accelerates and improves 3D Gaussian splatting.	depth estimation 3D gaussian splatting 3D reconstruction
28	PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions	PhyGenHOI: a physically aware 4D generation framework for dynamic human-object interactions.	3DGS motion diffusion model MDM	✅
29	DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding	DGSG-Mind: dynamic 3D Gaussian scene graphs for long-term scene understanding and grounding.	scene reconstruction scene understanding semantic mapping	✅
30	Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field	Proposes an uncertainty-driven active mapping method for 3D Gaussian splatting based on an anisotropic visibility field.	3D gaussian splatting 3DGS gaussian splatting
31	From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments	ViTA: adapts vision foundation models for reliable terrain traversability estimation in unstructured environments.	traversability foundation model
32	FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views	FRUC: feedforward dynamic scene reconstruction from uncalibrated collaborative driving views.	3D gaussian splatting gaussian splatting splatting
33	OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics	OmniCD: a foundational framework for remote sensing image change detection guided by multimodal semantics.	semantic map multimodal
34	City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images	City-Mesh3R: reconstructs simulation-ready city-scale 3D meshes from multi-view images.	3D reconstruction gaussian splatting splatting
35	REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image	REST3D: a physically constrained single-image 3D scene reconstruction framework that improves the physical stability of scenes.	scene understanding penetration human-object interaction
36	Large Depth Completion Model from Sparse Observations	Proposes LDCM, a large-scale Transformer-based sparse depth completion model.	depth estimation metric depth foundation model
37	Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence	Proposes a semantic correspondence learning framework based on 3D priors that improves the model's awareness of 3D structure.	sam 3D SAM 3D foundation model
38	MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos	MonoPhysics: joint estimation of geometry, appearance, and physical parameters from monocular videos.	3D gaussian splatting gaussian splatting splatting	✅
39	Déjà View: Looping Transformers for Multi-View 3D Reconstruction	Déjà View: looping Transformers for multi-view 3D reconstruction, improving efficiency and performance.	3D reconstruction
40	Towards Consistent Video Geometry Estimation	ViGeo: a general feedforward model for spatiotemporally consistent geometry estimation on video sequences.	depth estimation foundation model
41	DVSM: Decoder-only View Synthesis Model Done Right	DVSM: a decoder-only view synthesis model that outperforms conventional encoder-decoder architectures.	3DGS foundation model
42	GMOS: Grounding Moving Object Segmentation in 3D Space and Time	Proposes the GMOS framework to address missing 3D information in moving object segmentation.	optical flow
43	BitC-3DGS: High-Capacity 3D Gaussian Splatting Watermarking via Bit Compression	BitC-3DGS: high-capacity 3D Gaussian splatting watermarking via bit compression.	3D gaussian splatting 3DGS gaussian splatting
44	Comparative evaluation of photogrammetric reconstruction methods and 3D Gaussian Splatting for road surface roughness analysis	Compares four 3D reconstruction methods for assessing road surface roughness.	3D gaussian splatting 3DGS 3D reconstruction
45	DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding	Proposes DGSG-Mind to address fragile instance association in dynamic 3D scene understanding.	scene reconstruction scene understanding semantic mapping	✅
46	Learning Representations from 3D Gaussian Splats	Evaluates geometric deep learning for scene understanding on 3D Gaussian splats.	3D gaussian splatting 3DGS gaussian splatting
47	Déjà View: Looping Transformers for Multi-View 3D Reconstruction	Déjà View: looping Transformers for multi-view 3D reconstruction, improving efficiency and performance.	3D reconstruction
48	Towards Consistent Video Geometry Estimation	Proposes ViGeo, a general feedforward model for spatiotemporally consistent geometry estimation on video sequences.	depth estimation foundation model
49	VLM3: Vision Language Models Are Native 3D Learners	VLM3: uses vision-language models for native 3D scene understanding.	depth estimation

🔬 Pillar 2: RL & Architecture (16 papers)

#	Title	One-line summary	Tags	🔗
50	SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World	SAM3D-Phys: recovers complete object geometry, ready for interactive simulation, from reconstructed real-world scenes.	distillation scene reconstruction sam 3D	✅
51	minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models	minWM: a full-stack open-source framework for real-time interactive video world models.	world model world models distillation	✅
52	AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning	Proposes AgentCVR, which addresses evidence acquisition in cross-video reasoning via script-simulated reinforcement learning.	reinforcement learning large language model multimodal	✅
53	EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation	Proposes EVL-ECG, which achieves efficient electrocardiogram (ECG) interpretation through heterogeneous knowledge distillation.	distillation feature matching foundation model
54	FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection	FakeVLM-R1: improves synthetic image detection by internalizing physical laws via chain-of-thought.	imitation learning multimodal chain-of-thought
55	SLAD : Shared LoRA Adapters for Task Specific Distillation	Proposes SLAD, shared LoRA adapters for task-specific distillation that improve small-model performance.	distillation foundation model
56	LiveSVG: Zero-Shot SVG Animation via Video Generation	LiveSVG: zero-shot SVG animation via video generation.	distillation motion representation
57	xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR	Proposes xModel-KD, which uses cross-modal knowledge distillation to improve LiDAR point-cloud 3D scene perception.	distillation scene understanding
58	Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning	Stable-Layers: fine-tunes image layer decomposition models with VLM-scored reinforcement learning, without paired supervision.	reinforcement learning
59	Reinforcement Learning with Robust Rubric Rewards	Proposes RLR³, reinforcement learning with robust rubric rewards, to improve performance on vision-language tasks.	reinforcement learning
60	SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation	Proposes SGMD, a score gradient matching distillation method for few-step video diffusion distillation.	distillation	✅
61	GeoMag: Geometric-Aware Video Motion Magnification via State Space Model	Proposes GeoMag, a geometry-aware video motion magnification method based on a state space model.	state space model
62	NeuROK: Generative 4D Neural Object Kinematics	NeuROK: generative 4D neural object kinematics for realistic object deformation simulation.	world model world models
63	Clustering Guided Domain-Specific Pretrained Foundation Model Very High-Resolution Arctic Remote Sensing	Proposes a clustering-guided, domain-specific pretrained model to improve Arctic remote sensing analysis.	masked autoencoder MAE foundation model
64	UniNote: A Unified Embedding Model for Multimodal Representation and Ranking	Proposes UniNote to tackle multimodal representation and ranking challenges in industrial-scale item-to-item retrieval.	reinforcement learning representation learning multimodal
65	Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization	Proposes Guidance Contrastive Policy Optimization (GCPO) for token-level credit assignment in discrete policy optimization.	reinforcement learning policy learning chain-of-thought

🔬 Pillar 1: Robot Control (8 papers)

#	Title	One-line summary	Tags
66	SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation	Proposes SAFE-Pruner, which accelerates VLA model inference through semantic attention-guided, future-aware token pruning.	manipulation vision-language-action VLA
67	YoCausal: How Far is Video Generation from World Model? A Causality Perspective	YoCausal: evaluates the gap between video generation models and world models from a causality perspective.	sim-to-real world model world models
68	Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning	Proposes a geometry-guided deformation learning framework for generalizable object shape reconstruction.	manipulation
69	Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation	Dex2HOI: a dual-stream diffusion model for generating dexterous bimanual two-object interactions.	manipulation bi-manual motion synthesis
70	Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning	Mitigates state aliasing in vision-language-action models via inverse dynamics learning.	manipulation vision-language-action VLA
71	SalsaAgent: A multimodal embodied language model for interactive dance generation	SalsaAgent: a multimodal embodied language model for interactive dance generation.	humanoid large language model multimodal
72	Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes	Reframes industrial visual sim-to-real: a review of CAD-guided and CAD-unavailable methods organized by prior availability.	sim-to-real teacher-student
73	ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects	ParCo-SDF: learns prior-free partial-to-complete SDF reconstruction of deformable objects.	manipulation

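🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-line summary	Tags	🔗
74	Colored Noise Diffusion Sampling	Proposes Colored Noise Sampling (CNS), which improves diffusion-model image synthesis quality through frequency-decoupled energy transfer.	classifier-free guidance	✅
75	S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields	Proposes S2MDF, a plug-and-play module that resolves intersections in multi-object SDF representations.	penetration
76	AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling	AnyMo: a general any-modality conditional motion generation framework based on masked modeling.	motion synthesis motion generation motion tokenizer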

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗
77	GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver	GenEraser: a generalizable video object removal framework based on balanced text-mask guidance and a decoupled locator-preserver.	spatiotemporal multimodal	✅
78	Veda: Scalable Video Diffusion via Distilled Sparse Attention	Veda: scalable video diffusion via distilled sparse attention.	spatiotemporal

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗
79	Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball	Proposes Mesh-Aware Epipolar Matching for multi-person 3D pose estimation in basketball games.	human mesh recovery
80	Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge	Proposes a framework that fuses semantic and visual evidence for long-video question answering.	egocentric large language model multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗
81	Turbulence-Robust Dynamic Object Segmentation with Multi-Signal Priors and SAM2 Refinement	Proposes a turbulence-robust dynamic object segmentation method based on multi-signal priors and SAM2 refinement.	motion estimation
