cs.CV (2026-05-11)

📊 63 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗5) · Pillar 3: Perception & Semantics (18 🔗5) · Pillar 2: RL & Architecture (16 🔗4) · Pillar 8: Physics-based Animation (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 7: Motion Retargeting (1)

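🔬 Pillar 9: Embodied Foundation Models (23 papers)

#	Title	One-line summary	Tags	🔗
1	MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph	MicroWorld: strengthens MLLM reasoning in the microscopic domain via a multimodal attribute graph	large language model multimodal	✅
2	CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models	Proposes CapVector, which enhances the capabilities of vision-language-action models in a lightweight way through decoupling in parameter space	vision-language-action VLA
3	Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning	Proposes the DRAPE framework, which tackles catastrophic forgetting in multimodal continual instruction tuning through dynamic cross-modal prompt generation	large language model multimodal
4	EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving	Proposes EnergyLens, a closed-form energy model based on symbolic regression for energy-efficiency optimization of multimodal LLM inference	large language model multimodal
5	SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation	Proposes SciVQR, a multidisciplinary multimodal benchmark for comprehensively evaluating large models on complex scientific reasoning	large language model multimodal	✅
6	C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving	Proposes the C-CoT counterfactual chain-of-thought framework, which uses vision-language models to improve the safety of autonomous driving decisions	chain-of-thought
7	Personal Visual Context Learning in Large Multimodal Models	Proposes the Personal Visual Context Learning (Personal VCL) framework and an Agentic Context Bank to improve large models' understanding of user-specific visual information	multimodal
8	Qwen-Image-2.0 Technical Report	Qwen-Image-2.0: an all-round image generation foundation model that unifies high-fidelity generation and precise editing	foundation model multimodal instruction following
9	BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization	Proposes the BGG framework, which adapts a vision foundation model to bridge the geometric gap between cross-view images and improve geo-localization performance	foundation model
10	ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models	Proposes ViSRA, a training-free video-based spatial reasoning agent aimed at improving the 3D spatial understanding of multimodal LLMs	large language model
11	TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models	Proposes the TOC-Bench benchmark to evaluate video LLMs' reasoning about temporal object consistency	large language model
12	Count Anything at Any Granularity	Proposes the multi-granularity counting framework HieraCount and the large-scale dataset KubriCount for accurate object counting in the open world	large language model multimodal	✅
13	V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning	Proposes the V-ABS framework, which addresses IAO bias in the dynamic visual reasoning of multimodal LLMs through action-observer driven beam search	large language model multimodal
14	ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning	Proposes the ERASE framework, which addresses computational redundancy in multimodal LLMs through adaptive two-stage visual token pruning	large language model multimodal	✅
15	Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment	Proposes the PRAF-Attack framework, which improves the transferability of black-box attacks on MLLMs through progressive resolution processing and adaptive feature alignment	large language model multimodal
16	The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space	Proposes the Polaris-Bench benchmark to expose multimodal LLMs' reliance on the Cartesian shortcut in visual reasoning	large language model multimodal
17	BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation	Proposes the BabelDOC framework, which translates PDF documents with high layout fidelity via an intermediate representation (IR)	multimodal
18	Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization	Proposes UJEM-KL, an untargeted jailbreak method based on entropy maximization that significantly improves attack transferability against vision-language models	multimodal
19	AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State	Proposes the AllocMV framework, which generates music videos efficiently using structured persistent state and a multiple-choice knapsack formulation	multimodal
20	Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination	Proposes the HAVAE intervention strategy, which mitigates LVLM hallucination by identifying and suppressing the "vocabulary hijacking" phenomenon	multimodal	✅
21	Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence	Proposes the TwNV framework, which enhances large models' spatial reasoning through generative novel view synthesis	multimodal
22	Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection	Proposes the Sens-VisualNews benchmark dataset to advance research on detecting sensational content in news images	multimodal
23	SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation	Proposes the SleepWalk benchmark for stress-testing instruction-guided vision-language navigation and embodied reasoning	multimodal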

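🔬 Pillar 3: Perception & Semantics (18 papers)

#	Title	One-line summary	Tags	🔗
24	AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting	Proposes AdaptSplat, which improves the geometric fidelity of feed-forward 3D Gaussian splatting with a lightweight frequency-preserving adapter	3D gaussian splatting 3DGS gaussian splatting	✅
25	PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction	Proposes the PaMoSplat framework for high-fidelity dynamic scene reconstruction through part awareness and motion guidance	3DGS gaussian splatting splatting
26	TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering	Proposes the TransmissiveGS framework, which achieves high-fidelity reconstruction and rendering of transmissive scenes through residual-guided disentangled Gaussian splatting	gaussian splatting splatting scene reconstruction
27	Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation	Proposes the GraphDepth architecture, which fuses a CNN and a GNN for efficient monocular depth estimation	depth estimation monocular depth spatial relationship
28	UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting	Proposes a UAV-based landslide scan-to-simulation framework built on physics-informed Gaussian splatting (PIGS)	3DGS gaussian splatting splatting
29	Neuromorphic Monocular Depth Estimation with Uncertainty Modeling	Proposes a monocular depth estimation method based on neuromorphic vision that improves the reliability of depth predictions through uncertainty modeling	depth estimation monocular depth
30	DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions	DySurface: consistent 4D surface reconstruction by bridging explicit Gaussians and implicit functions	3D gaussian splatting 3DGS gaussian splatting
31	CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation	Proposes the CADBench multimodal benchmark for systematically evaluating the performance and robustness of AI-assisted CAD program generation	3D reconstruction multimodal	✅
32	Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction	Proposes a rapid forest fuel load estimation method based on virtual remote sensing and metric-scale feed-forward 3D reconstruction	3D reconstruction VGGT geometric consistency
33	BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry	Proposes BathyFacto, an underwater bathymetry method based on refraction-aware two-media neural radiance fields	NeRF neural radiance field
34	SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis	Proposes the SDTalk framework, which uses structured facial priors and dual-branch motion fields for generalizable face synthesis with 3D Gaussian splatting	3D gaussian splatting 3DGS gaussian splatting
35	3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects	Releases the large-scale 3DReflecNet dataset, aimed at the challenge of 3D reconstruction of reflective, transparent, and low-texture objects	3D reconstruction
36	GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth	Proposes the GemDepth framework, which achieves accurate, 3D-consistent video depth estimation through geometry-embedded features	depth estimation geometric consistency	✅
37	Predicting 3D structure by latent posterior sampling	Proposes a 3D structure prediction method based on latent posterior sampling	3D reconstruction NeRF
38	DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer	Proposes the DetRefiner framework, which refines open-vocabulary detection in a model-agnostic way using a feature fusion Transformer	open-vocabulary open vocabulary	✅
39	OpenSGA: Efficient 3D Scene Graph Alignment in the Open World	Proposes the OpenSGA framework for efficient open-world 3D scene graph alignment through multimodal fusion and spatial context	scene understanding
40	Pixal3D: Pixel-Aligned 3D Generation from Images	Proposes Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity conversion of images into 3D assets	3D reconstruction	✅
41	Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning	Proposes the spatial prediction (SP) pre-training task, which strengthens the structured representations of self-supervised learning by modeling local geometric relations	depth estimation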

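🔬 Pillar 2: RL & Architecture (16 papers)

#	Title	One-line summary	Tags	🔗
42	CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving	Proposes the CoWorld-VLA multi-expert world model framework, which enhances end-to-end planning for autonomous driving through explicit world representations	world model world models spatiotemporal	✅
43	Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection	Proposes Thermal-Det, the first open-vocabulary thermal object detection framework supervised by a large language model	distillation open-vocabulary open vocabulary
44	MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning	Proposes the MTA-RL framework for robust urban autonomous driving through multimodal Transformer-based 3D affordances and reinforcement learning	reinforcement learning reward shaping scene understanding
45	Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse	A systematic survey of generalist game agents that builds a "game multiverse" research framework on the path toward artificial general intelligence (AGI)	reinforcement learning generalist agent foundation model
46	Is Your Driving World Model an All-Around Player?	Proposes the WorldLens benchmark and evaluation suite to comprehensively quantify the physical and behavioral fidelity of autonomous driving world models	world model world models geometric consistency
47	Developing a foundation model for high-resolution remote sensing data of the Netherlands	Proposes a remote sensing foundation model combining a CNN and a ViT that learns feature representations efficiently through temporal data augmentation	representation learning foundation model
48	Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization	Proposes the Omni-Persona benchmark framework to systematically evaluate and improve the omnimodal personalization of multimodal LLMs	reward design large language model multimodal
49	DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving	Proposes the DeepSight world model for long-horizon end-to-end autonomous driving through latent state prediction in BEV space	world model world models	✅
50	PhyGround: Benchmarking Physical Reasoning in Generative World Models	Proposes the PhyGround benchmark and the PhyJudge model to systematically evaluate physical reasoning in generative world models	world model world models	✅
51	Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning	Proposes the Sync-R1 framework, which jointly optimizes multimodal personalized understanding and generation through cooperative reinforcement learning	reinforcement learning multimodal	✅
52	Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities	An evaluation study of global slum detection and density mapping based on AlphaEarth Foundations representation learning	representation learning foundation model
53	Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection	Proposes the HGC-Det framework, which uses hyperbolic geometric constraints for cross-modal distillation in multimodal 3D object detection	distillation multimodal
54	Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention	Proposes the Polygon-Mamba network, which improves segmentation accuracy for fine retinal vessels through polygon scanning and space-frequency collaborative attention	Mamba state space model
55	Increasing the Efficiency of DETR for Maritime High-Resolution Images	Proposes an efficient DETR object detection method based on ViM and token pruning for maritime high-resolution images	Mamba SSM state space model
56	PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows	Proposes the PixelFlowCast framework for efficient, high-fidelity, latent-free precipitation nowcasting via pixel-level mean flows	flow matching spatiotemporal
57	Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable	Argues that life-logging video streams make a privacy-utility trade-off unavoidable, and calls for an end-to-end privacy protection framework	world model world models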

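🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-line summary	Tags	🔗
58	iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning	Proposes the iPay multimodal integrated framework for accurate payment action recognition in in-vehicle scenarios through adaptive spatial prior learning	spatiotemporal multimodal	✅
59	EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs	Proposes EchoPrune, which achieves efficient long-video understanding by interpreting redundant video tokens as temporal echoes	spatiotemporal large language model
60	SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation	Proposes SocialDirector, a training-free framework for controlling social interactions in multi-person video generation	spatiotemporal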

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗
61	MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery	Proposes the MoPO framework, which addresses human mesh recovery under occlusion by incorporating a motion prior	human mesh recovery human motion human motion prediction
62	EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding	Proposes the EgoMemReason benchmark, targeting the challenge of memory-driven reasoning in long-horizon egocentric video understanding	egocentric multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗
63	Geometric 4D Stitching for Grounded 4D Generation	Proposes the Geometric 4D Stitching framework for efficient and geometrically consistent 4D scene generation and extension	geometric consistency
