cs.CV (2026-05-11)

📊 63 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗5) · Pillar 3: Perception & Semantics (18 🔗5) · Pillar 2: RL & Architecture (16 🔗4) · Pillar 8: Physics-based Animation (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (23 papers)

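# | Title | One-sentence summary | Tags | 🔗
1 | MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph | MicroWorld: enhances MLLM reasoning in the microscopic domain through a multimodal attribute graph | large language model, multimodal
2 | CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models | Proposes CapVector, which achieves lightweight capability enhancement for vision-language-action models through parameter-space decoupling | vision-language-action, VLA
3 | Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning | Proposes the DRAPE framework, which addresses catastrophic forgetting in multimodal continual instruction tuning through dynamic cross-modal prompt generation | large language model, multimodal
4 | EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving | Proposes EnergyLens, a closed-form energy model based on symbolic regression, for energy-efficiency optimization of multimodal LLM inference | large language model, multimodal
5 | SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation | Proposes SciVQR, a multidisciplinary multimodal benchmark for comprehensively evaluating large models on complex scientific reasoning | large language model, multimodal
6 | C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving | Proposes C-CoT, a counterfactual chain-of-thought framework that uses vision-language models to improve the safety of autonomous driving decisions | chain-of-thought
7 | Personal Visual Context Learning in Large Multimodal Models | Proposes the Personal Visual Context Learning (Personal VCL) framework and an Agentic Context Bank to improve large models' understanding of user-specific visual information | multimodal
8 | Qwen-Image-2.0 Technical Report | Qwen-Image-2.0: proposes an all-round image generation foundation model that unifies high-fidelity generation and precise editing | foundation model, multimodal, instruction following
9 | BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization | Proposes the BGG framework, which bridges the geometric gap between cross-view images by adapting a vision foundation model, improving geo-localization performance | foundation model
10 | ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models | Proposes ViSRA, a training-free video spatial reasoning agent aimed at improving the 3D spatial understanding of multimodal large language models | large language model
11 | TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models | Proposes the TOC-Bench benchmark for evaluating how well video LLMs reason about temporal object consistency | large language model
12 | Count Anything at Any Granularity | Proposes the multi-granularity counting framework HieraCount and the large-scale dataset KubriCount for accurate object counting in the open world | large language model, multimodal
13 | V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning | Proposes the V-ABS framework, which addresses IAO bias in the dynamic visual reasoning of multimodal LLMs through action-observer driven beam search | large language model, multimodal
14 | ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning | Proposes the ERASE framework, which addresses computational redundancy in multimodal LLMs through adaptive two-stage visual token pruning | large language model, multimodal
15 | Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment | Proposes the PRAF-Attack framework, which improves the transferability of black-box attacks on MLLMs through progressive resolution processing and adaptive feature alignment | large language model, multimodal
16 | The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space | Proposes the Polaris-Bench benchmark to expose the reliance of multimodal LLMs on the Cartesian shortcut in visual reasoning | large language model, multimodal
17 | BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation | Proposes the BabelDOC framework, which achieves layout-preserving, high-fidelity PDF translation via an intermediate representation (IR) | multimodal
18 | Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization | Proposes UJEM-KL, an untargeted jailbreak method based on entropy maximization that significantly improves the transferability of attacks on vision-language models | multimodal
19 | AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State | Proposes the AllocMV framework, which achieves efficient music video generation through structured persistent state and by solving a multiple-choice knapsack problem | multimodal
20 | Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination | Proposes the HAVAE intervention strategy, which mitigates LVLM hallucination by identifying and suppressing the "vocabulary hijacking" phenomenon | multimodal
21 | Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence | Proposes the TwNV framework, which enhances the spatial reasoning of large models through generative novel view synthesis | multimodal
22 | Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection | Proposes the Sens-VisualNews benchmark dataset to advance research on detecting sensational content in news images | multimodal
23 | SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation | Proposes the SleepWalk benchmark for stress-testing instruction-guided vision-language navigation and embodied reasoning | multimodal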

🔬 Pillar 3: Perception & Semantics (18 papers)

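# | Title | One-sentence summary | Tags | 🔗
24 | AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting | Proposes AdaptSplat, which improves the geometric fidelity of feed-forward 3D Gaussian splatting through lightweight frequency-preserving adapters | 3D gaussian splatting, 3DGS, gaussian splatting
25 | PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction | Proposes the PaMoSplat framework, which achieves high-fidelity dynamic scene reconstruction through part awareness and motion guidance | 3DGS, gaussian splatting, splatting
26 | TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering | Proposes the TransmissiveGS framework, which achieves high-fidelity reconstruction and rendering of transmissive scenes through residual-guided disentangled Gaussian splatting | gaussian splatting, splatting, scene reconstruction
27 | Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation | Proposes the GraphDepth architecture, which combines a CNN and a GNN for efficient monocular depth estimation | depth estimation, monocular depth, spatial relationship
28 | UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting | Proposes a UAV-based landslide scanning and simulation framework built on physics-informed Gaussian splatting (PIGS) | 3DGS, gaussian splatting, splatting
29 | Neuromorphic Monocular Depth Estimation with Uncertainty Modeling | Proposes a monocular depth estimation method based on neuromorphic vision that improves the reliability of depth prediction through uncertainty modeling | depth estimation, monocular depth
30 | DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions | DySurface: achieves consistent 4D surface reconstruction by bridging explicit Gaussians and implicit functions | 3D gaussian splatting, 3DGS, gaussian splatting
31 | CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation | Proposes CADBench, a multimodal benchmark that systematically evaluates the performance and robustness of AI-assisted CAD program generation | 3D reconstruction, multimodal
32 | Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction | Proposes a rapid forest fuel load estimation method based on virtual remote sensing and metric-scale feed-forward 3D reconstruction | 3D reconstruction, VGGT, geometric consistency
33 | BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry | Proposes BathyFacto, a bathymetry method based on refraction-aware two-media neural radiance fields | NeRF, neural radiance field
34 | SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis | Proposes the SDTalk framework, which uses structured facial priors and dual-branch motion fields for generalizable 3D Gaussian splatting face synthesis | 3D gaussian splatting, 3DGS, gaussian splatting
35 | 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects | Releases the large-scale 3DReflecNet dataset, aimed at the challenge of 3D reconstruction of reflective, transparent, and low-texture objects | 3D reconstruction
36 | GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth | Proposes the GemDepth framework, which achieves high-accuracy, 3D-consistent video depth estimation through geometry-embedded features | depth estimation, geometric consistency
37 | Predicting 3D structure by latent posterior sampling | Proposes a 3D structure prediction method based on latent posterior sampling | 3D reconstruction, NeRF
38 | DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer | Proposes the DetRefiner framework, which achieves model-agnostic refinement of open-vocabulary detection through a feature fusion Transformer | open-vocabulary, open vocabulary
39 | OpenSGA: Efficient 3D Scene Graph Alignment in the Open World | Proposes the OpenSGA framework, which achieves efficient open-world 3D scene graph alignment through multimodal fusion and spatial context | scene understanding
40 | Pixal3D: Pixel-Aligned 3D Generation from Images | Proposes Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity conversion of images into 3D assets | 3D reconstruction
41 | Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning | Proposes the spatial prediction (SP) pre-training task, which strengthens the structured representation ability of self-supervised learning by modeling local geometric relations | depth estimation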

🔬 Pillar 2: RL & Architecture (16 papers)

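# | Title | One-sentence summary | Tags | 🔗
42 | CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving | Proposes CoWorld-VLA, a multi-expert world model framework that strengthens end-to-end planning for autonomous driving through explicit world representations | world model, world models, spatiotemporal
43 | Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection | Proposes Thermal-Det, the first open-vocabulary thermal object detection framework supervised by a large language model | distillation, open-vocabulary, open vocabulary
44 | MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning | Proposes the MTA-RL framework, which achieves robust urban autonomous driving through multi-modal Transformer-based 3D affordances and reinforcement learning | reinforcement learning, reward shaping, scene understanding
45 | Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse | A systematic survey of generalist game agents that builds a "game multiverse" research framework on the path toward artificial general intelligence (AGI) | reinforcement learning, generalist agent, foundation model
46 | Is Your Driving World Model an All-Around Player? | Proposes the WorldLens benchmark and evaluation suite, which comprehensively quantifies the physical and behavioral fidelity of autonomous driving world models | world model, world models, geometric consistency
47 | Developing a foundation model for high-resolution remote sensing data of the Netherlands | Proposes a remote sensing foundation model that combines a CNN and a ViT and learns feature representations efficiently through temporal data augmentation | representation learning, foundation model
48 | Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization | Proposes the Omni-Persona benchmark framework, which systematically evaluates and improves the omnimodal personalization of multimodal LLMs | reward design, large language model, multimodal
49 | DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving | Proposes the DeepSight world model, which achieves long-horizon end-to-end autonomous driving through latent state prediction in BEV space | world model, world models
50 | PhyGround: Benchmarking Physical Reasoning in Generative World Models | Proposes the PhyGround benchmark and the PhyJudge model to systematically evaluate the physical reasoning of generative world models | world model, world models
51 | Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning | Proposes the Sync-R1 framework, which jointly optimizes multimodal personalized understanding and generation through cooperative reinforcement learning | reinforcement learning, multimodal
52 | Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities | An evaluation study of global slum detection and density mapping based on AlphaEarth Foundations representation learning | representation learning, foundation model
53 | Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection | Proposes the HGC-Det framework, which uses hyperbolic geometric constraints for cross-modal distillation in multimodal 3D object detection | distillation, multimodal
54 | Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention | Proposes the Polygon-Mamba network, which improves segmentation accuracy for fine retinal vessels through polygon scanning and space-frequency collaborative attention | Mamba, state space model
55 | Increasing the Efficiency of DETR for Maritime High-Resolution Images | Proposes an efficient DETR object detection method based on ViM and token pruning for maritime high-resolution images | Mamba, SSM, state space model
56 | PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows | Proposes the PixelFlowCast framework, which achieves efficient, high-fidelity, latent-free precipitation nowcasting through pixel-level mean flows | flow matching, spatiotemporal
57 | Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable | Argues that a privacy-utility trade-off is inevitable in life-logging video streams and calls for an end-to-end privacy protection framework | world model, world models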

🔬 Pillar 8: Physics-based Animation (3 papers)

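# | Title | One-sentence summary | Tags | 🔗
58 | iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning | Proposes the iPay integrated multimodal framework, which achieves accurate payment action recognition in in-vehicle scenarios through adaptive spatial prior learning | spatiotemporal, multimodal
59 | EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs | Proposes EchoPrune, which achieves efficient long-video understanding by interpreting redundant video tokens as temporal echoes | spatiotemporal, large language model
60 | SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation | Proposes SocialDirector, a training-free social interaction control framework for multi-person video generation | spatiotemporal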

🔬 Pillar 6: Video Extraction (2 papers)

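# | Title | One-sentence summary | Tags | 🔗
61 | MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery | Proposes the MoPO framework, which addresses human mesh recovery under occlusion by incorporating a motion prior | human mesh recovery, human motion, human motion prediction
62 | EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding | Proposes the EgoMemReason benchmark, which targets the challenge of memory-driven reasoning in long-horizon egocentric video understanding | egocentric, multimodal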

🔬 Pillar 7: Motion Retargeting (1 paper)

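# | Title | One-sentence summary | Tags | 🔗
63 | Geometric 4D Stitching for Grounded 4D Generation | Proposes a Geometric 4D Stitching framework for efficient, geometrically consistent 4D scene generation and expansion | geometric consistency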
