cs.CV (2026-03-24)

📊 55 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (20 🔗6)
Pillar 2: RL & Architecture (11 🔗3)
Pillar 3: Perception & Semantics (9 🔗3)
Pillar 1: Robot Control (5 🔗1)
Pillar 6: Video Extraction (3)
Pillar 5: Interaction & Reaction (2 🔗2)
Pillar 8: Physics-based Animation (2 🔗2)
Pillar 7: Motion Retargeting (2)
Pillar 4: Generative Motion (1)

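🔬 Pillar 9: Embodied Foundation Models (20 papers)

# | Title | One-line summary | Tags | 🔗
1 | ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding | Proposes ENC-Bench, a benchmark for evaluating how well multimodal large language models understand electronic navigational charts. | large language model multimodal symbolic grounding
2 | YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception | Proposes a YOLOv10 built on Kolmogorov-Arnold networks and vision-language models for interpretable object detection and trustworthy multimodal AI. | foundation model multimodal
3 | MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding | MLLM-HWSI: a multimodal large language model for hierarchical whole slide image understanding. | large language model multimodal
4 | ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling | ForestPrune: high-ratio visual token compression for video multimodal large language models via spatial-temporal forest modeling. | large language model multimodal
5 | SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning | SpecEyes: accelerates agentic multimodal LLMs via speculative perception and planning. | large language model multimodal
6 | GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning | GeoTikzBridge: strengthens the geometric perception and reasoning of multimodal large models through TikZ code generation. | large language model multimodal
7 | 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding | Proposes 3DCity-LLM, which empowers multimodal large language models for 3D city-scale perception and understanding. | large language model
8 | Multimodal Industrial Anomaly Detection via Geometric Prior | Proposes a multimodal industrial anomaly detection network based on geometric priors, improving detection accuracy for defects with complex geometry. | multimodal
9 | UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation | UniFunc3D: a unified active spatial-temporal grounding framework for 3D functionality segmentation. | large language model multimodal
10 | ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting | ViKey: enhances the temporal understanding of video large language models via visual prompting. | large language model multimodal
11 | SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions | Proposes SMSP, a multi-scale perception strategy that improves the ability of MLLMs to recognize visual illusions. | large language model multimodal
12 | Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps | Cog3DMap: multi-view vision-language reasoning with 3D cognitive maps. | large language model multimodal
13 | ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance | ForeSea: an AI forensic search system with multimodal queries for video surveillance. | large language model multimodal
14 | Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding | PinPoint: focuses rather than prunes, identifying instruction-relevant regions in information-dense images to improve vision-language model efficiency. | large language model multimodal
15 | Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models | Know3D: prompts 3D generation with knowledge from vision-language models, enabling controllable back-view generation. | large language model multimodal
16 | OccAny: Generalized Unconstrained Urban 3D Occupancy | OccAny: the first generalized, unconstrained urban 3D occupancy prediction model, improving generalization and geometric completion. | foundation model
17 | DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection | DetPO: few-shot object detection via in-context learning with multimodal LLMs, improving generalization. | visual grounding
18 | AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection | AgentFoX: an LLM agent-guided fusion framework with explainability for AI-generated image detection. | large language model
19 | SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts | SOUPLE: enhances audio-visual localization and segmentation with learnable prompt contexts. | multimodal
20 | Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth | Proposes the Think 360° benchmark for evaluating the reasoning width of multimodal large models. | multimodal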

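🔬 Pillar 2: RL & Architecture (11 papers)

# | Title | One-line summary | Tags | 🔗
21 | Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought | Proposes Perception-Exploration Policy Optimization (PEPO), which improves the balance between visual grounding and exploratory reasoning in multimodal CoT reasoning. | reinforcement learning multimodal visual grounding
22 | PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding | PhotoAgent: a robotic photographer that combines spatial and aesthetic understanding. | world model 3D gaussian splatting 3DGS
23 | EVA: Efficient Reinforcement Learning for End-to-End Video Agent | Proposes EVA, an efficient reinforcement-learning-based end-to-end video agent for long-video understanding. | reinforcement learning large language model multimodal
24 | Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference | Proposes the SpaHGC model, which improves the accuracy of spatial gene expression inference through cross-slice knowledge transfer. | contrastive learning spatial relationship foundation model
25 | UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation | Proposes UniGRPO, a unified policy optimization method for reasoning-driven visual generation. | reinforcement learning flow matching classifier-free guidance
26 | Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning | Proposes a Mamba-based MRI-to-CT synthesis method for MRI-only radiotherapy planning. | Mamba geometric consistency
27 | Conformal Cross-Modal Active Learning | Proposes CCMA, which uses multimodal knowledge to improve the data efficiency of visual active learning. | teacher-student foundation model multimodal
28 | WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG | Proposes WildWorld, a large-scale action-conditioned dataset for dynamic world modeling, aimed at generative ARPGs. | reinforcement learning world model
29 | FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning | Proposes the FCL-COD framework, which tackles weakly supervised camouflaged object detection with frequency-aware and contrastive learning. | representation learning contrastive learning
30 | Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression | Proposes Ker-VLJEPA-3B, which generates 3D CT reports via curriculum learning and language-free visual grafting. | curriculum learning large language model
31 | Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation | Proposes the DDSR model, which addresses black-box domain adaptation through dual-teacher distillation and subnetwork rectification. | distillation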

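🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | One-line summary | Tags | 🔗
32 | Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors | Proposes PFGS360 for pose-free 3D Gaussian reconstruction and high-quality novel view synthesis from panoramic videos. | monocular depth 3D gaussian splatting 3DGS
33 | Predictive Photometric Uncertainty in Gaussian Splatting for Novel View Synthesis | Proposes a framework for predicting photometric uncertainty in Gaussian splatting for novel view synthesis, improving the reliability of spatial maps. | 3D gaussian splatting gaussian splatting splatting
34 | Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation | Proposes GLA-CLIP, a global-local aligned CLIP model for training-free open-vocabulary semantic segmentation. | open-vocabulary open vocabulary
35 | Generative Event Pretraining with Foundation Model Alignment | Proposes GEP: generative event pretraining through alignment with vision foundation models. | depth estimation foundation model
36 | DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models | DA-Flow: degradation-aware optical flow estimation based on diffusion models, improving robustness in real-world scenes. | optical flow
37 | UniQueR: Unified Query-based Feedforward 3D Reconstruction | UniQueR: a unified query-based feedforward framework for efficient and accurate 3D reconstruction. | NeRF VGGT
38 | One View Is Enough! Monocular Training for In-the-Wild Novel View Generation | Proposes OVIE, which trains on single-view images only to generate novel views of in-the-wild scenes. | monocular depth
39 | SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes | SLARM: a streaming, language-aligned reconstruction model for dynamic scenes. | scene reconstruction
40 | Group Editing : Edit Multiple Images in One Go | Proposes the GroupEditing framework for consistently editing a group of related images. | VGGT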

🔬 Pillar 1: Robot Control (5 papers)

# | Title | One-line summary | Tags | 🔗
41 | ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment | ABot-PhysWorld: a physics-aligned interactive world foundation model for robotic manipulation. | manipulation DPO world model
42 | TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation | TETO: tracks events with teacher observation for motion estimation and frame interpolation. | sim-to-real teacher-student distillation
43 | VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models | VLA-IAP: training-free visual token pruning via interaction alignment, accelerating VLA model inference. | manipulation vision-language-action VLA
44 | Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation | Proposes a gaze-regularized VLA model that improves performance on robotic manipulation tasks. | manipulation vision-language-action VLA
45 | RealMaster: Lifting Rendered Scenes into Photorealistic Video | RealMaster: uses video diffusion models to lift rendered scenes into photorealistic video. | sim-to-real

🔬 Pillar 6: Video Extraction (3 papers)

# | Title | One-line summary | Tags | 🔗
46 | Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning | TRACE: uses text to guide multimodal large models in 3D spatial reasoning. | egocentric large language model multimodal
47 | GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field | GSwap: realistic head swapping with a dynamic neural Gaussian field. | SMPL SMPL-X
48 | Gaze-Regularized VLMs for Ego-Centric Behavior Understanding | Proposes gaze-regularized VLMs to improve egocentric behavior understanding. | egocentric

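🔬 Pillar 5: Interaction & Reaction (2 papers)

# | Title | One-line summary | Tags | 🔗
49 | InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance | Proposes the InterDyad framework for interactive dyadic speech-to-video generation by querying intermediate visual guidance. | two-person interaction dyadic interaction large language model
50 | A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection | Proposes a feature shuffling and restoration strategy for universal unsupervised anomaly detection. | ReMoS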

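🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-line summary | Tags | 🔗
51 | WaveSFNet: A Wavelet-Based Codec and Spatial-Frequency Dual-Domain Gating Network for Spatiotemporal Prediction | WaveSFNet: a wavelet-based codec and spatial-frequency dual-domain gating network for spatiotemporal prediction. | spatiotemporal
52 | MultiCam: On-the-fly Multi-Camera Pose Estimation Using Spatiotemporal Overlaps of Known Objects | Proposes MultiCam, which uses spatiotemporal overlaps of known objects for on-the-fly multi-camera pose estimation. | spatiotemporal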

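🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags | 🔗
53 | A Synchronized Audio-Visual Multi-View Capture System | Proposes a synchronized audio-visual multi-view capture system for fine-grained analysis of conversational behavior. | human motion
54 | Object Pose Transformer: Unifying Unseen Object Pose Estimation | Object Pose Transformer: a unified framework for unseen object pose estimation. | geometric consistency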

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
55 | SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM | SIMART: decomposes monolithic meshes into sim-ready articulated assets via an MLLM. | VQ-VAE embodied AI
