cs.CV(2026-03-12)

📊 68 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗4)
Pillar 2: RL & Architecture (19 🔗3)
Pillar 3: Perception & Semantics (13 🔗3)
Pillar 1: Robot Control (7)
Pillar 4: Generative Motion (3 🔗1)
Pillar 7: Motion Retargeting (2)
Pillar 8: Physics-based Animation (1 🔗1)
Pillar 6: Video Extraction (1)

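🔬 Pillar 9: Embodied Foundation Models (22 papers)

# | Title | One-line summary | Tags | 🔗
1 | Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models | Proposes the Think While Watching framework to address long-range dependency modeling in multi-turn reasoning over online video streams in MLLMs. | large language model, multimodal, chain-of-thought
2 | Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans | Proposes HouseMind, which uses tokenization to unify a multimodal large language model for understanding, generating, and editing architectural floor plans. | large language model, multimodal
3 | EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models | EndoCoT: a scalable endogenous chain-of-thought reasoning framework for diffusion models. | large language model, multimodal, chain-of-thought
4 | CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation | Proposes CrossEarth-SAR, a billion-scale SAR geospatial foundation model for domain-generalizable semantic segmentation. | foundation model
5 | Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning | Proposes RICE-NET, which uses multimodal deep learning to distinguish postoperative tumor recurrence from radiation-induced injury in glioblastoma. | multimodal
6 | Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography | Proposes SegAnyPET, a foundation model for universal segmentation of 3D whole-body PET images. | foundation model
7 | Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception | Proposes an image aesthetics enhancement method based on dual-conditioned diffusion models, using multimodal perception to improve results. | multimodal
8 | MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning | MM-CondChain: a verifiable benchmark for visually grounded deep compositional reasoning. | large language model, multimodal
9 | Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | Proposes a video streaming thinking mechanism to address response latency in video understanding. | large language model, chain-of-thought
10 | EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation | EvoTok: unifies the image tokenizer via residual latent evolution to support both visual understanding and generation. | large language model, multimodal
11 | ZeroSense:How Vision matters in Long Context Compression | Proposes the ZeroSense benchmark, which decouples MLLM capabilities to evaluate visual text compression quality more accurately. | large language model, multimodal
12 | GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing | GRADE: proposes the first discipline-knowledge-driven image editing benchmark to evaluate the reasoning ability of multimodal models. | multimodal
13 | HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios | Proposes the HomeSafe-Bench benchmark for evaluating vision-language models on detecting unsafe actions in household scenarios. | multimodal
14 | Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder | Proposes a mechanistic fairness auditing method that locates bias at the attention-head level in CLIP's vision encoder. | foundation model
15 | BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder | BackdoorIDS: a zero-shot backdoor detection method for pretrained vision encoders. | multimodal
16 | Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild | Proposes Shape-of-You to address semantic correspondence on unannotated images. | foundation model
17 | INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs | INFACT: a benchmark for diagnosing induced faithfulness and factuality hallucinations in Video-LLMs. | large language model
18 | Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning | Proposes a dense video captioning method based on role-specific queries and an overlap suppression loss, improving localization accuracy and caption quality. | multimodal
19 | Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning | Proposes the Goal-Driven Data Optimization (GDO) framework, which accelerates convergence of multimodal instruction tuning and improves accuracy. | multimodal
20 | SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs | SPARROW: a pixel-grounded video MLLM that learns spatial precision and temporal referential consistency. | large language model, multimodal, visual grounding
21 | Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization | Proposes GenEval, a human-knowledge-integrated multimodal learning approach for single-source domain generalization. | multimodal
22 | HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios | Proposes HomeSafe-Bench to evaluate vision-language models on unsafe action detection for embodied agents in household scenarios. | multimodal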

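🔬 Pillar 2: RL & Architecture (19 papers)

# | Title | One-line summary | Tags | 🔗
23 | Mobile-GS: Real-time Gaussian Splatting for Mobile Devices | Mobile-GS: high-quality real-time Gaussian splatting rendering for mobile devices. | distillation, 3D gaussian splatting, 3DGS
24 | LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning | LatentGeo: improves multimodal geometric reasoning through learnable auxiliary constructions in latent space. | reinforcement learning, spatial relationship, large language model
25 | Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling | Proposes a multimodal emotion recognition framework based on bi-directional cross-attention and temporal modeling, improving emotion recognition on in-the-wild videos. | representation learning, motion prediction, multimodal
26 | O3N: Omnidirectional Open-Vocabulary Occupancy Prediction | O3N: an open-vocabulary 3D occupancy prediction framework for omnidirectional (panoramic) input. | world model, Mamba, open-vocabulary
27 | Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D | Hoi3DGen: generates high-quality 3D human-object interaction models, significantly improving text consistency and model quality. | distillation, human-object interaction, large language model
28 | EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next | EgoIntent: a step-level benchmark for understanding intent in egocentric videos. | imitation learning, egocentric, large language model
29 | IDRL: An Individual-Aware Multimodal Depression-Related Representation Learning Framework for Depression Diagnosis | Proposes the IDRL framework to address individual differences and modality inconsistency in multimodal depression diagnosis. | representation learning, multimodal
30 | Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation | FIRM: faithful image editing and generation via robust reward modeling and reinforcement learning. | reinforcement learning, instruction following
31 | Risk-Controllable Multi-View Diffusion for Driving Scenario Generation | Proposes RiskMV-DPO for risk-controllable multi-view driving scenario generation. | DPO, direct preference optimization, world model
32 | Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding | Proposes the R-MSD framework, which improves the reliability of LVLMs in video understanding through multi-sample distillation. | distillation, multimodal
33 | SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning | Proposes the SVLL framework to address temporal binding and physical constraint violations of vision-language models in embodied task planning. | reinforcement learning, DPO, direct preference optimization
34 | Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing | AutoGaze: efficient and scalable video understanding via autoregressive gazing. | reinforcement learning, spatiotemporal, large language model
35 | Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection | DART: a real-time, training-free general object detection framework that significantly accelerates SAM3 inference. | distillation, open-vocabulary, open vocabulary
36 | DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning | DreamVideo-Omni: omni-motion-controlled multi-subject video customization based on latent identity reinforcement learning. | reinforcement learning
37 | Linking Perception, Confidence and Accuracy in MLLMs | Proposes confidence-driven reinforcement learning and test-time scaling to address confidence calibration in multimodal large language models. | reinforcement learning, large language model
38 | Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding | FutureCAD: proposes a high-fidelity CAD generation framework based on LLM-driven program generation and text-based B-Rep primitive grounding. | reinforcement learning, large language model
39 | InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model | InSpatio-WorldFM: an open-source real-time generative frame model enabling low-latency spatial intelligence. | world model, distillation
40 | Unleashing Video Language Models for Fine-grained HRCT Report Generation | Proposes the AbSteering framework, which uses video language models for fine-grained HRCT report generation. | direct preference optimization, foundation model, chain-of-thought
41 | CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning | CalliMaster: page-level Chinese calligraphy generation via layout-guided spatial planning. | flow matching, multimodal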

🔬 Pillar 3: Perception & Semantics (13 papers)

# | Title | One-line summary | Tags | 🔗
42 | MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation | MV-SAM3D: adaptive multi-view fusion for layout-aware 3D generation. | sam 3D, SAM 3D, physically plausible
43 | SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation | SceneAssistant: a visual feedback agent for open-vocabulary 3D scene generation. | open-vocabulary, open vocabulary, spatial relationship
44 | Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos | Proposes a two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multi-view videos. | visual SLAM, scene reconstruction, optical flow
45 | AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies | AstroSplat: physics-based Gaussian splatting for rendering and reconstruction of small celestial bodies. | gaussian splatting, splatting
46 | Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting | Mango-GS: uses multi-frame node-guided 4D Gaussian splatting to enhance spatio-temporal consistency in dynamic scene reconstruction. | gaussian splatting, splatting
47 | Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs | Node-RF: learns generalized continuous space-time scene dynamics with neural-ODE-based NeRFs. | NeRF, neural radiance field, spatiotemporal
48 | DVD: Deterministic Video Depth Estimation with Generative Priors | DVD: deterministic video depth estimation using generative priors. | depth estimation, foundation model
49 | MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks | MANSION: proposes a multi-floor language-to-3D scene generation framework for long-horizon tasks. | open-vocabulary, open vocabulary
50 | NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction | Proposes NBAvatar, which uses neural rendering to produce head avatars with realistic hand-face interaction. | implicit representation
51 | CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing | CEI-3D: collaborative explicit-implicit 3D reconstruction for realistic, fine-grained object editing. | implicit representation
52 | Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation | Proposes Surg-R1 to address interpretability in surgical decision support. | scene understanding, foundation model, chain-of-thought
53 | ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection | Proposes ABRA to address cross-domain knowledge transfer in open-vocabulary object detection. | open-vocabulary, open vocabulary
54 | Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs | Node-RF: learns generalized continuous space-time scene dynamics with neural-ODE-based NeRFs. | NeRF, neural radiance field, spatiotemporal

🔬 Pillar 1: Robot Control (7 papers)

# | Title | One-line summary | Tags | 🔗
55 | Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints | Proposes a controllable egocentric video generation method based on occlusion-aware sparse 3D hand joints. | humanoid, egocentric, cross-embodiment
56 | Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation | Ada3Drift: one-step 3D visuomotor robotic manipulation via adaptive training-time drifting. | manipulation, flow matching, multimodal
57 | Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary | DORI: a cognitively grounded benchmark that reveals systematic failures of MLLMs in understanding object orientation. | manipulation, scene reconstruction, scene understanding
58 | OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams | OmniStream: proposes a unified streaming visual backbone that jointly models perception, reconstruction, and action. | manipulation, representation learning, spatiotemporal
59 | ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models | Proposes ForensicZip to address the computational cost of high-resolution image forensics. | manipulation, large language model, multimodal
60 | OSCBench: Benchmarking Object State Change in Text-to-Video Generation | Proposes OSCBench to address object state change in text-to-video generation. | OSC, large language model, multimodal
61 | WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing | WeEdit: proposes a glyph-guided framework for text-centric image editing and builds a large-scale dataset and benchmark. | manipulation, reinforcement learning

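🔬 Pillar 4: Generative Motion (3 papers)

# | Title | One-line summary | Tags | 🔗
62 | LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference | LaMoGen: language-to-motion generation via LLM-guided symbolic inference. | motion synthesis, motion generation, human motion
63 | Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints | Articulat3D: proposes geometric and motion constraints to reconstruct articulated digital twins from monocular videos. | physically plausible
64 | Manifold-Optimal Guidance: A Unified Riemannian Control View of Diffusion Guidance | Proposes the Manifold-Optimal Guidance (MOG) framework to address oversaturation and artifacts in conditional guidance for diffusion models. | classifier-free guidance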

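🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags | 🔗
65 | Pano360: Perspective to Panoramic Vision with Geometric Consistency | Proposes Pano360, a perspective-to-panoramic vision method based on geometric consistency that improves panorama stitching quality. | geometric consistency
66 | Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments | Evaluates the suitability of RGB-NIR image registration techniques for off-road forestry environments. | geometric consistency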

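🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
67 | Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training | Proposes Spatial-TTT, which achieves spatial intelligence from video streams through test-time training. | spatiotemporal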

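🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags | 🔗
68 | Revisiting Model Stitching In the Foundation Model Era | Revisits model stitching in the foundation model era, enabling effective integration of heterogeneous vision foundation models. | feature matching, foundation model, multimodal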
