cs.CV (2026-03-10)

📊 60 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (21 🔗2) · Pillar 9: Embodied Foundation Models (13 🔗4) · Pillar 3: Spatial Perception & Semantics (12 🔗3) · Pillar 4: Generative Motion (5) · Pillar 6: Video Extraction & Matching (3 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 1: Robot Control (2 🔗1) · Pillar 5: Interaction & Reaction (2 🔗1)

🔬 Pillar 2: RL Algorithms & Architecture (21 papers)

# | Title | One-line Summary | Tags | 🔗
1 | GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models | Proposes GST-VLA to address geometric structure in 3D depth-aware vision-language-action models. | flow matching, affordance, vision-language-action
2 | GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System | Proposes GSStream, a volumetric scene streaming system based on 3D Gaussian splatting, optimized for bandwidth usage. | reinforcement learning, deep reinforcement learning, DRL
3 | OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models | OddGridBench exposes the weakness of multimodal large models in perceiving fine-grained visual discrepancies. | reinforcement learning, curriculum learning, reward design
4 | EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation | EventVGGT explores cross-modal distillation for consistent event-camera depth estimation. | distillation, depth estimation, monocular depth
5 | Multimodal Graph Representation Learning with Dynamic Information Pathways | Proposes a multimodal graph representation learning framework based on dynamic information pathways, improving learning on heterogeneous graph data. | representation learning, multimodal
6 | Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities | Proposes the PRLF framework to resolve feature misalignment caused by missing modalities in multimodal sentiment analysis. | representation learning, multimodal
7 | AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering | Proposes AutoViVQA, a large-scale automatically constructed dataset for Vietnamese visual question answering. | representation learning, visual pre-training, large language model
8 | Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization | Proposes a reinforcement-learning-based post-training strategy that enables interleaved generation in unified multimodal models. | reinforcement learning, multimodal
9 | Progressive Split Mamba: Effective State Space Modelling for Image Restoration | Proposes Progressive Split Mamba, effectively addressing long-range dependency modeling in image restoration. | Mamba, SSM, state space model
10 | From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding | C2FMAE: proposes coarse-to-fine masked autoencoders for hierarchical visual understanding. | masked autoencoder, contrastive learning, visual pre-training
11 | Streaming Autoregressive Video Generation via Diagonal Distillation | Proposes diagonal distillation to accelerate autoregressive video generation for real-time streaming. | distillation, optical flow
12 | Decoder-Free Distillation for Quantized Image Restoration | Proposes the QDR framework, improving quantized image restoration via decoder-free distillation and learnable weights. | teacher-student distillation
13 | Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking | MDTrack: proposes modality-aware fusion and decoupled temporal propagation for multi-modal object tracking. | SSM, state space model, multimodal
14 | UniField: A Unified Field-Aware MRI Enhancement Framework | Proposes the UniField unified framework, leveraging multi-field-strength MRI data to improve enhancement quality and generalization. | flow matching, representation learning, foundation model
15 | RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning | RubiCap: a rubric-guided reinforcement learning method for dense image captioning. | reinforcement learning, distillation
16 | WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion | Proposes WS-Net to address weak-signal hyperspectral unmixing. | Mamba, representation learning
17 | RAE-NWM: Navigation World Model in Dense Visual Representation Space | Proposes RAE-NWM to model state evolution in visual navigation. | world model
18 | M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition | Proposes the M3GCLR framework, improving skeleton-based action recognition via multi-view adversarial contrastive learning. | contrastive learning
19 | ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph | ForgeDreamer proposes multi-expert LoRA and cross-view hypergraphs to tackle industrial-grade text-to-3D generation. | dreamer
20 | IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework | Proposes the IntroSVG framework, improving text-to-SVG generation quality through introspective generator-critic learning. | DPO, direct preference optimization
21 | RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation | Proposes RTFDNet, achieving robust RGB-T semantic segmentation via fusion-decoupling. | teacher-student distillation

🔬 Pillar 9: Embodied Foundation Models (13 papers)

# | Title | One-line Summary | Tags | 🔗
22 | InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing | InternVL-U: proposes a lightweight unified multimodal model covering understanding, reasoning, generation, and editing. | large language model, multimodal, chain-of-thought
23 | TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy | Proposes TubeMLLM, a unified foundation model for topology knowledge exploration in vessel-like anatomy. | large language model, foundation model, multimodal
24 | Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation | ACADiff: an adaptive clinical-aware latent diffusion model for multimodal brain image generation and missing-modality imputation. | multimodal
25 | MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities | MissBench: a benchmark for multimodal affective analysis under imbalanced missing modalities. | multimodal
26 | FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis | Proposes FetalAgents, a multi-agent system for fetal ultrasound image and video analysis. | large language model, foundation model, multimodal
27 | Point Cloud as a Foreign Language for Multi-modal Large Language Model | Proposes SAGE, the first end-to-end 3D multimodal large language model operating directly on raw point clouds. | large language model
28 | MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data | MM-Zero: the first framework for self-evolving multi-model vision-language models from zero data. | large language model, multimodal
29 | QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model | QUSR: a quality-aware and uncertainty-guided image super-resolution diffusion model for real-world scenarios. | large language model, multimodal
30 | Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos | Proposes a vision-and-language navigation framework learned from web videos, using implicit geometry representations to improve navigation performance. | VLN
31 | WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition | WikiCLIP: an efficient contrastive baseline for open-domain visual entity recognition. | large language model
32 | Ego: Embedding-Guided Personalization of Vision-Language Models | Proposes an efficient personalization method to improve the user experience of vision-language models. | multimodal
33 | Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments | Proposes the STAR benchmark, evaluating the strategic reasoning and rapid decision-making of LLMs in zero-sum environments. | large language model
34 | OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing | OmniEdit: a training-free framework for lip synchronization and audio-visual editing. | multimodal

🔬 Pillar 3: Spatial Perception & Semantics (12 papers)

# | Title | One-line Summary | Tags | 🔗
35 | X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models | X-GS: an open framework unifying 3DGS architectures with downstream multimodal models, enabling semantics-enhanced real-time SLAM. | 3D gaussian splatting, 3DGS, gaussian splatting
36 | DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction | DenoiseSplat: a feed-forward Gaussian splatting method for noisy 3D scene reconstruction. | 3D gaussian splatting, gaussian splatting, splatting
37 | ProGS: Towards Progressive Coding for 3D Gaussian Splatting | ProGS: progressive coding for 3D Gaussian splatting, improving compression efficiency and visual fidelity. | 3D gaussian splatting, 3DGS, gaussian splatting
38 | VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM | VarSplat: uncertainty-aware 3D Gaussian splatting for robust RGB-D SLAM. | 3D gaussian splatting, 3DGS, gaussian splatting
39 | ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare | ReCoSplat: autoregressive feed-forward Gaussian splatting using render-and-compare, for online novel view synthesis. | gaussian splatting, splatting, scene reconstruction
40 | Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists | Proposes shortening Gaussian lists to accelerate radiance-field learning with 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting
41 | PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments | PanoAffordanceNet: holistic affordance grounding in 360° indoor environments. | affordance
42 | OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty | Proposes a robust visual-inertial odometry method with optimal-transport line association and adaptive uncertainty. | VIO
43 | DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics | DiffWind: a physics-informed differentiable framework for modeling wind-driven object dynamics. | 3D gaussian splatting, gaussian splatting, splatting
44 | SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation | Proposes SpaceSense-Bench, a large-scale multimodal dataset for spacecraft perception and pose estimation. | depth estimation, monocular depth
45 | RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding | Proposes the RA-SSU task and the SSUFormer model for fine-grained audio-visual scene understanding. | scene understanding
46 | More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | Proposes panorama-language models to address insufficient multi-view understanding. | scene understanding

🔬 Pillar 4: Generative Motion (5 papers)

# | Title | One-line Summary | Tags | 🔗
47 | ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis | ParTY: expressive text-to-motion synthesis via part guidance. | text-to-motion, motion synthesis, motion generation
48 | Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction | Proposes fine-grained motion retrieval via joint-angle motion images and token-patch late interaction. | text-to-motion, motion retrieval, human motion
49 | Chain of Event-Centric Causal Thought for Physically Plausible Video Generation | Proposes an event-centric causal chain-of-thought framework for physically plausible video generation. | physically plausible, large language model, chain-of-thought
50 | Training-free Motion Factorization for Compositional Video Generation | Proposes a training-free motion factorization framework for compositional video generation, improving motion diversity. | motion synthesis
51 | When to Lock Attention: Training-Free KV Control in Video Diffusion | Proposes KV-Lock, a training-free KV control method for video diffusion models that improves foreground quality while preserving background consistency. | classifier-free guidance

🔬 Pillar 6: Video Extraction & Matching (3 papers)

# | Title | One-line Summary | Tags | 🔗
52 | EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning | Proposes the EXPLORE-Bench benchmark, evaluating the long-horizon reasoning of MLLMs in egocentric scene prediction. | egocentric, large language model, multimodal
53 | MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents | Proposes the MA-EgoQA benchmark for question answering over egocentric videos in multi-agent embodied environments. | egocentric, embodied AI
54 | Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency | Proposes DCPGN for test-time viewpoint-adaptive action anticipation, improving human-robot collaboration efficiency. | egocentric

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line Summary | Tags | 🔗
55 | Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture | FootMR: improves 3D foot motion reconstruction in monocular human motion capture using 2D keypoints. | human motion, motion reconstruction
56 | Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports | Proposes the CourtSI dataset and benchmark, evaluating and improving the spatial intelligence of VLMs in sports scenarios. | human motion

🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-line Summary | Tags | 🔗
57 | EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation | EvoDriveVLA: evolves an autonomous-driving vision-language-action model via collaborative perception-planning distillation. | trajectory optimization, distillation, vision-language-action
58 | When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection | Proposes a geometry-semantics decoupling (GSD) module, improving the generalization of AI-generated image detection. | manipulation, foundation model

🔬 Pillar 5: Interaction & Reaction (2 papers)

# | Title | One-line Summary | Tags | 🔗
59 | DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary | Proposes the DISPLAY framework to tackle controllable human-object interaction video generation. | human-object interaction, HOI
60 | ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios | ENIGMA-360: an ego-exo viewpoint dataset for human behavior understanding in industrial scenarios. | human-object interaction, egocentric
