cs.CV (2026-03-09)

📊 51 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (14 🔗3) · Pillar 3: Perception & Semantics (11 🔗1) · Pillar 9: Embodied Foundation Models (10 🔗1) · Pillar 1: Robot Control (5 🔗2) · Pillar 7: Motion Retargeting (5 🔗1) · Pillar 8: Physics-based Animation (4 🔗1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 2: RL & Architecture (14 papers)

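| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving | Proposes SAMoE-VLA, which uses a scene-adaptive MoE to improve the performance and safety of VLA models for autonomous driving. | world model, vision-language-action, VLA | |
| 2 | Toward Unified Multimodal Representation Learning for Autonomous Driving | Proposes a contrastive tensor pre-training framework for unified multimodal representation learning in autonomous driving. | representation learning, contrastive learning, scene understanding | |
| 3 | SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation | Proposes SGG-R$^{\rm 3}$ to address bias and sparsity in scene graph generation. | reinforcement learning, large language model, multimodal | |
| 4 | MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models | MINT: molecularly informed training of pathology foundation models with spatial transcriptomics supervision. | distillation, foundation model | |
| 5 | Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model | Proposes MambaDance, built on Mamba and a diffusion model, to address temporal modeling and beat synchronization in dance generation. | Mamba, human motion | |
| 6 | Geometric Transformation-Embedded Mamba for Learned Video Compression | Proposes a geometric transformation-embedded Mamba model to improve learned video compression. | Mamba, motion estimation | |
| 7 | BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images | BuildMamba: a visual state-space model for multi-task building segmentation and height estimation from satellite images. | Mamba, monocular depth | |
| 8 | It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models | Proposes TickTockVQA to address the difficulty vision-language models have with reading analog clocks. | DPO, direct preference optimization, multimodal | |
| 9 | ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation | ER-Pose: rethinks keypoint-driven single-stage human pose estimation to improve accuracy and efficiency. | representation learning | |
| 10 | SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents | SPIRAL: a closed-loop framework for self-improving action world models via reflective planning agents. | world model | |
| 11 | SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval | Proposes SAVE, which improves video-text retrieval through speech-aware video representation learning. | representation learning | |
| 12 | MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data | MM-TS: dynamic temperature and margin schedules for multimodal contrastive learning on long-tail data. | contrastive learning | |
| 13 | ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning | ImageEdit-R1: a reinforcement-learning-driven multi-agent image editing framework. | reinforcement learning | |
| 14 | Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared | Proposes a dictionary-guided cross-modal image fusion framework for fusion when the infrared image is missing. | representation learning, large language model | |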

🔬 Pillar 3: Perception & Semantics (11 papers)

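| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting | ImprovedGS+: a C++/CUDA re-implementation that significantly improves the training speed and quality of 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 16 | HDR-NSFF: High Dynamic Range Neural Scene Flow Fields | Proposes HDR-NSFF for reconstructing dynamic high-dynamic-range scenes from monocular alternating-exposure video. | gaussian splatting, splatting, neural radiance field | |
| 17 | DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving | DynamicVGGT: learns dynamic point maps for 4D scene reconstruction in autonomous driving. | 3D gaussian splatting, gaussian splatting, splatting | |
| 18 | Improving Continual Learning for Gaussian Splatting based Environments Reconstruction on Commercial Off-the-Shelf Edge Devices | Proposes a precision-adaptive optimization framework that enables continual learning for Gaussian Splatting environment reconstruction on edge devices. | 3DGS, gaussian splatting, splatting | |
| 19 | ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation | Proposes the ViSA framework, which enhances visual-spatial reasoning to improve UAV vision-language navigation. | open-vocabulary, open vocabulary, VLN | |
| 20 | FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection | FOMO-3D: uses vision foundation models to tackle long-tailed 3D object detection. | Metric3D, foundation model | |
| 21 | Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors | Proposes an alignment-aware, reliability-gated multimodal fusion method to improve UAV detection across heterogeneous thermal-visual sensors. | optical flow, multimodal | |
| 22 | OSCAR: Occupancy-based Shape Completion via Acoustic Neural Implicit Representations | Proposes a method based on acoustic neural implicit representations for completing vertebral ultrasound images, addressing occlusion and signal variation. | implicit representation | |
| 23 | Fast Low-light Enhancement and Deblurring for 3D Dark Scenes | FLED-GS: a framework for fast low-light enhancement and deblurring in 3D dark-scene reconstruction. | 3DGS, NeRF | |
| 24 | Event-based Motion & Appearance Fusion for 6D Object Pose Tracking | Proposes a 6D object pose tracking method that fuses event-camera motion and appearance cues, suited to highly dynamic scenes. | optical flow | |
| 25 | Speed3R: Sparse Feed-forward 3D Reconstruction Models | Speed3R: sparse feed-forward 3D reconstruction models that significantly speed up reconstruction. | VGGT | |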

🔬 Pillar 9: Embodied Foundation Models (10 papers)

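| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models | AutoTraces: autoregressive trajectory forecasting with multimodal large language models, suited to environments shared by humans and robots. | large language model, multimodal, chain-of-thought | |
| 27 | Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models | Uses multimodal large language models to synthesize defect images and improve power line insulator inspection. | large language model, multimodal | |
| 28 | MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals | MERLIN: builds low-SNR-robust multimodal LLMs for electromagnetic signal processing. | large language model, multimodal | |
| 29 | AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition | AULLM++: uses structural reasoning with large language models for micro-expression recognition. | large language model | |
| 30 | SiMO: Single-Modality-Operable Multimodal Collaborative Perception | Proposes SiMO to address performance degradation in multimodal collaborative perception when a modality fails. | multimodal | |
| 31 | Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations | Proposes GR3D to strengthen MLLM spatial reasoning in geometrically referenced 3D scenes. | large language model, multimodal | |
| 32 | SecAgent: Efficient Mobile GUI Agent with Semantic Context | SecAgent: an efficient mobile GUI agent based on semantic context. | large language model, multimodal | |
| 33 | Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images | Proposes Visual Self-Fulfilling Alignment (VSFA), which shapes safety-oriented multimodal large models through threat-related images. | large language model, multimodal | |
| 34 | From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation | Proposes a map-based AI approach that uses fine-tuned LLMs for semantic zone inference to improve ObjectNav performance. | large language model | |
| 35 | Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis | Compares direct image synthesis with TikZ code generation for producing automata diagrams, with the aim of supporting computer science teaching. | large language model | |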

🔬 Pillar 1: Robot Control (5 papers)

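| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 36 | $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation | Proposes $Δ$VLA, a VLA model guided by world-knowledge variation priors, to improve robot manipulation performance. | manipulation, VQ-VAE, vision-language-action | |
| 37 | TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size | TeamHOI: learns a unified policy for cooperative human-object interaction with any number of agents. | humanoid, humanoid control, physically plausible | |
| 38 | Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction | Proposes Spherical-GOF to address geometric inconsistency in 3D reconstruction from panoramic images. | quadruped, 3D gaussian splatting, 3DGS | |
| 39 | Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations | Uses minimal recognizable regions to study how humans and AI differ in egocentric action recognition. | manipulation, egocentric, spatiotemporal | |
| 40 | X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection | Proposes X-AVDT, which uses audio-visual cross-attention for robust deepfake detection. | manipulation, flow matching, multimodal | |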

🔬 Pillar 7: Motion Retargeting (5 papers)

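| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 41 | Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades | Proposes a controllable framework for generating complex human motion video via text-to-skeleton cascades. | human motion | |
| 42 | Fusion-Poly: A Polyhedral Framework Based on Spatial-Temporal Fusion for 3D Multi-Object Tracking | Fusion-Poly: a polyhedral framework for 3D multi-object tracking based on spatial-temporal fusion. | motion prediction, TAMP | |
| 43 | TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization | TrianguLang: proposes a pose-free 3D localization method based on geometry-aware semantic consensus. | geometric consistency, embodied AI | |
| 44 | Talking Together: Synthesizing Co-Located 3D Conversations from Audio | Proposes a new method for synthesizing co-located 3D conversation animation. | spatial relationship | |
| 45 | VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion | VSDiffusion: addresses ill-posed shadow generation with a visibility-constrained diffusion model. | geometric consistency | |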

🔬 Pillar 8: Physics-based Animation (4 papers)

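| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 46 | Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout | Proposes a robust multimodal framework with safe cross-attention and modality dropout for the ABAW expression recognition challenge. | spatiotemporal, multimodal | |
| 47 | Can Vision-Language Models Solve the Shell Game? | Proposes SGCoT to address the temporal reasoning difficulty vision-language models face in visual entity tracking. | spatiotemporal, chain-of-thought | |
| 48 | This Looks Distinctly Like That: Grounding Interpretable Recognition in Stiefel Geometry against Neural Collapse | Proposes the Adaptive Manifold Prototypes (AMP) framework to address prototype collapse in prototype networks, improving the interpretability and accuracy of fine-grained recognition. | AMP | |
| 49 | Adaptive MLP Pruning for Large Vision Transformers | Proposes an adaptive MLP pruning method that significantly reduces the parameter count of large vision Transformers without losing performance. | AMP | |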

🔬 Pillar 4: Generative Motion (1 paper)

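| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 50 | PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition | PRISM: proposes a streaming human motion generation method based on per-joint decomposition that significantly improves generation quality. | text-to-motion, motion generation, motion latent | |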

🔬 Pillar 6: Video Extraction (1 paper)

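| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 51 | Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time | Proposes the EcoG-Bench benchmark for evaluating how well embodied agents ground co-speech instructions in space and time. | egocentric, multimodal | |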
