cs.CV (2026-03-09)

📊 51 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (14 🔗3) · Pillar 3: Perception & Semantics (11 🔗1) · Pillar 9: Embodied Foundation Models (10 🔗1) · Pillar 1: Robot Control (5 🔗2) · Pillar 7: Motion Retargeting (5 🔗1) · Pillar 8: Physics-based Animation (4 🔗1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 2: RL Algorithms & Architecture (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving	Proposes SAMoE-VLA, which uses a scene-adaptive MoE to improve the performance and safety of VLA models for autonomous driving.	world model vision-language-action VLA
2	Toward Unified Multimodal Representation Learning for Autonomous Driving	Proposes a contrastive tensor pre-training framework for unified multimodal representation learning in autonomous driving.	representation learning contrastive learning scene understanding
3	SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation	Proposes SGG-R$^{\rm 3}$ to address bias and sparsity in scene graph generation.	reinforcement learning large language model multimodal
4	MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models	MINT: molecularly informed training of pathology foundation models, supervised by spatial transcriptomics.	distillation foundation model
5	Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model	Proposes MambaDance, built on Mamba and a diffusion model, to address temporal modeling and beat synchronization in dance generation.	Mamba human motion	✅
6	Geometric Transformation-Embedded Mamba for Learned Video Compression	Proposes a Mamba model with embedded geometric transformations to improve learned video compression.	Mamba motion estimation	✅
7	BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images	BuildMamba: a visual state-space model for multi-task building segmentation and height estimation from satellite images.	Mamba monocular depth
8	It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models	Proposes TickTockVQA to address the difficulty vision-language models have in reading analog clocks.	DPO direct preference optimization multimodal
9	ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation	ER-Pose: rethinks keypoint-driven single-stage human pose estimation to improve accuracy and efficiency.	representation learning
10	SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents	SPIRAL: a closed-loop framework for self-improving action world models via reflective planning agents.	world model
11	SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval	Proposes SAVE, which improves video-text retrieval through speech-aware video representation learning.	representation learning
12	MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data	MM-TS: dynamic temperature and margin scheduling for multimodal contrastive learning on long-tail data.	contrastive learning
13	ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning	ImageEdit-R1: a reinforcement-learning-driven multi-agent image editing framework.	reinforcement learning
14	Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared	Proposes a dictionary-guided cross-modal image fusion framework for fusion when the infrared image is missing.	representation learning large language model	✅

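🔬 Pillar 3: Perception & Semantics (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
15	ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting	ImprovedGS+: a C++/CUDA re-implementation that significantly improves the training speed and quality of 3D Gaussian Splatting.	3D gaussian splatting 3DGS gaussian splatting
16	HDR-NSFF: High Dynamic Range Neural Scene Flow Fields	Proposes HDR-NSFF for reconstructing dynamic high-dynamic-range scenes from monocular alternating-exposure video.	gaussian splatting splatting neural radiance field	✅
17	DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving	DynamicVGGT: learns dynamic point maps for 4D scene reconstruction in autonomous driving.	3D gaussian splatting gaussian splatting splatting
18	Improving Continual Learning for Gaussian Splatting based Environments Reconstruction on Commercial Off-the-Shelf Edge Devices	Proposes a precision-adaptive optimization framework that enables continual learning for Gaussian Splatting environment reconstruction on edge devices.	3DGS gaussian splatting splatting
19	ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation	Proposes the ViSA framework, which strengthens visual-spatial reasoning to improve UAV vision-language navigation.	open-vocabulary open vocabulary VLN
20	FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection	FOMO-3D: uses vision foundation models to tackle long-tailed 3D object detection.	Metric3D foundation model
21	Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors	Proposes an alignment-aware, reliability-gated multimodal fusion method that improves UAV detection across heterogeneous thermal-visual sensors.	optical flow multimodal
22	OSCAR: Occupancy-based Shape Completion via Acoustic Neural Implicit Representations	Proposes a vertebral ultrasound image completion method based on acoustic neural implicit representations, addressing occlusion and signal variation.	implicit representation
23	Fast Low-light Enhancement and Deblurring for 3D Dark Scenes	FLED-GS: a fast low-light enhancement and deblurring framework for reconstructing dark 3D scenes.	3DGS NeRF
24	Event-based Motion & Appearance Fusion for 6D Object Pose Tracking	Proposes a 6D object pose tracking method that fuses event-camera motion and appearance cues, suited to highly dynamic scenes.	optical flow
25	Speed3R: Sparse Feed-forward 3D Reconstruction Models	Speed3R: a sparse feed-forward 3D reconstruction model that significantly speeds up reconstruction.	VGGT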

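🔬 Pillar 9: Embodied Foundation Models (10 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models	AutoTraces: autoregressive trajectory forecasting with multimodal large language models, suited to environments shared by humans and robots.	large language model multimodal chain-of-thought
27	Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models	Uses multimodal large language models to synthesize defect images, improving power line insulator inspection.	large language model multimodal
28	MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals	MERLIN: builds multimodal LLMs that are robust at low SNR for electromagnetic signal processing.	large language model multimodal
29	AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition	AULLM++: structural reasoning with large language models for micro-expression recognition.	large language model
30	SiMO: Single-Modality-Operable Multimodal Collaborative Perception	Proposes SiMO to address the performance degradation of multimodal collaborative perception when a single modality fails.	multimodal	✅
31	Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations	Proposes GR3D, which strengthens MLLM spatial reasoning in geometrically referenced 3D scenes.	large language model multimodal
32	SecAgent: Efficient Mobile GUI Agent with Semantic Context	SecAgent: an efficient mobile GUI agent based on semantic context.	large language model multimodal
33	Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images	Proposes Visual Self-Fulfilling Alignment (VSFA), which uses threat-related images to shape safety-oriented multimodal large models.	large language model multimodal
34	From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation	Proposes a map-based AI approach that uses fine-tuned LLMs for semantic zone inference, improving ObjectNav performance.	large language model
35	Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis	Compares direct image synthesis with TikZ code generation for producing automata diagrams, with the aim of supporting computer science teaching.	large language model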

🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
36	$Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation	Proposes $Δ$VLA, a VLA model guided by priors on world knowledge variation, to improve robot manipulation performance.	manipulation VQ-VAE vision-language-action	✅
37	TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size	TeamHOI: learns a unified policy for cooperative human-object interaction among any number of agents.	humanoid humanoid control physically plausible
38	Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction	Proposes Spherical-GOF to address geometric inconsistency in 3D reconstruction from panoramic images.	quadruped 3D gaussian splatting 3DGS	✅
39	Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations	Uses minimal recognizable regions to study how humans and AI differ in egocentric action recognition.	manipulation egocentric spatiotemporal
40	X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection	Proposes X-AVDT, which uses audio-visual cross-attention for robust deepfake detection.	manipulation flow matching multimodal

🔬 Pillar 7: Motion Retargeting (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
41	Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades	Proposes a framework for controllable generation of complex human motion video via text-to-skeleton cascades.	human motion
42	Fusion-Poly: A Polyhedral Framework Based on Spatial-Temporal Fusion for 3D Multi-Object Tracking	Fusion-Poly: a polyhedral framework for 3D multi-object tracking based on spatial-temporal fusion.	motion prediction TAMP
43	TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization	TrianguLang: a pose-free 3D localization method based on geometry-aware semantic consensus.	geometric consistency embodied AI	✅
44	Talking Together: Synthesizing Co-Located 3D Conversations from Audio	Proposes a new method for synthesizing co-located 3D conversation animations.	spatial relationship
45	VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion	VSDiffusion: tackles ill-posed shadow generation with a visibility-constrained diffusion model.	geometric consistency

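🔬 Pillar 8: Physics-based Animation (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
46	Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout	Proposes a robust multimodal framework based on safe cross-attention and modality dropout for the ABAW expression recognition challenge.	spatiotemporal multimodal
47	Can Vision-Language Models Solve the Shell Game?	Proposes SGCoT to address the temporal reasoning difficulties of vision-language models in visual entity tracking.	spatiotemporal chain-of-thought
48	This Looks Distinctly Like That: Grounding Interpretable Recognition in Stiefel Geometry against Neural Collapse	Proposes the Adaptive Manifold Prototypes (AMP) framework to address prototype collapse in prototype networks, improving the interpretability and accuracy of fine-grained recognition.	AMP
49	Adaptive MLP Pruning for Large Vision Transformers	Proposes an adaptive MLP pruning method that significantly reduces the parameter count of large vision Transformers without loss of performance.	AMP	✅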

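🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
50	PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition	PRISM: a streaming human motion generation method based on per-joint decomposition that significantly improves generation quality.	text-to-motion motion generation motion latent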

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
51	Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time	Proposes the EcoG-Bench benchmark for evaluating how well embodied agents ground co-speech instructions in space and time.	egocentric multimodal
