cs.CV (2026-05-12)

📊 68 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (23 🔗6) | Pillar 9: Embodied Foundation Models (21 🔗6) | Pillar 3: Perception & Semantics (15 🔗2) | Pillar 1: Robot Control (4 🔗2) | Pillar 4: Generative Motion (2) | Pillar 7: Motion Retargeting (2 🔗1) | Pillar 5: Interaction & Reaction (1)

🔬 Pillar 2: RL & Architecture (23 papers)

#	Title	One-Sentence Summary	Tags	🔗
1	PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting	PointGS: semantically consistent unsupervised 3D point cloud segmentation using 3D Gaussian Splatting	contrastive learning 3D gaussian splatting gaussian splatting
2	SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture	SenseNova-U1: a unified multimodal understanding and generation model built on the NEO-unify architecture	world model world models vision-language-action
3	PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting	Proposes PairDropGS to address unstable sparse-view Gaussian splatting reconstruction	representation learning 3D gaussian splatting 3DGS
4	Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training	VISTA: proposes a vision-aware self-improvement training framework that improves the reasoning ability of multimodal large language models	preference learning large language model multimodal
5	Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation	Proposes a radar-modulated selection mechanism for radar-camera depth estimation	Mamba state space model MAE
6	Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction	Lite3R: a model-agnostic framework for efficient feed-forward 3D reconstruction that reduces computational cost while maintaining accuracy	linear attention teacher-student distillation	✅
7	Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images	Proposes the PCSR-Bench benchmark to diagnose the perspective-conditioned spatial reasoning of MLLMs on omnidirectional images	reward design reward shaping egocentric
8	CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating	Proposes CaC, a vision-language-model-based framework for improving the accuracy and interpretability of video anomaly detection	reinforcement learning spatiotemporal chain-of-thought
9	HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation	HorizonDrive: a self-corrective autoregressive world model for long-horizon driving simulation	world model world models distillation
10	TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles	Proposes TCP-SSM, which uses token-conditioned poles to improve the efficiency and interpretability of vision state space models	Mamba SSM state space model
11	VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference	Proposes VIP, a visual-guided prompt evolution method for efficient dense vision-language inference	VIP distillation open-vocabulary	✅
12	SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning	Proposes SyncDPO to address temporal synchronization in joint video-audio generation	preference learning DPO direct preference optimization	✅
13	3D-Belief: Embodied Belief Inference via Generative 3D World Modeling	Proposes 3D-Belief, which performs embodied belief inference through generative 3D world modeling	world model world models
14	Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos	Proposes a noise-aware temporal self-supervised contrastive learning method for colonoscopy videos	contrastive learning foundation model	✅
15	Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation	Proposes a visual-anchored distillation method based on reasoning-prefix masking, improving how well VLMs use visual information in multimodal reasoning	distillation multimodal
16	When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy	Proposes a perceptual entropy constraint to address diversity collapse in RLHF fine-tuning of flow models	flow matching RLHF	✅
17	Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution	Proposes a depth super-resolution method based on an interactive state space model with cross-modal local scanning	Mamba state space model
18	The DAWN of World-Action Interactive Models	Proposes WAIM to address the mutual dependence between world prediction and action generation	world model world models world action model
19	FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity	Proposes FIS-DiT, which uses training-free frame-interleaved sparsity to break the inference-speed bottleneck of video diffusion models	predictive model distillation spatiotemporal
20	Large-Small Model Collaboration for Farmland Semantic Change Detection	Proposes a large-small model collaboration framework to address pseudo-changes in farmland semantic change detection	Mamba multimodal	✅
21	Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations	Proposes CoDAAR, which achieves cross-modal domain generalization through semantically aligned discrete representations	representation learning multimodal
22	Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data	Proposes SAGL, which learns subspace-preserving sparse attention graphs from heterogeneous multiview data for unsupervised transfer learning	linear attention representation learning
23	DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers	Proposes DORA, a reinforcement-learning-based online inference method for dynamic token merging in ViTs	reinforcement learning distillation

🔬 Pillar 9: Embodied Foundation Models (21 papers)

#	Title	One-Sentence Summary	Tags	🔗
24	Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models	Proposes GAP, a granular alignment paradigm for visual reasoning in multimodal large language models	large language model multimodal
25	Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts	Proposes an all-in-one image restoration framework based on multimodal large language models to address complex degradation modeling	large language model multimodal
26	UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs	UniVLR: unifies text and visual latent reasoning to improve the visual thinking efficiency of multimodal LLMs	large language model multimodal chain-of-thought
27	Dynamic Execution Commitment of Vision-Language-Action Models	Proposes A3, an adaptive action acceptance mechanism, to address dynamic execution commitment in VLA models	vision-language-action VLA
28	AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward	AlphaGRPO: unlocks self-reflective multimodal generation in UMMs through decompositional verifiable rewards	multimodal	✅
29	G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models	Proposes G$^2$TR, which improves the inference efficiency of separate-encoder unified multimodal models through generation-guided visual token reduction	multimodal
30	Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models	Proposes the ClipSum framework, which uses frozen CLIP vision-language features for multimodal summarization of instructional videos	multimodal	✅
31	OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models	Proposes OTT-Vid, which compresses temporal tokens via optimal transport to improve Video-LLM efficiency	large language model
32	CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection	Proposes the CAST framework, which fuses multi-scale topological structure for efficient multimodal coreset selection	multimodal
33	Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment	Instruct-ICL: uses instruction-guided in-context learning to improve multimodal large language models on post-disaster damage assessment	large language model multimodal chain-of-thought
34	PresentAgent-2: Towards Generalist Multimodal Presentation Agents	Proposes PresentAgent-2, a generalist multimodal presentation agent that supports multiple presentation modes	multimodal	✅
35	Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs	ContextGuard: a context-preserving token pruning framework for Omni-LLMs that improves efficiency while maintaining performance	large language model multimodal
36	Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration	Proposes Logit-Attention Divergence to address position bias caused by attention bias in multi-image retrieval	large language model multimodal	✅
37	When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs	Uses visual attention structure to reveal hallucination in multimodal large language models and proposes the LaSCD decoding strategy	large language model multimodal	✅
38	LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs	Proposes LDDR, a linear-DPP-based dynamic-resolution frame sampling method that improves video MLLM performance	large language model multimodal
39	Elastic Attention Cores for Scalable Vision Transformers	Proposes VECA, which achieves scalable vision Transformers through elastic attention cores	foundation model
40	H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes	Proposes H2G, a hierarchy-aware hyperbolic grouping method for 3D scene understanding	foundation model
41	Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters	Proposes Chronicles-OCR to evaluate the cross-temporal visual perception of VLLMs along the evolutionary trajectory of Chinese characters	large language model	✅
42	SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions	SB-BEVFusion: improves the robustness of multimodal fusion under sensor malfunction and data corruption	multimodal
43	M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection	M$^4$-SAM: a memory-augmented multimodal mixture-of-experts model for RGB-D video salient object detection	foundation model
44	ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes	ShapeCodeBench: a renewable benchmark for perception-to-program reconstruction of synthetic shape scenes	multimodal

🔬 Pillar 3: Perception & Semantics (15 papers)

#	Title	One-Sentence Summary	Tags	🔗
45	3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark	Proposes an efficient 3D Gaussian Splatting method for novel view synthesis of synchronized multi-view dynamic scenes	3D gaussian splatting 3DGS gaussian splatting
46	Focusable Monocular Depth Estimation	Proposes FocusDepth to address insufficient depth accuracy in target regions for monocular depth estimation	depth estimation monocular depth Depth Anything
47	Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction	AmbiSuR: a Gaussian-splatting-based surface reconstruction framework robust to photometric ambiguity	3D reconstruction gaussian splatting splatting	✅
48	PD-4DGS: Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming	Proposes PD-4DGS, a progressive decomposition of 4D Gaussian Splatting for bandwidth-adaptive dynamic scene streaming	3DGS gaussian splatting splatting
49	VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors	VidSplat: Gaussian splatting reconstruction with geometry-guided video diffusion priors, improving 3D reconstruction under sparse views	gaussian splatting splatting scene reconstruction
50	4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation	Proposes 4DVGGT-D, which improves the 4D Visual Geometry Transformer through dynamic depth estimation for dynamic scene reconstruction from monocular video	depth estimation scene reconstruction foundation model
51	GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction	GeoQuery: a geometry-guided diffusion model for sparse-view 3D reconstruction that improves reconstruction quality	3D gaussian splatting 3DGS 3D reconstruction
52	PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization	Proposes PoseCompass to address synthetic pose selection in visual localization	3D gaussian splatting 3DGS gaussian splatting
53	BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding	BARISTA: a multi-task egocentric benchmark dataset for compositional visual understanding	scene understanding egocentric	✅
54	Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision	Proposes an urban risk-aware navigation system based on VQA event maps to help people with low vision travel safely	scene understanding large language model multimodal
55	PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations	PointForward: proposes a feedforward driving scene reconstruction method based on point-aligned representations	3D gaussian splatting 3DGS gaussian splatting
56	The Midas Touch for Metric Depth	Proposes MTD, which converts relative depth to metric depth using extremely sparse 3D data, improving cross-scene generalization	depth estimation metric depth
57	Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances	AFFORDMEM: grounds 3D functional affordances using cross-scene and in-scene memory	affordance
58	LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing	Proposes LiBrA-Net for ultra-high-definition 4K video dehazing	optical flow spatiotemporal
59	TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion	Proposes TriBand-BEV, which achieves real-time LiDAR-only 3D pedestrian detection through height-aware BEV and high-resolution feature fusion	3D reconstruction

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-Sentence Summary	Tags	🔗
60	OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation	OmniHumanoid: proposes a cross-embodiment humanoid video generation framework that adapts without paired data	humanoid human-to-robot cross-embodiment
61	EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras	EgoEV-HandPose: egocentric 3D hand pose estimation and gesture recognition using stereo event cameras	bi-manual monocular depth egocentric	✅
62	EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera	EgoForce: forearm-guided monocular estimation of camera-space 3D hand pose	manipulation egocentric hand reconstruction	✅
63	WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting	WildRelight: proposes a real-world single-image relighting benchmark and a physics-guided adaptation method	sim-to-real

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-Sentence Summary	Tags	🔗
64	ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation	Proposes the ScaleMoGen framework to address fine-grained prediction in human motion generation	motion generation motion tokenizer MoMask
65	Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers	Proposes a dynamic full-body human-object interaction motion generation framework that blends pre-trained modular controllers	motion diffusion model motion diffusion physically plausible

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-Sentence Summary	Tags	🔗
66	GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization	GaitProtector: impersonation-driven gait de-identification via training-free diffusion latent optimization	latent optimization spatiotemporal
67	EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion	EchoTracker2: enhances myocardial point tracking by modeling local motion	motion estimation spatiotemporal	✅

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
68	PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition	Proposes PoseBridge to address semantic loss in zero-shot skeleton-based action recognition	human-object interaction
