cs.CV (2025-12-31)

📊 19 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 1: Robot Control (1 🔗1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents	VLN-MME: diagnoses the capabilities of multimodal large language models on language-guided visual navigation tasks.	VLN large language model multimodal
2	FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation	FinMMDocR: a benchmark for financial multimodal reasoning that focuses on scenario awareness, document understanding, and multi-step computation.	large language model multimodal
3	A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data	Proposes SMAGNet, which uses SAR and incomplete MSI data for accurate and efficient multimodal flood extent mapping.	multimodal
4	MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding	Proposes the MoniRefer dataset and the Moni3DVG method for 3D visual grounding from roadside infrastructure.	visual grounding
5	RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios	Proposes the RGBT-Ground benchmark for evaluating visual grounding on RGB-T images in complex scenes.	visual grounding
6	UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images	Proposes UR-Bench for evaluating the multi-hop reasoning of multimodal large models on ultra-high-resolution images.	large language model multimodal
7	EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation	Proposes the EchoFoley task and the EchoVidia framework for fine-grained, event-driven creative sound generation from video.	multimodal

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
8	Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression	Splatwizard: a comprehensive benchmark toolkit for 3D Gaussian Splatting compression.	3D gaussian splatting 3DGS gaussian splatting	✅
9	FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM	FoundationSLAM: uses depth foundation models for end-to-end dense visual SLAM.	visual SLAM geometric consistency foundation model
10	Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark	Proposes Spatial4D-Bench for comprehensively evaluating the 4D spatial intelligence of multimodal large language models.	scene understanding spatial relationship spatiotemporal
11	Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation	Proposes a projection-based adversarial attack on monocular depth estimation using physics-in-the-loop optimization.	depth estimation monocular depth
12	HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films	Proposes HaineiFRDM, which uses a diffusion model to restore defects in fast-movement films.	optical flow

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning	UniC-Lift: unified 3D instance segmentation via contrastive learning.	contrastive learning 3D gaussian splatting 3DGS
14	TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model	TeleWorld: a real-time framework for dynamic multimodal synthesis based on a 4D world model.	world model distillation scene reconstruction
15	VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition	VideoCuRL: proposes video curriculum reinforcement learning with orthogonal difficulty decomposition to improve video understanding.	reinforcement learning optical flow spatiotemporal
16	PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation	Proposes the PhyGDPO framework, which achieves physically consistent text-to-video generation through physics-aware groupwise preference optimization.	direct preference optimization chain-of-thought	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
17	ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands	ShowUI-$π$: proposes a flow-based generative model for dexterous manipulation of GUI interfaces.	manipulation dexterous hand dexterous manipulation	✅

🔬 Pillar 4: Generative Motion (1 paper)

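#	Title	One-line summary	Tags	🔗	⭐
18	Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression	Proposes a perceptual low-resolution video compression method based on hierarchical vector-quantized latents, suited to bandwidth-constrained settings.	VQ-VAE spatiotemporal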

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
19	GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction	GaMO: geometry-aware multi-view diffusion outpainting for sparse-view 3D reconstruction.	geometric consistency	✅
