cs.CV (2025-12-23)

📊 17 papers in total | 🔗 4 with code

🎯 Areas of Interest

Pillar 9: Embodied Foundation Models (7 🔗2) · Pillar 3: Spatial Perception & Semantics (4 🔗1) · Pillar 2: RL Algorithms & Architecture (3) · Pillar 7: Motion Retargeting (1) · Pillar 1: Robot Control (1 🔗1) · Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-line Summary	Tags	🔗	⭐
1	NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts	NullBUS: multimodal mixed-supervision breast ultrasound segmentation via nullable global-local prompts	multimodal
2	FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models	FlashVLM: text-guided visual token selection that makes large multimodal models more efficient	multimodal
3	VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs	VideoScaffold: elastic-scale visual hierarchies for streaming video understanding in MLLMs	large language model, multimodal	✅
4	Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference	Proposes an input-adaptive visual preprocessing method that improves FastVLM inference efficiency on visual question answering tasks.	multimodal	✅
5	SpatialTree: How Spatial Abilities Branch Out in MLLMs	Builds SpatialTree to systematically evaluate and improve the spatial cognition of MLLMs	multimodal
6	Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models	Proposes the DSR Suite and a geometry selection module (GSM) to improve dynamic spatial reasoning in VLMs	foundation model
7	Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark	Proposes NL-DIR, a benchmark dataset for retrieving document images from natural-language descriptions	large language model

🔬 Pillar 3: Spatial Perception & Semantics (4 papers)

#	Title	One-line Summary	Tags	🔗	⭐
8	Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS)	Uses 3D Gaussian Splatting to make annotation for 5D apple pose estimation more efficient	3D gaussian splatting, 3DGS, gaussian splatting
9	SirenPose: Dynamic Scene Reconstruction via Geometric Supervision	SirenPose: accurate, temporally consistent dynamic scene reconstruction via geometric supervision	scene reconstruction, physically plausible, spatiotemporal
10	AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment	AlignPose: generalizable 6D pose estimation via multi-view feature-metric alignment	6D pose estimation
11	SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images	SmartSplat: a feature-aware Gaussian splatting (GS) image compression framework for efficient compression and high-quality reconstruction of ultra-high-resolution images	3D gaussian splatting, gaussian splatting, splatting	✅

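🔬 Pillar 2: RL Algorithms & Architecture (3 papers)

#	Title	One-line Summary	Tags	🔗	⭐
12	Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition	Proposes multimodal alignment, translation, fusion, and transfer methods that improve understanding and recognition of complex inputs	distillation, egocentric, multimodal
13	Active Intelligence in Video Avatars via Closed-loop World Modeling	Proposes the ORCA framework, which achieves active intelligence in video avatars through closed-loop world modeling	world model
14	Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture	Proposes a multi-objective hybrid knowledge distillation framework for efficient deep learning in smart agriculture	distillation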

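🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
15	Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding	Incorporates physical-force information into motion understanding, improving gait recognition, action recognition, and video captioning	human motion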

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
16	LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving	LEAD: minimizes learner-expert asymmetry in end-to-end driving, improving driving performance in the CARLA simulator	sim-to-real, imitation learning	✅

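🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
17	DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning	Proposes the DETACH framework, which fuses exocentric video with ambient sensors through decomposed spatio-temporal alignment	egocentric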
