cs.CV (2025-10-22)

📊 30 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (9 🔗1)
Pillar 3: Perception & Semantics (8 🔗5)
Pillar 9: Embodied Foundation Models (8 🔗2)
Pillar 1: Robot Control (2)
Pillar 6: Video Extraction (2)
Pillar 4: Generative Motion (1)

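🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis	Proposes the PICK framework, which uses multimodal large language models for drawing-based psychoanalysis and improves expert-level reasoning.	reinforcement learning, large language model, multimodal
2	MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation	Proposes MobiAct, an efficient MAV action recognition framework based on MobileNetV4, contrastive learning, and knowledge distillation.	contrastive learning, distillation
3	Transformed Multi-view 3D Shape Features with Contrastive Learning	Proposes a contrastive-learning-based Transformer method for extracting multi-view 3D shape features.	representation learning, contrastive learning
4	Unified Reinforcement and Imitation Learning for Vision-Language Models	Proposes a unified reinforcement and imitation learning (RIL) algorithm for training lightweight vision-language models.	reinforcement learning, imitation learning
5	SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models	SCoPE VLM: selective context processing for efficient document navigation in vision-language models.	reinforcement learning, multimodal
6	From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction	Proposes the Policy World Model (PWM) for collaborative state-action prediction, improving autonomous driving planning.	world model	✅
7	HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking	Proposes the Hierarchical Asymmetric Distillation (HAD) framework to bridge spatio-temporal gaps in event-camera object tracking.	distillation
8	PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation	Proposes PRGCN, which uses a graph memory network to reuse human pose patterns across sequences, improving 3D human pose estimation accuracy.	Mamba, spatiotemporal
9	Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning	Proposes AMR, an augmented moment retrieval framework with zero external dependencies, addressing data sparsity, ambiguous boundaries, and insufficient semantic discrimination.	curriculum learning, distillation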

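🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
10	VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction	VGD: visual geometry Gaussian splatting for feed-forward surround-view driving scene reconstruction.	gaussian splatting, splatting, scene reconstruction
11	Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses	Proposes a gradient-based 3DGS filtering method to address artifacts in novel view synthesis from extreme viewpoints.	3D gaussian splatting, 3DGS, gaussian splatting	✅
12	Advances in 4D Representation: Geometry, Motion, and Interaction	A survey of 4D representation methods for 4D generation and reconstruction, organized around geometry, motion, and interaction.	3D gaussian splatting, 3DGS, gaussian splatting	✅
13	A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP	Proposes a training-free framework for open-vocabulary image segmentation and recognition based on EfficientNet and CLIP.	open-vocabulary, open vocabulary
14	Toward A Better Understanding of Monocular Depth Evaluation	Proposes a new evaluation metric for monocular depth estimation that aligns better with human perception.	depth estimation, monocular depth	✅
15	AegisRF: Adversarial Perturbations Guided with Sensitivity for Protecting Intellectual Property of Neural Radiance Fields	AegisRF: protects the intellectual property of NeRFs using sensitivity-guided adversarial perturbations.	NeRF, neural radiance field	✅
16	A Matter of Time: Revealing the Structure of Time in Vision-Language Models	Proposes the TIME10k benchmark, reveals a low-dimensional, nonlinear structure of temporal information in vision-language models, and constructs a timeline representation.	open-vocabulary, open vocabulary, multimodal	✅
17	Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization	Addresses scale shift in crowd localization with causal feature decoupling and heterogeneous processing, improving domain generalization.	scene understanding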

🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
18	DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents	DaMo: a data mixing optimizer for fine-tuning multimodal LLMs for mobile phone agents.	large language model, multimodal	✅
19	Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?	Reveals the "modal aphasia" phenomenon in multimodal models: visual memory is accurate but textual description fails.	multimodal
20	I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs	Uses visual search as a behavioural test to evaluate the visual perception of multimodal large language models (MLLMs).	large language model, multimodal
21	Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation	Proposes Decomposed Attention Fusion (DecAF) for training-free video reasoning segmentation with MLLMs.	large language model, multimodal	✅
22	Automating Iconclass: LLMs and RAG for Large-Scale Classification of Religious Woodcuts	Uses LLMs and RAG to automate large-scale Iconclass classification of religious woodcut images.	large language model
23	FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking	FutrTrack: a camera-LiDAR fusion Transformer for 3D multiple object tracking.	multimodal
24	Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing	Proposes Pico-Banana-400K, a large-scale dataset to advance research on text-guided image editing.	multimodal
25	CARES: Context-Aware Resolution Selector for VLMs	Proposes CARES, a context-aware resolution selector that reduces VLM compute cost while maintaining performance.	multimodal

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
26	Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes	Proposes the MV-RoboBench benchmark for evaluating multi-view spatial reasoning of vision-language models in robotic scenes.	manipulation, embodied AI, vision-language-action
27	Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets	Seed3D 1.0: proposes a framework for generating high-quality, physics-simulation-ready 3D assets from a single image.	manipulation, embodied AI, foundation model

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
28	Is This Tracker On? A Benchmark Protocol for Dynamic Tracking	Proposes ITTO, a new benchmark protocol for dynamic point tracking that focuses on real-world challenges.	egocentric
29	PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis	PoseCrafter: enhances extreme pose estimation with hybrid video synthesis.	feature matching

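🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
30	OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation	OmniMotion-X: a versatile multimodal whole-body motion generation framework that achieves realistic, controllable, interactive long-duration motion.	text-to-motion, motion generation, SMPL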
