cs.CV(2025-10-22)

📊 30 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (9 🔗1) · Pillar 3: Perception & Semantics (8 🔗5) · Pillar 9: Embodied Foundation Models (8 🔗2) · Pillar 1: Robot Control (2) · Pillar 6: Video Extraction (2) · Pillar 4: Generative Motion (1)

🔬 Pillar 2: RL & Architecture (9 papers)

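# | Title | One-line summary | Tags | 🔗
1 | Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis | Proposes the PICK framework, which uses multimodal large language models for drawing-based psychoanalysis and improves expert-level reasoning. | reinforcement learning, large language model, multimodal
2 | MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation | Proposes MobiAct, an efficient MAV action recognition framework based on MobileNetV4, contrastive learning, and knowledge distillation. | contrastive learning, distillation
3 | Transformed Multi-view 3D Shape Features with Contrastive Learning | Proposes a contrastive-learning-based Transformer method for extracting multi-view 3D shape features. | representation learning, contrastive learning
4 | Unified Reinforcement and Imitation Learning for Vision-Language Models | Proposes a unified reinforcement and imitation learning (RIL) algorithm for training lightweight vision-language models. | reinforcement learning, imitation learning
5 | SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models | SCoPE VLM: selective context processing in vision-language models for efficient document navigation. | reinforcement learning, multimodal
6 | From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction | Proposes the Policy World Model (PWM) for collaborative state-action prediction, improving autonomous driving planning. | world model
7 | HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking | Proposes the Hierarchical Asymmetric Distillation (HAD) framework to bridge spatio-temporal gaps in event-camera object tracking. | distillation
8 | PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation | Proposes PRGCN, which uses a graph memory network to reuse human pose patterns across sequences, improving 3D human pose estimation accuracy. | Mamba, spatiotemporal
9 | Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning | Proposes AMR, an augmented moment retrieval framework with zero external dependencies that addresses data sparsity, ambiguous boundaries, and insufficient semantic discrimination. | curriculum learning, distillation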

🔬 Pillar 3: Perception & Semantics (8 papers)

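# | Title | One-line summary | Tags | 🔗
10 | VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction | VGD: visual geometry Gaussian splatting for feed-forward surround-view driving scene reconstruction. | gaussian splatting, splatting, scene reconstruction
11 | Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses | Proposes a gradient-based 3DGS filtering method that addresses artifacts in novel view synthesis from extreme viewpoints. | 3D gaussian splatting, 3DGS, gaussian splatting
12 | Advances in 4D Representation: Geometry, Motion, and Interaction | Presents a survey of 4D representation methods based on geometry, motion, and interaction, targeting 4D generation and reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting
13 | A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP | Proposes a training-free framework for open-vocabulary image segmentation and recognition based on EfficientNet and CLIP. | open-vocabulary, open vocabulary
14 | Toward A Better Understanding of Monocular Depth Evaluation | Proposes new evaluation metrics for monocular depth estimation that align better with human perception. | depth estimation, monocular depth
15 | AegisRF: Adversarial Perturbations Guided with Sensitivity for Protecting Intellectual Property of Neural Radiance Fields | AegisRF: protects the intellectual property of NeRFs with sensitivity-guided adversarial perturbations. | NeRF, neural radiance field
16 | A Matter of Time: Revealing the Structure of Time in Vision-Language Models | Proposes the TIME10k benchmark, reveals a low-dimensional nonlinear structure of temporal information in vision-language models, and constructs a timeline representation. | open-vocabulary, open vocabulary, multimodal
17 | Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization | Addresses scale shift in crowd localization with causal feature disentanglement and heterogeneous processing, improving domain generalization. | scene understanding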

🔬 Pillar 9: Embodied Foundation Models (8 papers)

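# | Title | One-line summary | Tags | 🔗
18 | DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents | DaMo: a data mixing optimizer for fine-tuning multimodal LLMs for mobile phone agents. | large language model, multimodal
19 | Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory? | Reveals the "modal aphasia" phenomenon in multimodal models, where visual memory is accurate but textual description fails. | multimodal
20 | I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs | Uses visual search behavioural tests to evaluate the visual perception of multimodal large language models (MLLMs). | large language model, multimodal
21 | Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation | Proposes Decomposed Attention Fusion (DecAF) for training-free video reasoning segmentation with MLLMs. | large language model, multimodal
22 | Automating Iconclass: LLMs and RAG for Large-Scale Classification of Religious Woodcuts | Uses LLMs and RAG to automate large-scale Iconclass classification of religious woodcut images. | large language model
23 | FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking | FutrTrack: a camera-LiDAR fusion Transformer for 3D multiple object tracking. | multimodal
24 | Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing | Introduces Pico-Banana-400K, a large-scale dataset that advances research on text-guided image editing. | multimodal
25 | CARES: Context-Aware Resolution Selector for VLMs | Proposes CARES, a context-aware resolution selector that reduces VLM compute cost while maintaining performance. | multimodal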

🔬 Pillar 1: Robot Control (2 papers)

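# | Title | One-line summary | Tags | 🔗
26 | Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes | Proposes the MV-RoboBench benchmark for evaluating the multi-view spatial reasoning of vision-language models in robotic scenes. | manipulation, embodied AI, vision-language-action
27 | Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets | Seed3D 1.0: proposes a framework that generates high-quality, physics-simulation-ready 3D assets from a single image. | manipulation, embodied AI, foundation model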

🔬 Pillar 6: Video Extraction (2 papers)

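# | Title | One-line summary | Tags | 🔗
28 | Is This Tracker On? A Benchmark Protocol for Dynamic Tracking | Proposes ITTO, a new benchmark protocol for dynamic point tracking that focuses on real-world challenges. | egocentric
29 | PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis | PoseCrafter: enhances extreme pose estimation with hybrid video synthesis. | feature matching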

🔬 Pillar 4: Generative Motion (1 paper)

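# | Title | One-line summary | Tags | 🔗
30 | OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation | OmniMotion-X: a versatile multimodal whole-body motion generation framework that achieves realistic, controllable, interactive long-duration motion. | text-to-motion, motion generation, SMPL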
