cs.CV (2025-10-30)

📊 30 papers total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14, 🔗 1) · Pillar 3: Perception & Semantics (7) · Pillar 1: Robot Control (5, 🔗 1) · Pillar 2: RL & Architecture (3, 🔗 1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 1 | All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles | Surveys next-generation fusion-based object detection with multimodal LLMs/VLMs for autonomous driving | large language model, multimodal | |
| 2 | OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research | OracleAgent: a multimodal reasoning agent system for oracle bone script research | large language model, multimodal | |
| 3 | AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception | AD-SAM: fine-tunes the SAM vision foundation model for autonomous driving perception | foundation model | |
| 4 | ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection | ProstNFound+: a prospective study of medical foundation models for prostate cancer detection | foundation model | |
| 5 | FARM: Fine-Tuning Geospatial Foundation Models for Intra-Field Crop Yield Regression | FARM: fine-tunes geospatial foundation models for intra-field crop yield regression | foundation model | |
| 6 | SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation | SpinalSAM-R1: a vision-language multimodal interactive system for spine CT segmentation | multimodal | |
| 7 | MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation | Proposes MoME, a vision-language mixture-of-experts model for medical image segmentation | large language model, foundation model | |
| 8 | WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios | WOD-E2E: a Waymo open dataset targeting long-tail scenarios in end-to-end driving | large language model, multimodal | |
| 9 | Semantic Frame Aggregation-based Transformer for Live Video Comment Generation | Proposes a semantic frame aggregation Transformer for live-video comment generation, improving comment relevance | multimodal | |
| 10 | OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes | OmniX: uses unified panoramic generation and perception to produce graphics-ready 3D scenes | multimodal | |
| 11 | SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models | SteerVLM: robust control of vision-language models via lightweight activation steering | multimodal | |
| 12 | Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition | Proposes representation-level counterfactual calibration to address context bias in zero-shot recognition | multimodal | |
| 13 | Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models | Proposes the AoT-PsyPhyBENCH benchmark to evaluate whether vision-language models understand the direction of time in videos | multimodal | |
| 14 | ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts | ConceptScope: quantifies and identifies dataset bias via disentangled visual concept representations | foundation model | |

🔬 Pillar 3: Perception & Semantics (7 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 15 | JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting | Proposes JOGS, which jointly optimizes pose estimation and 3D Gaussian splatting without pre-calibrated inputs | 3D gaussian splatting, gaussian splatting, splatting | |
| 16 | The Impact and Outlook of 3D Gaussian Splatting | Surveys recent advances in 3D Gaussian splatting, covering efficiency, scalability, and realism, and explores its mathematical foundations | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 17 | DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting | Proposes DC4GS, directional-consistency-driven adaptive density control that improves 3D Gaussian splatting reconstruction quality | 3D gaussian splatting, gaussian splatting, splatting | |
| 18 | Towards Reliable Sea Ice Drift Estimation in the Arctic: Deep Learning Optical Flow on RADARSAT-2 | Uses deep-learning optical flow to improve the reliability of sea ice drift estimation from RADARSAT-2 SAR imagery | optical flow, motion estimation | |
| 19 | HEIR: Learning Graph-Based Motion Hierarchies | Proposes a graph-based hierarchical motion modeling method for motion dynamics | 3D gaussian splatting, gaussian splatting, splatting | |
| 20 | AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency | MOVAI: an AI-based framework for high-quality text-to-video generation with improved temporal consistency | scene understanding, multimodal | |
| 21 | MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models | MoTDiff: uses diffusion models to estimate high-resolution motion trajectories from a single blurred image | optical flow | |

🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 22 | ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning | ThinkMorph: visual manipulation abilities emerge from multimodal interleaved chain-of-thought reasoning | manipulation, multimodal, chain-of-thought | |
| 23 | Emu3.5: Native Multimodal Models are World Learners | Emu3.5: a native multimodal model that learns about the world by predicting the next state across vision and language | manipulation, reinforcement learning, world model | |
| 24 | Self-Improving Vision-Language-Action Models with Data Generation via Residual RL | Proposes the PLD framework, which self-improves vision-language-action models via residual reinforcement learning and data generation | manipulation, reinforcement learning, vision-language-action | |
| 25 | Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving | Proposes CATG, a constraint-aware flow-matching trajectory generation framework for end-to-end autonomous driving that addresses mode collapse in imitation learning | manipulation, imitation learning, flow matching | |
| 26 | Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | Evaluates zero-shot reasoning in video models: proposes the MME-CoF benchmark and analyzes the limitations of Veo-3 | manipulation | |

🔬 Pillar 2: RL & Architecture (3 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 27 | The Quest for Generalizable Motion Generation: Data, Model, and Evaluation | Proposes the ViMoGen framework, which transfers knowledge from video generation to improve the generalization of 3D human motion generation | flow matching, motion generation, human motion | |
| 28 | Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning | Proposes an exemplar-free incremental relation distillation framework for human-object interaction detection in dynamic environments | representation learning, distillation, human-object interaction | |
| 29 | EgoExo-Con: Exploring View-Invariant Video Temporal Understanding | Proposes the EgoExo-Con benchmark and the View-GRPO framework to improve consistent temporal understanding across viewpoints in video LLMs | reinforcement learning, egocentric | |

🔬 Pillar 6: Video Extraction (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 30 | CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark | Proposes CRAG-MM, a comprehensive multimodal multi-turn conversational RAG benchmark for wearable-device scenarios | egocentric | |
