cs.CV (2025-10-29)

📊 21 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 3: Perception & Semantics (5) · Pillar 2: RL & Architecture (3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 5: Interaction & Reaction (2 🔗1) · Pillar 6: Video Extraction (1 🔗1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

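# | Title | Key point | Tags | 🔗
1 | MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding | MMEdge: accelerates on-device multimodal inference via pipelined sensing and encoding. | multimodal
2 | Test-Time Adaptive Object Detection with Foundation Model | Proposes a foundation-model-based test-time adaptive object detection method that needs no source data and is not restricted to a fixed set of categories. | foundation model
3 | Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders | Proposes stacked temporal attention modules to strengthen temporal understanding in Video-LLMs. | large language model, multimodal
4 | Habitat and Land Cover Change Detection in Alpine Protected Areas: A Comparison of AI Architectures | Compares AI architectures for habitat and land cover change detection in alpine protected areas. | foundation model, multimodal
5 | Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling | Proposes an AVSE system that jointly models separation and dereverberation to improve speech quality in complex scenarios. | multimodal
6 | CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments | CAVE: a benchmark for detecting and explaining real-world visual anomalies that challenges the commonsense reasoning of vision-language models. | visual grounding
7 | VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations | Proposes VADB, a large-scale video aesthetics database with professional, multi-dimensional annotations, and builds the VADB-Net model. | multimodal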

🔬 Pillar 3: Perception & Semantics (5 papers)

# | Title | Key point | Tags | 🔗
8 | D$^2$GS: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction | Proposes D$^2$GS, a LiDAR-free method for high-accuracy urban scene reconstruction. | metric depth, gaussian splatting, splatting
9 | LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation | Proposes LangHOPS, the first MLLM-based framework for open-vocabulary hierarchical object part segmentation. | open-vocabulary, open vocabulary, large language model
10 | Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments | Proposes a vision-language fusion framework for zero-shot scene understanding in real-world scenes. | scene understanding, large language model, multimodal
11 | SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments | SPADE: a sparsity-adaptive depth estimator for zero-shot monocular depth estimation underwater. | depth estimation, monocular depth
12 | EA3D: Online Open-World 3D Object Extraction from Streaming Videos | EA3D: extracts open-world 3D objects online from video streams, enabling geometric reconstruction and scene understanding. | visual odometry, scene understanding

🔬 Pillar 2: RL & Architecture (3 papers)

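# | Title | Key point | Tags | 🔗
13 | RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models | RT-DETRv4: uses vision foundation models to painlessly improve real-time object detection performance. | distillation, foundation model
14 | AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians | Proposes Atlanta-world-guided implicit structured Gaussian splatting for high-accuracy reconstruction of indoor and outdoor scenes. | world model, gaussian splatting, splatting
15 | Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement | Proposes a Mamba-based low-light image enhancement method with Hilbert scanning that improves image detail and visual quality. | Mamba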

🔬 Pillar 8: Physics-based Animation (2 papers)

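# | Title | Key point | Tags | 🔗
16 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | Proposes the StreamingCoT dataset for temporal dynamics understanding and multimodal chain-of-thought reasoning in streaming video question answering. | spatiotemporal, large language model, multimodal
17 | Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples | Proposes an MDP-based informative sample selection model for skeleton-based action recognition that improves accuracy when training samples are limited. | spatiotemporal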

🔬 Pillar 5: Interaction & Reaction (2 papers)

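# | Title | Key point | Tags | 🔗
18 | Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection | Proposes the VDRP framework to address visual diversity and region awareness in zero-shot HOI detection. | human-object interaction, HOI
19 | Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer | Proposes Brain-IT to address the reliability problem in image reconstruction from fMRI. | interaction transformer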

🔬 Pillar 6: Video Extraction (1 paper)

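# | Title | Key point | Tags | 🔗
20 | Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks | Surveys large models for multimodal spatial reasoning and builds an open benchmark evaluation suite. | egocentric, spatial relationship, embodied AI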

🔬 Pillar 1: Robot Control (1 paper)

# | Title | Key point | Tags | 🔗
21 | Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation | Proposes a diffusion-driven progressive target manipulation method for source-free domain adaptation. | manipulation
