cs.CV(2025-10-29)
📊 16 papers total | 🔗 5 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6 🔗3)
Pillar 3: Perception & Semantics (3)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 2: RL & Architecture (2)
Pillar 6: Video Extraction (1 🔗1)
Pillar 5: Interaction & Reaction (1)
Pillar 1: Robot Control (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding | MMEdge: accelerates on-device multimodal inference through pipelined sensing and encoding | multimodal | | |
| 2 | Test-Time Adaptive Object Detection with Foundation Model | Proposes a foundation-model-based test-time adaptive object detection method that removes the dependence on source data | foundation model | ✅ | |
| 3 | Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders | Proposes STAVE, which strengthens temporal understanding in Video-LLMs by stacking temporal attention in the vision encoder | large language model multimodal | ✅ | |
| 4 | Habitat and Land Cover Change Detection in Alpine Protected Areas: A Comparison of AI Architectures | Compares AI architectures for habitat and land cover change detection in alpine protected areas | foundation model multimodal | | |
| 5 | CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments | CAVE: a benchmark dataset for detecting and explaining commonsense anomalies in visual environments | visual grounding | | |
| 6 | VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations | Proposes the VADB database and the VADB-Net framework for video aesthetic assessment | multimodal | ✅ | |
🔬 Pillar 3: Perception & Semantics (3 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | D$^2$GS: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction | Proposes D$^2$GS for LiDAR-free urban scene reconstruction | metric depth gaussian splatting splatting | | |
| 8 | LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation | LangHOPS: an open-vocabulary hierarchical part segmentation framework built on multimodal large language models | open-vocabulary open vocabulary large language model | | |
| 9 | SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments | SPADE: a sparsity-adaptive depth estimator for zero-shot, real-time monocular depth estimation in underwater environments | depth estimation monocular depth | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | Proposes the StreamingCoT dataset for temporal dynamics understanding and multimodal chain-of-thought reasoning in streaming video question answering | spatiotemporal large language model multimodal | ✅ | |
| 11 | Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples | Proposes an MDP-based informative sample selection model for skeleton-based action recognition that improves accuracy when training samples are limited | spatiotemporal | | |
🔬 Pillar 2: RL & Architecture (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models | RT-DETRv4: painlessly improves real-time object detection using vision foundation models | distillation foundation model | | |
| 13 | Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement | Proposes a Mamba-based low-light image enhancement method with Hilbert scanning that improves detail rendering | Mamba | | |
🔬 Pillar 6: Video Extraction (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks | Surveys large models for multimodal spatial reasoning and builds open benchmarks for evaluation | egocentric spatial relationship embodied AI | ✅ | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer | Proposes Brain-IT, which reconstructs images from fMRI via a Brain-Interaction Transformer and improves the realism of the reconstructed images | interaction transformer | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation | Proposes a diffusion-driven progressive target manipulation method to address domain discrepancy in source-free domain adaptation | manipulation | | |