cs.CV (2025-01-09)

📊 25 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 3: Perception & Semantics (6 🔗2) · Pillar 6: Video Extraction (3 🔗2) · Pillar 2: RL & Architecture (3 🔗1) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

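# | Title | One-line summary | Tags | 🔗
1 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | ReFocus: structured image understanding via a visual-editing chain of thought | large language model, multimodal, chain-of-thought
2 | Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark | Proposes EMMA, an enhanced multimodal reasoning benchmark that evaluates MLLMs on complex cross-modal reasoning | large language model, multimodal, chain-of-thought
3 | Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics | Atlas: a new pathology foundation model jointly proposed by Mayo Clinic, Charité, and Aignostics | foundation model
4 | CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models | Proposes CellViT++ for cell segmentation and classification in digital pathology | foundation model
5 | LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | LLaVA-Octopus: instruction-driven adaptive projector fusion for video understanding | large language model, multimodal
6 | V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer | Proposes V2C-CBM, which uses a vision-to-concept tokenizer to build efficient and interpretable concept bottleneck models | large language model, multimodal
7 | OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | OVO-Bench: proposes an online video understanding benchmark that evaluates the temporal awareness of video LLMs | TAMP
8 | Comparison Study: Glacier Calving Front Delineation in Synthetic Aperture Radar Images With Deep Learning | Compares deep learning models on glacier calving front delineation in SAR images and reveals the gap to human annotation | foundation model
9 | Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection | Uses large language models and vision-language models to strengthen robust out-of-distribution detection | large language model
10 | A Flexible and Scalable Framework for Video Moment Search | Proposes the SPR framework for efficient and flexible ranked video moment retrieval in long videos | TAMP
11 | Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments | Proposes the SwPC framework, which uses conformal prediction to improve the confidence and accuracy of VLMs in robotic scene recognition | large language model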

🔬 Pillar 3: Perception & Semantics (6 papers)

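# | Title | One-line summary | Tags | 🔗
12 | SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding | SEGS-SLAM: proposes structure-enhanced 3D Gaussian splatting SLAM that improves photorealistic mapping quality | 3D gaussian splatting, gaussian splatting, splatting
13 | Relative Pose Estimation through Affine Corrections of Monocular Depth Priors | Proposes relative pose estimation based on affine corrections of monocular depth priors, significantly improving pose accuracy | depth estimation, monocular depth, foundation model
14 | Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance | Arc2Avatar: generates realistic, controllable 3D avatars from a single image with ID guidance | 3D gaussian splatting, 3DGS, gaussian splatting
15 | A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision | Survey of deep learning-based depth estimation: systematically reviews monocular, stereo, and multi-view methods and analyzes datasets and evaluation metrics | depth estimation, scene understanding
16 | Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding | Proposes a CLIP-based dynamic scene understanding system that improves environment perception for autonomous driving | scene understanding
17 | Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes | Proposes light transport-aware diffusion posterior sampling for single-view reconstruction of 3D volumes | NeRF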

🔬 Pillar 6: Video Extraction (3 papers)

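# | Title | One-line summary | Tags | 🔗
18 | ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark | ECBench: proposes a holistic embodied cognition benchmark for evaluating how well multimodal foundation models understand egocentric environments | egocentric, foundation model
19 | Optimizing Multitask Industrial Processes with Predictive Action Guidance | Proposes the MMTFRU network, combined with the OAMU unit, to optimize operation guidance in multitask industrial processes | egocentric, multimodal
20 | Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning | Explores infant learning mechanisms: discovers hidden visual concepts beyond linguistic input | egocentric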

🔬 Pillar 2: RL & Architecture (3 papers)

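# | Title | One-line summary | Tags | 🔗
21 | MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification | Proposes MambaHSI, a Mamba-based joint spatial-spectral method for hyperspectral image classification | Mamba, HSI
22 | Consistent Flow Distillation for Text-to-3D Generation | Proposes Consistent Flow Distillation (CFD), improving the quality and diversity of text-to-3D generation | distillation
23 | FOCUS: Towards Universal Foreground Segmentation | Proposes the FOCUS framework for universal foreground segmentation, significantly improving multi-task performance | contrastive learning, distillation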

🔬 Pillar 4: Generative Motion (1 paper)

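# | Title | One-line summary | Tags | 🔗
24 | Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset | Proposes Motion-X++, a large-scale multimodal 3D whole-body human motion dataset that addresses the limitations of existing datasets in granularity and scale | motion generation, human mesh recovery, multimodal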

🔬 Pillar 8: Physics-based Animation (1 paper)

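# | Title | One-line summary | Tags | 🔗
25 | Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces | Proposes a progressive growing training method for video tokenizers that yields temporally compact latent spaces | spatiotemporal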
