cs.CV (2025-10-06)

📊 28 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (2) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

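| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior | Proposes the Pathology-CoT framework, which learns a visual chain-of-thought agent from expert WSI diagnosis behavior. | foundation model, chain-of-thought | |
| 2 | ActiveMark: on watermarking of visual foundation models via massive activations | ActiveMark: watermarks visual foundation models via massive activations to enable ownership verification. | foundation model | |
| 3 | A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification | Proposes a spatial-spectral-frequency interactive network (S²Fin) to improve multimodal remote sensing image classification accuracy. | multimodal | |
| 4 | Factuality Matters: When Image Generation and Editing Meet Structured Visuals | Proposes the StructBench benchmark and a unified model to address factuality in structured visual content generation and editing. | multimodal, chain-of-thought | |
| 5 | MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models | MedCLM: learns localization and reasoning in medical vision-language models through a CoT curriculum. | visual grounding, chain-of-thought | |
| 6 | VChain: Chain-of-Visual-Thought for Reasoning in Video Generation | VChain: a chain of visual thought for reasoning in video generation. | multimodal | |
| 7 | Character Mixing for Video Generation | Proposes the CCE and CCA frameworks for video generation in which characters from different worlds interact naturally. | multimodal | |
| 8 | Visual Representations inside the Language Model | Analyzes the visual representations inside multimodal large language models, revealing their perception bottlenecks and directions for improvement. | multimodal | |
| 9 | Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics | Proposes a Transformer-based framework that identifies people from conversational pose dynamics in natural interaction settings. | multimodal | |
| 10 | ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion | Proposes a blendshape-guided diffusion model for identity-preserving, precise expression generation. | foundation model | |
| 11 | Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting | VLMCountBench reveals significant failures of vision-language models on compositional counting tasks. | embodied AI | |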

🔬 Pillar 2: RL & Architecture (8 papers)

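| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 12 | Benchmark on Monocular Metric Depth Estimation in Wildlife Setting | Builds a benchmark for monocular depth estimation in wildlife settings and evaluates existing methods. | MAE, depth estimation, monocular depth | |
| 13 | Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models | The first survey of Video-LMM post-training: a deep dive into video reasoning with large multimodal models. | reinforcement learning, reward design, spatiotemporal | |
| 14 | Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction | Proposes an object-centric representation learning method to improve 3D scene graph prediction accuracy. | representation learning, open-vocabulary, open vocabulary | |
| 15 | Conditional Representation Learning for Customized Tasks | Proposes conditional representation learning (CRL) to extract image representations with the specific semantics a customized task needs. | representation learning, large language model | |
| 16 | A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation | Compares ViTs and CNNs on few-shot rigid transformation and fundamental matrix estimation. | contrastive learning, scene reconstruction, foundation model | |
| 17 | ERDE: Entropy-Regularized Distillation for Early-exit | Proposes ERDE, an entropy-regularized knowledge distillation method for early exit that improves image classification efficiency on edge devices. | distillation | |
| 18 | Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation | Proposes AT-BPTT, which improves dataset distillation performance through automatic inner-loop optimization. | distillation | |
| 19 | EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents | EduPersona: a benchmark dataset and evaluation framework for assessing the subjective abilities of virtual student agents. | teacher-student, large language model | |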

🔬 Pillar 3: Perception & Semantics (4 papers)

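| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 20 | Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction | Proposes the PG-Occ framework, which uses a progressive Gaussian Transformer for open-vocabulary 3D occupancy prediction. | scene understanding, open-vocabulary, open vocabulary | |
| 21 | Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning | Proposes an open-vocabulary learning method based on bounded distribution estimation that improves generalization by generating data for unseen classes. | open-vocabulary, open vocabulary | |
| 22 | See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models | Proposes a time-reversed scene reconstruction method based on visual language models that infers past scene states from thermal imaging traces. | scene reconstruction | |
| 23 | AvatarVTON: 4D Virtual Try-On for Animatable Avatars | AvatarVTON: proposes the first 4D virtual try-on framework for animatable avatars. | optical flow | |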

🔬 Pillar 1: Robot Control (2 papers)

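| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 24 | General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks | Proposes a visual goal-conditioned reinforcement learning method based on object-agnostic masks, improving generalization and efficiency. | sim-to-real, reinforcement learning, open-vocabulary | |
| 25 | Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization | Proposes an automated dual-robot scanning system for high-precision 3D digitization of cultural heritage. | manipulation, motion planning | |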

🔬 Pillar 6: Video Extraction (2 papers)

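| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 26 | Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors | EgoSurg: uses ambient sensors to reconstruct arbitrary-view egocentric replays of operating room workflows. | egocentric | |
| 27 | SegMASt3R: Geometry Grounded Segment Matching | SegMASt3R: uses a 3D foundation model for geometry-aware image segment matching. | feature matching, foundation model | |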

🔬 Pillar 5: Interaction & Reaction (1 paper)

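| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 28 | Read the Room: Inferring Social Context Through Dyadic Interaction Recognition in Cyber-physical-social Infrastructure Systems | Infers social context through dyadic interaction recognition in cyber-physical-social infrastructure systems. | dyadic interaction | |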
