cs.CV (2025-10-06)

📊 28 papers in total | 🔗 7 with code

🎯 Browse by interest area

Pillar 9: Embodied Foundation Models (11 🔗3) | Pillar 2: RL Algorithms and Architecture (8 🔗3) | Pillar 3: Spatial Perception and Semantics (4 🔗1) | Pillar 1: Robot Control (2) | Pillar 6: Video Extraction and Matching (2) | Pillar 5: Interaction and Reaction (1)

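🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior	Proposes the Pathology-CoT framework, which learns a visual chain-of-thought agent from expert whole-slide-image (WSI) diagnosis behavior.	foundation model chain-of-thought
2	ActiveMark: on watermarking of visual foundation models via massive activations	ActiveMark: watermarks visual foundation models via massive activations to enable ownership verification.	foundation model
3	A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification	Proposes a Spatial-Spectral-Frequency Interactive Network (S²Fin) to improve multimodal remote sensing image classification accuracy.	multimodal	✅
4	Factuality Matters: When Image Generation and Editing Meet Structured Visuals	Proposes the StructBench benchmark and a unified model to address factuality in structured visual content generation and editing.	multimodal chain-of-thought
5	MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models	MedCLM: learns localization and reasoning in medical vision-language models through a CoT curriculum.	visual grounding chain-of-thought
6	VChain: Chain-of-Visual-Thought for Reasoning in Video Generation	VChain: a chain of visual thought for reasoning in video generation.	multimodal
7	Character Mixing for Video Generation	Proposes the CCE and CCA frameworks for generating videos in which characters from different worlds interact naturally.	multimodal	✅
8	Visual Representations inside the Language Model	Analyzes the visual representations inside multimodal large language models, revealing perception bottlenecks and directions for improvement.	multimodal
9	Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics	Proposes a Transformer-based framework that identifies people from their pose dynamics in conversation, for natural interaction scenarios.	multimodal
10	ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion	Proposes a blendshape-guided diffusion model for identity-preserving, precise expression generation.	foundation model	✅
11	Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting	VLMCountBench exposes significant failures of vision-language models on compositional counting tasks.	embodied AI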

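🔬 Pillar 2: RL Algorithms and Architecture (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
12	Benchmark on Monocular Metric Depth Estimation in Wildlife Setting	Builds a benchmark for monocular metric depth estimation in wildlife settings and evaluates existing methods.	MAE depth estimation monocular depth
13	Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models	The first survey of Video-LMM post-training: an in-depth look at video reasoning with large multimodal models.	reinforcement learning reward design spatiotemporal	✅
14	Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction	Proposes an object-centric representation learning method to improve 3D scene graph prediction accuracy.	representation learning open-vocabulary open vocabulary	✅
15	Conditional Representation Learning for Customized Tasks	Proposes Conditional Representation Learning (CRL) to extract image representations with task-specific semantics for customized tasks.	representation learning large language model	✅
16	A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation	Compares ViTs and CNNs on few-shot rigid transformation and fundamental matrix estimation.	contrastive learning scene reconstruction foundation model
17	ERDE: Entropy-Regularized Distillation for Early-exit	Proposes ERDE, an entropy-regularized knowledge distillation method for early-exit models that improves image classification efficiency on edge devices.	distillation
18	Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation	Proposes AT-BPTT, which improves dataset distillation performance through automatic inner-loop optimization.	distillation
19	EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents	EduPersona: a benchmark dataset and evaluation framework for assessing the subjective abilities of virtual student agents.	teacher-student large language model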

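🔬 Pillar 3: Spatial Perception and Semantics (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
20	Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction	Proposes the PG-Occ framework, which achieves open-vocabulary 3D occupancy prediction with a progressive Gaussian Transformer.	scene understanding open-vocabulary open vocabulary	✅
21	Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning	Proposes an open-vocabulary learning method based on bounded distribution estimation that improves generalization by generating data for unseen classes.	open-vocabulary open vocabulary
22	See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models	Proposes a time-reversed scene reconstruction method based on visual language models that infers past scene states from thermal traces.	scene reconstruction
23	AvatarVTON: 4D Virtual Try-On for Animatable Avatars	AvatarVTON: proposes the first 4D virtual try-on framework for animatable avatars.	optical flow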

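🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks	Proposes a visual goal-conditioned reinforcement learning method based on object-agnostic masks, improving generalization and efficiency.	sim-to-real reinforcement learning open-vocabulary
25	Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization	Proposes an automated dual-robot scanning system for high-precision 3D digitization of cultural heritage.	manipulation motion planning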

🔬 Pillar 6: Video Extraction and Matching (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors	EgoSurg: uses ambient sensors to reconstruct arbitrary-view egocentric replays of operating room workflows.	egocentric
27	SegMASt3R: Geometry Grounded Segment Matching	SegMASt3R: uses a 3D foundation model for geometry-aware image segment matching.	feature matching foundation model

🔬 Pillar 5: Interaction and Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	Read the Room: Inferring Social Context Through Dyadic Interaction Recognition in Cyber-physical-social Infrastructure Systems	Infers social context through dyadic interaction recognition in cyber-physical-social infrastructure systems.	dyadic interaction
