cs.CV (2025-10-13)

📊 52 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗8) · Pillar 3: Perception & Semantics (10 🔗2) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 6: Video Extraction (4 🔗1) · Pillar 4: Generative Motion (3) · Pillar 1: Robot Control (3) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (23 papers)

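#	Title	One-line summary	Tags	🔗
1	AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model	AndesVL: an efficient multimodal large language model for mobile devices that balances performance and efficiency	large language model multimodal	✅
2	InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models	InternSVG: unified handling of SVG tasks with multimodal large language models	large language model multimodal
3	FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models	Proposes FlexAC to enable flexible control of associative reasoning in multimodal large language models	large language model multimodal	✅
4	A Survey on Agentic Multimodal Large Language Models	Surveys agentic multimodal large language models, exploring their emergent intelligence and applications in dynamic environments	large language model multimodal	✅
5	BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models	BLEnD-Vis: a multimodal cultural-understanding benchmark that evaluates the robustness of vision-language models on cultural knowledge	multimodal visual grounding	✅
6	CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images	Proposes CodePlot-CoT, which tackles mathematical visual reasoning through chain-of-thought with code-driven images	large language model multimodal chain-of-thought	✅
7	ExpVid: A Benchmark for Experiment Video Understanding & Reasoning	ExpVid: a new benchmark for evaluating multimodal large language models on understanding and reasoning over scientific experiment videos	large language model multimodal visual grounding
8	MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis	Proposes MS-Mix, a sentiment-aware Mixup augmentation method that improves multimodal sentiment analysis	multimodal	✅
9	Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping	Benchmarks foundation models for hyperspectral image classification, applied to cereal crop type mapping	foundation model
10	How many samples to label for an application given a foundation model? Chest X-ray classification study	Studies how a pretrained model can reduce the number of labeled samples needed for chest X-ray classification	foundation model
11	A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images	Proposes a large-language-model-assisted framework for automated scale bar detection and extraction in scanning electron microscope images	large language model
12	CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation	CoPRS: learns a positional prior from chain-of-thought to improve the performance and interpretability of reasoning segmentation	chain-of-thought	✅
13	Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning	Proposes the SynTrans framework, which uses synergistic knowledge transfer from large multimodal models to improve few-shot learning	multimodal
14	Mixup Helps Understanding Multimodal Video Better	Proposes a multimodal Mixup method to address modality overfitting in multimodal video understanding	multimodal
15	IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment	Proposes IVEBench to address gaps in the evaluation of instruction-guided video editing	large language model multimodal
16	ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?	Proposes the ODI-Bench benchmark to test MLLMs on omnidirectional image understanding, along with the Omni-CoT method	large language model chain-of-thought
17	video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM	Proposes video-SALMONN S, which uses test-time training to strengthen the memory of a long-duration streaming audio-visual LLM	large language model multimodal
18	GIR-Bench: Versatile Benchmark for Generating Images with Reasoning	GIR-Bench: a comprehensive benchmark for evaluating the reasoning ability of image generation models	large language model multimodal	✅
19	COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models	Proposes COCO-Tree, which uses neurosymbolic concept trees to enhance compositional reasoning in vision-language models	large language model chain-of-thought
20	Bringing The Consistency Gap: Explicit Structured Memory for Interleaved Image-Text Generation	Proposes IUT-Plug, which uses explicit structured memory to address multimodal context drift in interleaved image-text generation	multimodal symbolic grounding
21	EvoCAD: Evolutionary CAD Code Generation with Vision Language Models	EvoCAD: generates CAD code with vision-language models and evolutionary algorithms	large language model
22	Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts	Proposes a two-stage framework with CLIP-SAM collaboration and cascaded prompts to improve zero-shot anomaly detection	foundation model
23	FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model	Proposes FG-CLIP 2 to improve bilingual (English-Chinese) fine-grained vision-language alignment	multimodal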

🔬 Pillar 3: Perception & Semantics (10 papers)

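#	Title	One-line summary	Tags	🔗
24	PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image	Proposes the PhySIC framework for reconstructing 3D human-scene interaction from a single image	monocular depth scene understanding physically plausible
25	VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment	VA-GS: enhances the geometric representation of Gaussian splatting via view alignment, improving surface reconstruction accuracy	3D gaussian splatting gaussian splatting splatting	✅
26	MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference	MaterialRefGS: a reflective Gaussian splatting method with multi-view consistent material inference	gaussian splatting splatting
27	DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects	DKPMV: dense keypoint fusion from multi-view RGB images for 6D pose estimation of textureless objects	6D pose estimation
28	Ev4DGS: Novel-view Rendering of Non-Rigid Objects from Monocular Event Streams	Ev4DGS: novel-view rendering of non-rigid objects from monocular event streams	3D gaussian splatting gaussian splatting splatting
29	Evaluating the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation	Studies the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation	depth estimation
30	A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation	Proposes a low-cost training data generation framework that uses diffusion models to improve urban semantic segmentation	scene understanding semantic map
31	SNAP: Towards Segmenting Anything in Any Point Cloud	SNAP: a general-purpose interactive point cloud segmentation model that supports multiple domains and multiple prompt types	open-vocabulary open vocabulary	✅
32	REACT3D: Recovering Articulations for Interactive Physical 3D Scenes	REACT3D: a framework for recovering articulated structures for interactive physical 3D scenes	scene understanding
33	mmWalk: Towards Multi-modal Multi-view Walking Assistance	mmWalk: a multi-modal, multi-view walking assistance dataset and benchmark for people who are blind or have low vision	scene understanding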

🔬 Pillar 2: RL & Architecture (8 papers)

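#	Title	One-line summary	Tags	🔗
34	High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation	Proposes a high-resolution spatiotemporal modeling method based on global-local state space models for video-based human pose estimation	Mamba state space model human motion
35	G2L:From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation	Proposes the G2L framework, which uses knowledge distillation to transfer the capabilities of a giga-scale pathology model to a smaller large-scale model, improving performance on cancer-specific tasks	distillation foundation model
36	Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning	Proposes Vlaser, which bridges the gap between VLM reasoning and VLA policy learning through synergistic embodied reasoning	policy learning vision-language-action VLA
37	Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos	Proposes a class-prototype-based contrastive learning method for classifying multi-label, fine-grained educational videos	contrastive learning multimodal	✅
38	Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning	Proposes the Chart-RVR framework, which uses reinforcement learning with verifiable rewards to improve the robustness and explainability of LVLMs in chart reasoning	reinforcement learning chain-of-thought
39	Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment	Proposes RALI, which aligns image and text representations via contrastive learning for efficient, general-purpose image quality assessment	reinforcement learning contrastive learning
40	Topological Alignment of Shared Vision-Language Embedding Space	Proposes ToMCLIP, which uses topological alignment to address cross-modal alignment bias in multilingual CLIP models	representation learning multimodal
41	Source-Free Object Detection with Detection Transformer	Proposes the FRANCK framework, which achieves source-free object detection for DETR through query-centric feature enhancement	contrastive learning distillation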

🔬 Pillar 6: Video Extraction (4 papers)

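#	Title	One-line summary	Tags	🔗
42	Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model	Proposes the Situat3DChange dataset for situated 3D scene-change understanding by multimodal large language models	egocentric large language model multimodal
43	FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding	FastHMR: accelerates human mesh recovery via token and layer merging with diffusion decoding	human mesh recovery HMR
44	ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training	ACE-G: improves the generalization of scene coordinate regression through query pre-training	feature matching
45	Robust Ego-Exo Correspondence with Long-Term Memory	Proposes LM-EEC, a long-term-memory-based framework that addresses viewpoint differences and occlusion in ego-exo correspondence	egocentric	✅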

🔬 Pillar 4: Generative Motion (3 papers)

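#	Title	One-line summary	Tags
46	MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps	Proposes a semantics-aware scene motion generation method based on motion maps (MoMaps) that predicts future 3D scene motion from a single image	motion generation motion prediction
47	Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers	Proposes Detail Guidance, which improves local detail synthesis by modulating massive activations in Diffusion Transformers	classifier-free guidance
48	LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference	LikePhys: evaluates intuitive physics understanding in video diffusion models via likelihood preference	physically plausible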

🔬 Pillar 1: Robot Control (3 papers)

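#	Title	One-line summary	Tags
49	Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View	Proposes a category-agnostic framework for estimating object pose, size, and shape from a single view	manipulation embodied AI foundation model
50	CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization	Proposes CoDefend, a collaborative defense for multimodal large models that combines diffusion purification and prompt optimization	manipulation large language model multimodal
51	Zero-shot Face Editing via ID-Attribute Decoupled Inversion	Proposes a zero-shot face editing method based on ID-attribute decoupled inversion that addresses identity preservation and structural consistency	manipulation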

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗
52	Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment	DiPro: multimodal disease progression modeling via spatiotemporal disentanglement and multiscale alignment	spatiotemporal multimodal
