cs.CV (2025-09-29)

📊 60 papers in total | 🔗 23 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (24 🔗8) | Pillar 2: RL & Architecture (14 🔗6) | Pillar 3: Perception & Semantics (13 🔗6) | Pillar 1: Robot Control (3 🔗1) | Pillar 8: Physics-based Animation (3) | Pillar 4: Generative Motion (1) | Pillar 7: Motion Retargeting (1 🔗1) | Pillar 6: Video Extraction (1 🔗1)

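🔬 Pillar 9: Embodied Foundation Models (24 papers)

#	Title	One-Sentence Summary	Tags	🔗
1	FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology	FishNet++: evaluates the capabilities of multimodal large language models in marine biology and builds a large-scale multimodal benchmark.	large language model multimodal
2	MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment	Proposes the MMRQA framework, which combines signal processing with multimodal large language models for MRI quality assessment.	large language model multimodal
3	Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation	Proposes the ViPET-ReportGen dataset and benchmarks to improve vision-language foundation models on Vietnamese PET/CT report generation.	foundation model multimodal	✅
4	LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models	LLM-RG: uses large language models for referential grounding in outdoor scenarios.	large language model chain-of-thought
5	GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs	GHOST: a hallucination-inducing image generation method for stress-testing multimodal LLMs.	large language model multimodal
6	Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding	Proposes Layer Contrastive Decoding (LayerCD) to mitigate hallucination in multimodal large language models.	large language model multimodal	✅
7	OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding	Proposes the OIG-Bench benchmark for evaluating how well multimodal large language models understand One-Image Guides.	large language model multimodal	✅
8	Vision Function Layer in Multimodal LLMs	Reveals vision function layers in multimodal LLMs, enabling efficient and customizable visual capabilities.	large language model multimodal
9	Multimodal Arabic Captioning with Interpretable Visual Concept Integration	VLCAP: a multimodal Arabic image captioning framework with interpretable visual concept integration.	multimodal
10	VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning	VideoAnchor: reinforces subspace-structured visual cues for coherent visual-spatial reasoning.	large language model multimodal visual grounding	✅
11	A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration	Proposes the FFDP framework, enabling multimodal image registration at an unprecedented gigavoxel scale.	multimodal
12	Robust Multimodal Semantic Segmentation with Balanced Modality Contributions	Proposes EQUISeg, which improves the robustness of multimodal semantic segmentation by balancing modality contributions.	multimodal
13	Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models	Proposes Uni-X, which mitigates modality conflict in unified multimodal models with a two-end-separated architecture.	multimodal	✅
14	Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection	Proposes the Forensic-Chat framework to improve the generalization and explainability of multimodal large language models in fake image detection.	large language model multimodal
15	PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images	PixelCraft: a multi-agent system for high-fidelity visual reasoning on structured images.	large language model multimodal	✅
16	VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning	Proposes the VT-FSL framework, which uses LLMs to bridge vision and text and improve few-shot learning performance.	large language model multimodal	✅
17	Environment-Aware Satellite Image Generation with Diffusion Models	Proposes an environment-aware diffusion model for generating high-quality, environment-relevant satellite images.	foundation model multimodal
18	FreeRet: MLLMs as Training-Free Retrievers	FreeRet: uses MLLMs as training-free retrievers for strong multimodal retrieval.	large language model multimodal
19	Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks	Proposes the Euclid30K dataset and fine-tunes vision-language models on it, significantly improving their spatial perception and reasoning.	large language model multimodal	✅
20	UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark	UI2V-Bench: an understanding-based image-to-video generation benchmark focused on semantic understanding and reasoning.	large language model multimodal
21	VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models	VISOR++: a behavior-steering method for vision-language models based on universal visual inputs.	multimodal
22	Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents	CogniGPT: interactive multi-granular clue exploration that improves the efficiency and reliability of long video understanding.	large language model
23	Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models	Proposes a training-free token pruning method to reduce the inference cost of vision-language models.	multimodal
24	Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency	Proposes QL-Adapter to address quantity and layout consistency in multi-object image editing.	instruction following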

🔬 Pillar 2: RL & Architecture (14 papers)

#	Title	One-Sentence Summary	Tags	🔗
25	LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model	LUMA: a text-to-motion diffusion model with low-dimension unified motion alignment based on dual-path anchoring.	contrastive learning motion diffusion model motion diffusion
26	Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy	Vid-LLM: a compact video-based 3D multimodal LLM with reconstruction-reasoning synergy.	distillation metric depth scene understanding
27	DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation	Proposes DAM, which uses a multimodal foundation model for dual active learning in source-free domain adaptation.	distillation foundation model multimodal
28	BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation	Proposes BRIDGE, a reinforcement-learning-based depth-to-image generation engine for monocular depth estimation.	reinforcement learning depth estimation monocular depth	✅
29	VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding	VTPerception-R1: enhances multimodal reasoning through explicit visual and textual perceptual grounding.	reinforcement learning large language model multimodal	✅
30	Score Distillation of Flow Matching Models	Successfully applies score distillation to flow matching models for fast, high-quality image generation.	flow matching distillation	✅
31	Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs	Proposes the SCPO framework, which mitigates visual hallucinations in multimodal large language models through semantic curriculum preference optimization.	DPO direct preference optimization large language model
32	UI-UG: A Unified MLLM for UI Understanding and Generation	UI-UG: a unified multimodal large language model for UI understanding and generation.	DPO direct preference optimization large language model	✅
33	Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning	Geo-R1: unlocks geospatial reasoning in vision-language models through cross-view reinforcement learning.	reinforcement learning chain-of-thought	✅
34	Visual Jigsaw Post-Training Improves MLLMs	Visual Jigsaw: improves multimodal large language models through visual jigsaw post-training.	reinforcement learning large language model multimodal	✅
35	REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport	REALIGN: a procedural video alignment method based on regularized fused partial Gromov-Wasserstein optimal transport.	representation learning contrastive learning egocentric
36	Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning	Proposes an event-camera facial keypoint alignment method based on cross-modal fusion attention and self-supervised multi-event representation learning.	representation learning
37	Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks	Proposes knowledge distillation into two heterogeneous student networks for generalist multi-class anomaly detection.	distillation
38	Rolling Forcing: Autoregressive Long Video Diffusion in Real Time	Proposes Rolling Forcing for real-time autoregressive long video diffusion, significantly reducing error accumulation.	world model distillation

🔬 Pillar 3: Perception & Semantics (13 papers)

#	Title	One-Sentence Summary	Tags	🔗
39	GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction	GEM: an efficient and accurate cryo-EM reconstruction framework based on 3D Gaussian Splatting.	3D gaussian splatting 3DGS gaussian splatting	✅
40	CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D	CORE-3D: open-vocabulary 3D scene retrieval through 3D embeddings and context awareness.	scene understanding semantic mapping semantic map
41	Proxy-GS: Efficient 3D Gaussian Splatting via Proxy Mesh	Proxy-GS: efficient 3D Gaussian Splatting via a proxy mesh, improving rendering speed and quality.	3D gaussian splatting 3DGS gaussian splatting
42	Triangle Splatting+: Differentiable Rendering with Opaque Triangles	Triangle Splatting+: a differentiable rendering method based on opaque triangles for efficient mesh reconstruction and novel view synthesis.	3D gaussian splatting 3DGS gaussian splatting	✅
43	VGGT-X: When VGGT Meets Dense Novel View Synthesis	VGGT-X: targets dense novel view synthesis and improves 3D foundation model performance.	3DGS NeRF VGGT	✅
44	Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation	Proposes a classifier-centric adaptive framework that improves open-vocabulary camouflaged object segmentation.	open-vocabulary open vocabulary
45	GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification	GaussianLens: localized high-resolution reconstruction via on-demand Gaussian densification.	3D gaussian splatting 3DGS gaussian splatting
46	HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping	HBSplat: robust sparse-view Gaussian reconstruction with hybrid-loss guided depth and bidirectional warping.	depth estimation 3D gaussian splatting 3DGS	✅
47	DepthLM: Metric Depth From Vision Language Models	DepthLM: metric depth estimation with vision-language models, without modifying the architecture or loss function.	depth estimation metric depth
48	ExGS: Extreme 3D Gaussian Compression with Diffusion Priors	ExGS: extreme 3D Gaussian compression with diffusion priors while preserving high-quality rendering.	3D gaussian splatting 3DGS gaussian splatting	✅
49	LVT: Large-Scale Scene Reconstruction via Local View Transformers	Proposes Local View Transformers (LVT) for large-scale scene reconstruction and novel view synthesis.	scene reconstruction	✅
50	Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots	Proposes Social 3D Scene Graphs, which help interactive service robots understand human actions and relations.	scene understanding open-vocabulary open vocabulary
51	PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos	PAD3R: pose-aware dynamic 3D reconstruction from monocular videos.	scene understanding

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Sentence Summary	Tags	🔗
52	Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events	Proposes Fast Feature Field (F³) for predictive representation learning on event camera data.	quadruped depth estimation metric depth
53	SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics	SCOPE: Sim2Real category-level object pose estimation for robotics based on a semantically conditioned diffusion model.	manipulation sim2real	✅
54	NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding	NeoWorld: neural simulation of explorable virtual worlds via progressive 3D unfolding.	manipulation world model representation learning

🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-Sentence Summary	Tags
55	StreamForest: Efficient Online Video Understanding with Persistent Event Memory	Proposes StreamForest, which uses persistent event memory for efficient online video understanding.	spatiotemporal large language model multimodal
56	PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion	PanoWorld-X: generates explorable panoramic worlds via sphere-aware video diffusion.	spatiotemporal
57	PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement	Proposes PHASE-Net, which achieves efficient remote photoplethysmography measurement through a physics-grounded harmonic attention mechanism.	PULSE

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
58	LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation	Proposes LaMoGen to address expressive control in text-to-motion generation.	text-to-motion motion synthesis motion generation

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
59	DINOReg: Strong Point Cloud Registration with Vision Foundation Model	DINOReg: strong point cloud registration using a vision foundation model.	spatial relationship foundation model	✅

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
60	SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs	Proposes SpinBench to evaluate the spatial reasoning abilities of vision-language models.	egocentric	✅
