cs.CV (2025-09-29)

📊 66 papers in total | 🔗 24 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (25 🔗8) | Pillar 2: RL & Architecture (16 🔗6) | Pillar 3: Perception & Semantics (15 🔗7) | Pillar 1: Robot Control (4 🔗1) | Pillar 8: Physics-based Animation (3) | Pillar 4: Generative Motion (1) | Pillar 7: Motion Retargeting (1 🔗1) | Pillar 6: Video Extraction (1 🔗1)

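🔬 Pillar 9: Embodied Foundation Models (25 papers)

#	Title	One-line summary	Tags	🔗
1	FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology	FishNet++ evaluates the capabilities of multimodal large language models in marine biology and builds a large-scale benchmark dataset.	large language model multimodal
2	MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment	Proposes the MMRQA framework, which fuses signal processing with multimodal large language models to improve MRI quality assessment.	large language model multimodal
3	Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation	Proposes the ViPET-ReportGen dataset and benchmarks to advance research on medical vision-language foundation models for Vietnamese PET/CT report generation.	foundation model multimodal	✅
4	EVLF-FM: Explainable Vision Language Foundation Model for Medicine	Proposes EVLF-FM, an explainable medical vision-language foundation model for multi-disease diagnosis and visual question answering.	foundation model multimodal visual grounding
5	LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models	LLM-RG uses large language models for referring-expression grounding in outdoor scenes.	large language model chain-of-thought
6	GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs	GHOST generates hallucination-inducing images to stress-test multimodal LLMs.	large language model multimodal
7	Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding	Proposes LayerCD, which mitigates hallucination in multimodal LLMs through layer contrastive decoding.	large language model multimodal	✅
8	OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding	OIG-Bench is a multi-agent annotated benchmark for multimodal understanding of one-image guides.	large language model multimodal	✅
9	Vision Function Layer in Multimodal LLMs	Reveals vision function layers in multimodal LLMs, enabling efficient and customizable models.	large language model multimodal
10	Multimodal Arabic Captioning with Interpretable Visual Concept Integration	VLCAP is a multimodal Arabic image-captioning framework that integrates interpretable visual concepts.	multimodal
11	VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning	VideoAnchor reinforces subspace-structured visual cues for coherent visual-spatial reasoning.	large language model multimodal visual grounding	✅
12	A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration	Proposes the FFDP framework, enabling multimodal image registration at an unprecedented gigavoxel scale.	multimodal
13	Robust Multimodal Semantic Segmentation with Balanced Modality Contributions	Proposes EQUISeg, which improves the robustness of multimodal semantic segmentation by balancing modality contributions.	multimodal
14	Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models	Proposes the Uni-X architecture, whose two-end-separated design mitigates modality conflict in unified multimodal models.	multimodal	✅
15	Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection	Proposes the Forensic-Chat framework to address the limited generalization and explainability of multimodal large language models in fake image detection.	large language model multimodal
16	PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images	PixelCraft is a multi-agent system for high-fidelity visual reasoning on structured images.	large language model multimodal	✅
17	VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning	Proposes the VT-FSL framework, which uses LLMs to bridge vision and text and improve few-shot learning performance.	large language model multimodal	✅
18	Environment-Aware Satellite Image Generation with Diffusion Models	Proposes an environment-aware diffusion model for generating high-quality, environmentally relevant satellite images.	foundation model multimodal
19	FreeRet: MLLMs as Training-Free Retrievers	Proposes the FreeRet framework for training-free multimodal retrieval.	large language model multimodal
20	Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks	Proposes the Euclid30K dataset and fine-tunes vision-language models on it, significantly improving their spatial perception and reasoning.	large language model multimodal	✅
21	UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark	Proposes UI2V-Bench to address semantic understanding in image-to-video generation.	large language model multimodal
22	VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models	VISOR++ steers the behavior of vision-language models using universal visual inputs.	multimodal
23	Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents	CogniGPT is an interactive multi-granular clue exploration framework for efficient long-video understanding.	large language model
24	Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models	Proposes a training-free token pruning method to reduce the inference cost of vision-language models.	multimodal
25	Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency	Proposes QL-Adapter to address quantity and layout consistency in multi-object image editing.	instruction following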

🔬 Pillar 2: RL & Architecture (16 papers)

#	Title	One-line summary	Tags	🔗
26	LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model	LUMA is a text-to-motion diffusion model with low-dimension unified motion alignment based on dual-path anchoring.	contrastive learning motion diffusion model motion diffusion
27	Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy	Vid-LLM is a compact video-based 3D multimodal LLM with reconstruction-reasoning synergy.	distillation metric depth scene understanding
28	DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation	Proposes DAM, which uses a multimodal foundation model for dual active learning in source-free domain adaptation.	distillation foundation model multimodal
29	BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation	Proposes BRIDGE, a reinforcement-learning-based depth-to-image generation engine for monocular depth estimation.	reinforcement learning depth estimation monocular depth	✅
30	VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding	VTPerception-R1 enhances multimodal reasoning through explicit visual and textual perceptual grounding.	reinforcement learning large language model multimodal	✅
31	Latent Visual Reasoning	Proposes Latent Visual Reasoning (LVR), which performs autoregressive reasoning in the visual embedding space and improves visual question answering performance.	reinforcement learning large language model multimodal
32	Score Distillation of Flow Matching Models	Successfully applies score distillation to flow matching models for fast, high-quality image generation.	flow matching distillation	✅
33	Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs	Proposes the SCPO framework, which mitigates visual hallucinations in multimodal large language models through semantic curriculum preference optimization.	DPO direct preference optimization large language model
34	UI-UG: A Unified MLLM for UI Understanding and Generation	UI-UG is a unified multimodal large language model for user interface understanding and generation.	DPO direct preference optimization large language model	✅
35	Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning	Geo-R1 unlocks geospatial reasoning in vision-language models through cross-view reinforcement learning.	reinforcement learning chain-of-thought	✅
36	Visual Jigsaw Post-Training Improves MLLMs	Visual Jigsaw improves multimodal large language models through visual jigsaw post-training.	reinforcement learning large language model multimodal	✅
37	Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis	Proposes AVF-MAE++, scalable audio-visual masked autoencoders for efficient affective video facial analysis, reaching SOTA on multiple benchmarks.	masked autoencoder MAE
38	REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport	REALIGN is a procedural video alignment method based on regularized fused partial Gromov-Wasserstein optimal transport.	representation learning contrastive learning egocentric
39	Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning	Proposes an event-camera facial keypoint alignment method based on cross-modal fusion attention and self-supervised multi-event representation learning.	representation learning
40	Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks	Proposes two heterogeneous student networks trained by knowledge distillation for generalist multi-class anomaly detection.	distillation
41	Rolling Forcing: Autoregressive Long Video Diffusion in Real Time	Proposes Rolling Forcing for real-time autoregressive long video diffusion, significantly reducing error accumulation.	world model distillation

🔬 Pillar 3: Perception & Semantics (15 papers)

#	Title	One-line summary	Tags	🔗
42	GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction	GEM is a 3D Gaussian splatting framework for efficient and accurate cryo-EM reconstruction.	3D gaussian splatting 3DGS gaussian splatting	✅
43	CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D	CORE-3D performs open-vocabulary retrieval through 3D embeddings and context awareness.	scene understanding semantic mapping semantic map
44	Proxy-GS: Efficient 3D Gaussian Splatting via Proxy Mesh	Proxy-GS uses a proxy mesh for efficient 3D Gaussian splatting, improving rendering speed and quality.	3D gaussian splatting 3DGS gaussian splatting
45	Triangle Splatting+: Differentiable Rendering with Opaque Triangles	Triangle Splatting+ proposes a differentiable rendering method based on opaque triangles for efficient mesh reconstruction and novel view synthesis.	3D gaussian splatting 3DGS gaussian splatting	✅
46	VGGT-X: When VGGT Meets Dense Novel View Synthesis	VGGT-X targets dense novel view synthesis and improves the performance of 3D foundation models.	3DGS NeRF VGGT	✅
47	Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation	Proposes a classifier-centric adaptive framework that improves open-vocabulary camouflaged object segmentation.	open-vocabulary open vocabulary
48	Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos	Forge4D proposes a feed-forward 4D human reconstruction and interpolation method for fast reconstruction and novel view synthesis from sparse-view videos.	optical flow motion prediction TAMP	✅
49	GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification	GaussianLens performs localized high-resolution reconstruction via on-demand Gaussian densification.	3D gaussian splatting 3DGS gaussian splatting
50	HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping	HBSplat performs robust sparse-view Gaussian reconstruction with hybrid-loss guided depth and bidirectional warping.	depth estimation 3D gaussian splatting 3DGS	✅
51	DepthLM: Metric Depth From Vision Language Models	DepthLM obtains metric depth estimates from vision-language models without modifying the architecture or loss function.	depth estimation metric depth
52	ExGS: Extreme 3D Gaussian Compression with Diffusion Priors	ExGS uses diffusion priors for extreme 3D Gaussian compression, combining high compression ratios with high-quality rendering.	3D gaussian splatting 3DGS gaussian splatting	✅
53	Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection	Proposes the TaSe framework, which improves language-based object detection by disentangling and hierarchically aggregating language representations.	open-vocabulary open vocabulary multimodal
54	LVT: Large-Scale Scene Reconstruction via Local View Transformers	Proposes the Local View Transformer (LVT) for large-scale scene reconstruction and novel view synthesis.	scene reconstruction	✅
55	Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots	Proposes Social 3D Scene Graphs to help interactive service robots understand human actions and relations.	scene understanding open-vocabulary open vocabulary
56	PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos	PAD3R performs pose-aware dynamic 3D reconstruction from monocular videos.	scene understanding

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗
57	Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events	Proposes Fast Feature Field (F³), a predictive representation of event-camera data for efficient scene understanding and motion estimation.	quadruped depth estimation metric depth
58	FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation	Proposes FreeAction, training-free techniques that improve the fidelity of robot videos in trajectory-to-video generation.	manipulation world model classifier-free guidance
59	SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics	SCOPE uses semantic conditioning for Sim2Real category-level object pose estimation in robotics.	manipulation sim2real	✅
60	NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding	NeoWorld performs neural simulation of explorable virtual worlds via progressive 3D unfolding.	manipulation world model representation learning

🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-line summary	Tags
61	StreamForest: Efficient Online Video Understanding with Persistent Event Memory	Proposes StreamForest, which uses persistent event memory for efficient online video understanding.	spatiotemporal large language model multimodal
62	PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion	PanoWorld-X generates explorable panoramic worlds via sphere-aware video diffusion.	spatiotemporal
63	PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement	Proposes the physics-grounded PHASE-Net for efficient and accurate remote photoplethysmography measurement.	PULSE

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
64	LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation	LaMoGen is a diffusion-based text-to-motion generation method guided by Laban movement analysis.	text-to-motion motion synthesis motion generation

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
65	DINOReg: Strong Point Cloud Registration with Vision Foundation Model	DINOReg uses a vision foundation model for strong point cloud registration.	spatial relationship foundation model	✅

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
66	SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs	Proposes SpinBench to evaluate spatial reasoning in vision-language models.	egocentric	✅

