cs.CV(2025-05-26)

📊 48 papers in total | 🔗 18 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (15 🔗4) · Pillar 9: Embodied Foundation Models (15 🔗3) · Pillar 3: Perception & Semantics (8 🔗4) · Pillar 1: Robot Control (3 🔗2) · Pillar 4: Generative Motion (2 🔗2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 6: Video Extraction (1 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 8: Physics-based Animation (1)

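🔬 Pillar 2: RL & Architecture (15 papers)

# | Title | Key point | Tags
1 | What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models | Proposes DICE, which uses multimodal large language models to evaluate instruction-guided image edits | distillation, large language model, multimodal
2 | From Data to Modeling: Fully Open-vocabulary Scene Graph Generation | Proposes OvSGTR for fully open-vocabulary scene graph generation, moving beyond the traditional closed-set limitation | distillation, open-vocabulary, open vocabulary
3 | Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | Proposes Vad-R1, which performs video anomaly reasoning through a perception-to-cognition chain of thought | reinforcement learning, large language model, multimodal
4 | MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models | MMGeoLM: improves fine-grained understanding of geometric scenes in large models through hard negative contrastive learning | contrastive learning, multimodal
5 | Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval | Proposes a multimodal reasoning agent to address error propagation in zero-shot composed image retrieval | contrastive learning, large language model, multimodal
6 | FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields | FruitNeRF++: generalized multi-fruit counting using contrastive learning and neural radiance fields | contrastive learning, neural radiance field, foundation model
7 | Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models | Improves medical image classification by fine-tuning models pretrained on natural-domain images | Mamba, MAE, foundation model
8 | ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | ViTaPEs: visuotactile position encodings for visual-tactile alignment in multimodal Transformers | representation learning, multimodal
9 | ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving | ReasonPlan: a unified scene prediction and decision reasoning framework for closed-loop autonomous driving | imitation learning, large language model, multimodal
10 | Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | Omni-R1: proposes a reinforcement-learning-based two-system collaboration framework to resolve the conflict between long-horizon and pixel-level understanding in omnimodal reasoning | reinforcement learning, foundation model, multimodal
11 | FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities | Proposes FUDOKI, a unified multimodal model based on discrete flow matching, for visual understanding and image generation | reinforcement learning, flow matching, large language model
12 | Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning | Proposes Ground-R1, which uses reinforcement learning to incentivize interpretable visual reasoning without additional annotations | reinforcement learning, chain-of-thought
13 | Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling | Explores applying training-free techniques to SDS-based text-to-3D generation to improve generation quality | distillation, classifier-free guidance
14 | Long-Context State-Space Video World Models | Proposes a long-horizon video world model based on state-space models to address long-range dependencies in video diffusion models | world model, SSM
15 | VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection | VisTA: a reinforcement learning framework for dynamic visual tool selection that improves visual reasoning | reinforcement learning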

🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | Key point | Tags
16 | CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic | Proposes CPathAgent to address interpretability in pathology image analysis | foundation model, multimodal
17 | Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion | Proposes PPAD, an MLLM-guided semantic correction diffusion model, to address semantic consistency in text-to-image generation | large language model, multimodal
18 | Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM | Proposes Dynamic-I2V, which uses a multimodal LLM to improve dynamics and controllability in image-to-video generation | large language model, multimodal
19 | StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation | StyleAR: customizes a multimodal autoregressive model for style-aligned text-to-image generation | multimodal, instruction following
20 | MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness | MMPerspective: the first comprehensive benchmark of perspective understanding in multimodal large language models | large language model, multimodal, chain-of-thought
21 | PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology | PathBench: a comprehensive benchmark for pathology foundation models, in support of precision oncology | foundation model
22 | AdaTP: Attention-Debiased Token Pruning for Video Large Language Models | AdaTP: attention-debiased token pruning for video large language models | large language model
23 | Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models | Analyzes confidence calibration in vision-language models and proposes a visual-confidence-aware prompting method | large language model, multimodal
24 | Efficient Multi-modal Long Context Learning for Training-free Adaptation | Proposes EMLoC, a training-free, efficient multimodal long-context learning method for task adaptation | large language model, multimodal
25 | DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data | DIPO: generates controllable articulated 3D objects using dual-state images and diverse data | chain-of-thought
26 | Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts | Proposes the Benign-to-Toxic jailbreak method, which uses benign prompts to induce harmful responses from large vision-language models | multimodal
27 | HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters | HunyuanVideo-Avatar: high-fidelity audio-driven human animation generation for multiple characters | multimodal
28 | Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks | Proposes the CoP benchmark dataset for video-to-piano music generation, supporting alignment of chain-of-perform steps | multimodal
29 | Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models | Proposes AVSD, an atomic visual skills dataset, for evaluating vision-language models on basic geometric tasks | multimodal
30 | NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification | Proposes the NEXT framework, which tackles multi-modal object ReID with a text-modulated multi-grained mixture of experts | large language model

🔬 Pillar 3: Perception & Semantics (8 papers)

# | Title | Key point | Tags
31 | CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting | Proposes CCL-LGS, which addresses cross-view semantic inconsistency in 3D language Gaussian splatting through contrastive codebook learning | gaussian splatting, splatting
32 | OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | OB3D: a Blender-synthesized dataset for omnidirectional 3D reconstruction, focused on geometric distortion challenges | 3D gaussian splatting, 3DGS, gaussian splatting
33 | Sparse2DGS: Sparse-View Surface Reconstruction using 2D Gaussian Splatting with Dense Point Cloud | Sparse2DGS: sparse-view surface reconstruction using 2D Gaussian splatting augmented with dense point clouds | gaussian splatting, splatting
34 | Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction | Proposes depth-guided bundle sampling to accelerate generalizable neural radiance field reconstruction | NeRF, neural radiance field
35 | Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting | Proposes Total-Editing for head avatars with editable appearance, motion, and lighting | neural radiance field, geometric consistency, spatiotemporal
36 | GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis | Proposes GoLF-NRT, which integrates global context and local geometry for few-shot view synthesis | NeRF, neural radiance field, scene reconstruction
37 | Weather-Magician: Reconstruction and Rendering Framework for 4D Weather Synthesis In Real Time | A Gaussian-splatting-based reconstruction and rendering framework for real-time 4D weather synthesis | gaussian splatting, splatting
38 | ErpGS: Equirectangular Image Rendering enhanced with 3D Gaussian Regularization | ErpGS: a panoramic image rendering method based on 3D Gaussian regularization | 3DGS, NeRF

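🔬 Pillar 1: Robot Control (3 papers)

# | Title | Key point | Tags
39 | In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation | In-Context Brush: a zero-shot customized subject insertion method based on in-context learning | manipulation, multimodal
40 | ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image | ControlTac: proposes a force- and position-controlled tactile data augmentation framework to address the difficulty of acquiring tactile data | manipulation, physically plausible
41 | Attention! Your Vision Language Model Could Be Maliciously Manipulated | Proposes VMA, a controllable adversarial attack method targeting vision-language models | manipulation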

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | Key point | Tags
42 | PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation | Proposes PAMD, a diffusion model for long dance generation that accounts for physical plausibility | motion diffusion model, motion diffusion, physically plausible
43 | MotionPro: A Precise Motion Controller for Image-to-Video Generation | MotionPro: a precise motion controller for image-to-video generation | motion synthesis

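🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | Key point | Tags
44 | Agentic 3D Scene Generation with Spatially Contextualized VLMs | Proposes an agentic 3D scene generation framework that uses spatial context to strengthen VLM understanding and editing in 3D environments | spatial relationship, embodied AI, multimodal
45 | VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | VLM-3R: enhances vision-language models through 3D reconstruction instruction tuning for 3D spatial understanding from monocular video | spatial relationship, multimodal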

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | Key point | Tags
46 | AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models | AniCrafter: customizes realistic human animation via avatar-background conditioning in video diffusion models | SMPL, SMPL-X, human motion

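🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point | Tags
47 | Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset | Proposes the Electrolyzers-HSI hyperspectral image dataset to accelerate research on electrolyzer material recycling and classification | HSI, multimodal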

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | Key point | Tags
48 | Structured Initialization for Vision Transformers | Proposes a structured initialization method that improves ViT generalization on small datasets | PULSE
