cs.CV(2025-09-04)

📊 29 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 2: RL & Architecture (9 🔗3) · Pillar 1: Robot Control (3 🔗1) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

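# | Title | Key point | Tags | 🔗
1 | Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios | Proposes an image-oriented self-adaptive dataset construction method to address the challenges of real-world multimodal safety scenarios | large language model, multimodal |
2 | Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model | UniPic 2.0: builds a Kontext model with online reinforcement learning for unified multimodal image generation and editing | multimodal, instruction following |
3 | TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection | Proposes TRUST-VL, an explainable multimodal news assistant for detecting general multimodal misinformation | multimodal |
4 | SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation | Proposes SliceSemOcc to address insufficient height information in 3D semantic occupancy prediction | multimodal |
5 | Promptception: How Sensitive Are Large Multimodal Models to Prompts? | The Promptception framework reveals how sensitive large multimodal models are to prompts and proposes optimization principles | multimodal |
6 | Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection | Proposes MMChange, a multimodal remote sensing change detection network that fuses images with text difference enhancement | multimodal |
7 | A Generative Foundation Model for Chest Radiography | ChexGen: a generative foundation model for chest radiographs that improves the performance and fairness of medical AI | foundation model |
8 | Efficient Odd-One-Out Anomaly Detection | Proposes an efficient DINO-based odd-one-out anomaly detection model that substantially reduces parameter count and training time while maintaining performance | large language model, multimodal |
9 | ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning | Proposes ANTS, which uses MLLM understanding and reasoning to adaptively shape the negative textual space and improve OOD detection performance | large language model, multimodal |
10 | VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision | Proposes VisioFirm to address low annotation efficiency in computer vision | foundation model |
11 | SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation | SPECS: a specificity-enhanced CLIP-Score for evaluating long image captions | large language model |
12 | Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems | Reveals a blind spot of vision-language models across writing systems: fragility on text that is visible yet unreadable | multimodal |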

🔬 Pillar 2: RL & Architecture (9 papers)

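# | Title | Key point | Tags | 🔗
13 | PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting | Proposes PromptEnhancer, which enhances text-to-image generation models through chain-of-thought prompt rewriting | reinforcement learning, chain-of-thought |
14 | VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation | VCMamba: combines convolutions with multi-directional Mamba for efficient visual representation | Mamba, SSM, state space model |
15 | SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification | Proposes Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for histopathology whole slide image classification | SAC, spatial relationship |
16 | OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction | OccTENS: a controllable, efficient 3D occupancy world model via temporal next-scale prediction | world model, spatial relationship |
17 | 3D and 4D World Modeling: A Survey | A comprehensive survey of 3D and 4D world modeling and generation that fills the gap in systematic study of the field | world model, occupancy grid |
18 | Guideline-Consistent Segmentation via Multi-Agent Refinement | Proposes a training-free semantic segmentation framework based on multi-agent iterative refinement to achieve guideline-consistent segmentation | reinforcement learning, open-vocabulary, open vocabulary |
19 | Few-step Flow for 3D Generation via Marginal-Data Transport Distillation | Proposes MDT-dist to address the sampling efficiency of 3D generation models | distillation |
20 | MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition | Proposes the MICACL framework to address class imbalance and spatiotemporal modeling challenges in long-tailed dynamic facial expression recognition | contrastive learning |
21 | Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection | Proposes FocusMamba, which achieves efficient object detection through RGB-Event collaborative token sparsification | Mamba, multimodal |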

🔬 Pillar 1: Robot Control (3 papers)

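# | Title | Key point | Tags | 🔗
22 | DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset | DVS-PedX: a synthetic-and-real dataset for event-camera pedestrian detection and intention analysis | sim-to-real, multimodal |
23 | Human Motion Video Generation: A Survey | A comprehensive survey of human motion video generation covering key stages, modalities, and applications of large language models | motion planning, large language model |
24 | Weakly-Supervised Learning of Dense Functional Correspondences | Proposes a weakly supervised method for learning dense functional correspondences, improving cross-category image matching | manipulation, contrastive learning |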

🔬 Pillar 3: Perception & Semantics (2 papers)

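# | Title | Key point | Tags | 🔗
25 | SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer | Proposes SSGaussian, which achieves 3D style transfer through semantic awareness and structure preservation | 3D gaussian splatting, gaussian splatting, splatting |
26 | LMVC: An End-to-End Learned Multiview Video Coding Framework | Proposes LMVC, an end-to-end multiview video coding framework that improves compression efficiency while ensuring compatibility | scene reconstruction |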

🔬 Pillar 8: Physics-based Animation (2 papers)

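# | Title | Key point | Tags | 🔗
27 | Aesthetic Image Captioning with Saliency Enhanced MLLMs | Proposes ASE-MLLM, which uses saliency-enhanced multimodal large language models to improve aesthetic image caption generation | ASE, large language model, multimodal |
28 | EGTM: Event-guided Efficient Turbulence Mitigation | Proposes the event-camera-based EGTM framework, which efficiently mitigates atmospheric turbulence for high-quality image restoration | spatiotemporal |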

🔬 Pillar 6: Video Extraction (1 paper)

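# | Title | Key point | Tags | 🔗
29 | Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking | Proposes a local-to-global image retrieval framework that combines efficient local search with effective global re-ranking, significantly improving retrieval performance | feature matching |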
