cs.CV(2025-09-04)

📊 30 papers in total | 🔗 9 with code

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 1: Robot Control (3 🔗1) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

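# | Title | One-line summary | Tags | 🔗
1 | Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios | Proposes an image-driven, self-adaptive dataset construction method to address the challenges of real-world multimodal safety scenarios. | large language model, multimodal |
2 | Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model | UniPic 2.0: builds a Kontext model with online reinforcement learning for unified multimodal image generation and editing. | multimodal, instruction following |
3 | TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection | Proposes TRUST-VL, an explainable multimodal news assistant for detecting general multimodal misinformation. | multimodal |
4 | SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation | SliceSemOcc: proposes a vertical-slice-based multimodal 3D semantic occupancy representation that improves recognition accuracy for small objects. | multimodal |
5 | Promptception: How Sensitive Are Large Multimodal Models to Prompts? | Promptception: reveals how sensitive large multimodal models are to prompts and proposes a robust evaluation framework. | multimodal |
6 | Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection | Proposes MMChange, a remote sensing change detection method that combines image and text modalities to improve accuracy and robustness. | multimodal |
7 | A Generative Foundation Model for Chest Radiography | ChexGen: a generative foundation model for chest radiographs that improves the performance and fairness of medical AI. | foundation model |
8 | Efficient Odd-One-Out Anomaly Detection | Proposes an efficient DINO-based odd-one-out anomaly detection model that significantly reduces parameter count and training time while maintaining performance. | large language model, multimodal |
9 | ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning | Proposes an adaptive negative textual space shaping method for OOD detection. | large language model, multimodal |
10 | VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision | VisioFirm: a cross-platform, AI-assisted annotation tool for computer vision. | foundation model |
11 | SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation | SPECS: a specificity-enhanced CLIP-Score for evaluating long image captions. | large language model |
12 | Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems | Reveals a blind spot of vision language models across writing systems: fragility on text that is visible yet unreadable. | multimodal |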

🔬 Pillar 2: RL & Architecture (10 papers)

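# | Title | One-line summary | Tags | 🔗
13 | WATCH: World-aware Allied Trajectory and pose reconstruction for Camera and Human | Proposes the WATCH framework for accurate reconstruction of global camera and human motion trajectories from monocular video. | world model, human motion, human motion reconstruction |
14 | PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting | Proposes PromptEnhancer, which enhances text-to-image generation models through chain-of-thought prompt rewriting. | reinforcement learning, chain-of-thought |
15 | VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation | VCMamba: fuses convolutions with multi-directional Mamba for efficient visual representation. | Mamba, SSM, state space model |
16 | SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification | Proposes SAC-MIL, which uses spatial-aware correlated multiple instance learning to classify histopathology whole slide images. | SAC, spatial relationship |
17 | OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction | OccTENS: controllable, efficient 3D occupancy world model generation via temporal next-scale prediction. | world model, spatial relationship |
18 | 3D and 4D World Modeling: A Survey | A comprehensive survey of 3D and 4D world modeling and generation that fills the gap in systematic study of this area. | world model, occupancy grid |
19 | Guideline-Consistent Segmentation via Multi-Agent Refinement | Proposes a multi-agent refinement framework to address guideline consistency in semantic segmentation. | reinforcement learning, open-vocabulary, open vocabulary |
20 | Few-step Flow for 3D Generation via Marginal-Data Transport Distillation | Proposes the MDT-dist framework, which accelerates sampling in 3D generative models through marginal-data transport distillation. | distillation |
21 | MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition | Proposes the MICACL framework to address class imbalance and spatiotemporal modeling in long-tailed dynamic facial expression recognition. | contrastive learning |
22 | Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection | Proposes FocusMamba, which achieves efficient object detection through RGB-Event collaborative token sparsification. | Mamba, multimodal |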

🔬 Pillar 1: Robot Control (3 papers)

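# | Title | One-line summary | Tags | 🔗
23 | Human Motion Video Generation: A Survey | A comprehensive survey of human motion video generation techniques, covering key stages and future trends. | motion planning, human motion, large language model |
24 | DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset | DVS-PedX: a synthetic-and-real dataset for event-camera pedestrian detection and intention analysis. | sim-to-real, multimodal |
25 | Weakly-Supervised Learning of Dense Functional Correspondences | Proposes a weakly supervised method for learning dense functional correspondences, improving cross-category image matching. | manipulation, contrastive learning |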

🔬 Pillar 3: Perception & Semantics (2 papers)

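# | Title | One-line summary | Tags | 🔗
26 | SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer | Proposes SSGaussian, which achieves 3D style transfer through semantic awareness and structure preservation. | 3D gaussian splatting, gaussian splatting, splatting |
27 | LMVC: An End-to-End Learned Multiview Video Coding Framework | Proposes the LMVC framework to improve the efficiency of multiview video coding. | scene reconstruction |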

🔬 Pillar 8: Physics-based Animation (2 papers)

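# | Title | One-line summary | Tags | 🔗
28 | Aesthetic Image Captioning with Saliency Enhanced MLLMs | Proposes an aesthetic-saliency-enhanced multimodal large language model for aesthetic image captioning. | ASE, large language model, multimodal |
29 | EGTM: Event-guided Efficient Turbulence Mitigation | Proposes EGTM, an event-camera-based framework for efficient restoration of images degraded by atmospheric turbulence. | spatiotemporal |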

🔬 Pillar 6: Video Extraction (1 paper)

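# | Title | One-line summary | Tags | 🔗
30 | Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking | Proposes a local-to-global image retrieval framework that combines efficient local search with effective global re-ranking, significantly improving retrieval performance. | feature matching |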
