cs.CV(2025-04-18)
📊 23 papers in total | 🔗 8 with code
🎯 Interest Area Navigation
Pillar 3: Perception & Semantics (8 🔗1)
Pillar 2: RL & Architecture (7 🔗4)
Pillar 9: Embodied Foundation Models (6 🔗2)
Pillar 1: Robot Control (1 🔗1)
Pillar 8: Physics-based Animation (1)
🔬 Pillar 3: Perception & Semantics (8 papers)
🔬 Pillar 2: RL & Architecture (7 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 9 | CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning | CheXWorld: builds a world model for radiographs to improve representation learning | world model representation learning foundation model | ✅ | |
| 10 | LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models | LoftUp: learns a coordinate-based feature upsampler to improve the pixel-level understanding of vision foundation models | distillation foundation model | ✅ | |
| 11 | Compile Scene Graphs with Reinforcement Learning | Proposes R1-SGG, which uses reinforcement learning to compile scene graphs, significantly improving multimodal large language models on scene graph generation. | reinforcement learning large language model multimodal | ✅ | |
| 12 | CytoFM: The first cytology foundation model | Proposes CytoFM, the first self-supervised pretrained model for cytology, improving cytology image analysis. | distillation foundation model | | |
| 13 | U-Shape Mamba: State Space Model for faster diffusion | Proposes U-Shape Mamba (USM), which speeds up diffusion models and improves image generation quality. | Mamba state space model | | |
| 14 | WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion | WeatherGen: proposes a unified framework based on Spider Mamba Diffusion for generating LiDAR point clouds under diverse weather | Mamba contrastive learning | ✅ | |
| 15 | VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment | VideoPASTA: improves the spatiotemporal understanding of video-language models through alignment on 7K preference pairs | direct preference optimization spatial relationship |
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | Chain-of-Evidence Multimodal Reasoning for Few-shot Temporal Action Localization | Proposes a chain-of-evidence multimodal reasoning method for few-shot temporal action localization. | large language model multimodal | ✅ | |
| 17 | Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation | Proposes Fashion-RAG, which performs multimodal fashion image editing via retrieval-augmented generation. | multimodal | | |
| 18 | SatelliteCalculator: A Multi-Task Vision Foundation Model for Quantitative Remote Sensing Inversion | Proposes SatelliteCalculator, a multi-task vision foundation model for quantitative remote sensing inversion | foundation model | | |
| 19 | Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety | Proposes a multi-agent vision-language system for zero-shot detection of novel hazardous objects in autonomous driving. | large language model multimodal | ✅ | |
| 20 | Zero-Shot Industrial Anomaly Segmentation with Image-Aware Prompt Generation | Proposes IAP-AS, which achieves zero-shot industrial anomaly segmentation through image-aware prompt generation. | large language model | | |
| 21 | Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction | Mono3R: uses monocular cues to strengthen geometric 3D reconstruction, improving performance in weakly textured and low-light scenes. | foundation model |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images | DanceText: a training-free layered framework for controllable multilingual text transformation in images. | manipulation | ✅ |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | Analysing the Robustness of Vision-Language-Models to Common Corruptions | Analyzes the robustness of vision-language models under common image corruptions and reveals a frequency bias in Transformers. | PULSE |