cs.CV (2025-05-28)

📊 63 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (21 🔗4) | Pillar 2: RL & Architecture (21 🔗8) | Pillar 3: Perception & Semantics (12 🔗5) | Pillar 6: Video Extraction (3) | Pillar 1: Robot Control (3 🔗2) | Pillar 5: Interaction & Reaction (1) | Pillar 4: Generative Motion (1) | Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (21 papers)

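# | Title | One-line summary | Tags | 🔗
1 | Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation | Proposes the Look & Mark strategy, which uses eye fixations and bounding boxes to improve the quality of chest X-ray report generation | large language model, multimodal
2 | Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs | Farm-LightSeek: an edge-computing-driven multimodal agricultural IoT data analytics framework built on lightweight LLMs | large language model, multimodal
3 | Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs | Proposes a multidimensional benchmark for evaluating the visual perception of multimodal LLMs | large language model, multimodal
4 | Zero-Shot 3D Visual Grounding from Vision-Language Models | Proposes SeeGround, which uses 2D vision-language models for zero-shot 3D visual grounding | visual grounding
5 | HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer | HiDream-I1: an efficient image generation foundation model based on a sparse Diffusion Transformer | foundation model
6 | 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model | Proposes 3DLLM-Mem for modeling long-term spatial-temporal memory in embodied 3D large language models | large language model
7 | YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction | YH-MINER: a multimodal intelligent system for extracting natural ecological reef metrics | multimodal
8 | MObyGaze: a film dataset of multimodal objectification densely annotated by experts | Presents the MObyGaze film dataset for analyzing and quantifying multimodal objectification | multimodal
9 | AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring | AquaMonitor: a multimodal, multi-view image sequence dataset for monitoring aquatic invertebrate biodiversity | multimodal
10 | VidText: Towards Comprehensive Evaluation for Video Text Understanding | Proposes the VidText benchmark for comprehensive evaluation of video text understanding, filling a gap in existing video understanding benchmarks | multimodal, chain-of-thought
11 | Thinking with Generated Images | Proposes a visual reasoning approach based on generated images to improve large models' cognitive ability on complex visual tasks | multimodal, chain-of-thought
12 | CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction | Proposes the ReCAD framework, which automatically detects and corrects errors in CAD programs to improve 3D object design quality | large language model, multimodal
13 | Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation | Proposes Cross-modal RAG to address fine-grained knowledge retrieval augmentation in text-to-image generation | large language model, multimodal
14 | OSPO: Object-centric Self-improving Preference Optimization for Text-to-Image Generation | Proposes OSPO, an object-centric self-improving preference optimization method that addresses object hallucination in text-to-image generation | large language model, multimodal
15 | Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task | Proposes the MultiStAR benchmark for evaluating abstract visual reasoning | large language model, multimodal
16 | EdgeVidSum: Real-Time Personalized Video Summarization at the Edge | EdgeVidSum: a lightweight method for real-time personalized video summarization on edge devices | TAMP
17 | MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking | MAC-Gaze: a motion-aware continual calibration method for mobile gaze tracking | multimodal
18 | Zero-Shot Vision Encoder Grafting via LLM Surrogates | Grafts vision encoders zero-shot via LLM surrogates, reducing VLM training cost | large language model
19 | Sherlock: Self-Correcting Reasoning in Vision-Language Models | Sherlock: a self-correction-based training framework for vision-language models that improves performance on complex reasoning tasks | multimodal
20 | Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation | Proposes the MultiTalk framework for audio-driven video generation in multi-person conversation scenes | instruction following
21 | Universal Visuo-Tactile Video Understanding for Embodied Interaction | Proposes VTV-LLM to address the insufficient integration of tactile information | large language model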

🔬 Pillar 2: RL & Architecture (21 papers)

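# | Title | One-line summary | Tags | 🔗
22 | SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning | Proposes SAM-R1, which uses reinforcement learning and SAM to improve reasoning in multimodal image segmentation | reinforcement learning, multimodal
23 | Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization | Proposes Rhet2Pix, which tackles rhetorical text-to-image generation through two-layer diffusion policy optimization | diffusion policy, large language model, multimodal
24 | SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection | Proposes SemIRNet, which uses knowledge fusion and cross-modal similarity detection to improve multimodal sarcasm recognition accuracy | contrastive learning, multimodal
25 | OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning | OmniAD: detects and understands industrial anomalies through multimodal reasoning | reinforcement learning, multimodal
26 | Research on Driving Scenario Technology Based on Multimodal Large Lauguage Model Optimization | Proposes a multimodal large model optimization method to improve scene perception for autonomous driving | distillation, multimodal
27 | IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction | VIMTS: uses visual MAE for irregular multivariate time series prediction, improving robustness to missing data | masked autoencoder, MAE, foundation model
28 | RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting | RiverMamba: uses a state space model for global river discharge and flood forecasting | Mamba, state space model
29 | Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs | Proposes Context-to-Cue DPO to address hallucination in multi-image MLLMs and improve multimodal understanding | DPO, direct preference optimization, large language model
30 | GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control | GeoDrive: a driving world model that incorporates 3D geometry for precise action control | world model, geometric consistency
31 | cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning | Cadrille: a multimodal CAD reconstruction model based on online reinforcement learning that generates more accurate 3D models | reinforcement learning, large language model
32 | Improving Contrastive Learning for Referring Expression Counting | Proposes the C-REX contrastive learning framework to improve discriminative representation learning for referring expression counting | representation learning, MAE, contrastive learning
33 | RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction | RICO: improves the accuracy and completeness of image recaptioning through visual reconstruction | DPO, large language model, multimodal
34 | Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation | Proposes SRRL, a self-reflective reinforcement learning algorithm that enables diffusion models to generate images with reasoning capability | reinforcement learning, chain-of-thought
35 | D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples | D-Fusion: aligns diffusion models through direct preference optimization with visually consistent samples | reinforcement learning, DPO, direct preference optimization
36 | CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation | Proposes the CAST framework, which improves semi-supervised instance segmentation through contrastive adaptation and distillation | distillation, foundation model
37 | Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers | Q-VDiT: an accurate quantization and distillation framework for video-generation Diffusion Transformers | distillation, spatiotemporal
38 | Learning World Models for Interactive Video Generation | Proposes VRAG, a world model for interactive long video generation via video retrieval-augmented generation | world model, spatiotemporal
39 | Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics | Proposes DAViD, a reinforcement-learning-based dynamic-aware video distillation method that optimizes the temporal resolution of video datasets | reinforcement learning, distillation
40 | StateSpaceDiffuser: Bringing Long Context to Diffusion World Models | Proposes StateSpaceDiffuser, which brings long-context modeling to diffusion world models | world model
41 | Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying | Proposes explicit hard negative gradient amplification to improve multimodal embedding learning | contrastive learning, large language model
42 | InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective | InfoSAM: fine-tunes SAM from an information-theoretic perspective to improve its domain-specific segmentation performance | distillation, foundation model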

🔬 Pillar 3: Perception & Semantics (12 papers)

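# | Title | One-line summary | Tags | 🔗
43 | Diffusion-Denoised Hyperspectral Gaussian Splatting | Proposes a diffusion-denoised hyperspectral Gaussian splatting method for 3D reconstruction of hyperspectral scenes | 3D gaussian splatting, 3DGS, gaussian splatting
44 | CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting | Proposes CLIPGaussian for universal, multimodal style transfer based on Gaussian splatting | gaussian splatting, splatting, multimodal
45 | Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss | Proposes the HDGS framework, which learns fine-grained geometry through a cascade depth loss to improve sparse-view splatting | monocular depth, 3D gaussian splatting, 3DGS
46 | A Survey on Training-free Open-Vocabulary Semantic Segmentation | Survey: research progress in training-free open-vocabulary semantic segmentation | open-vocabulary, open vocabulary, foundation model
47 | Learning Hierarchical Sparse Transform Coding of 3DGS | Proposes SHTC, a sparsity-guided hierarchical transform coding method for efficient compression of 3DGS models | 3D gaussian splatting, 3DGS, gaussian splatting
48 | Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation | Proposes MLMP for test-time adaptation of vision-language models in open-vocabulary semantic segmentation | open-vocabulary, open vocabulary
49 | Can NeRFs See without Cameras? | Proposes a NeRF based on multipath signals that reconstructs indoor environments without cameras | NeRF, neural radiance field
50 | Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs | Proposes the E3VQA benchmark and the M3CoT prompting method, which combine first- and third-person views to improve LVLM scene understanding | scene understanding, egocentric
51 | Task-Driven Implicit Representations for Automated Design of LiDAR Systems | Proposes a task-driven implicit representation approach for automated design of LiDAR systems | implicit representation
52 | MR.NAVI: Mixed-Reality Navigation Assistant for the Visually Impaired | MR.NAVI: a mixed-reality navigation assistant for the visually impaired | depth estimation, scene understanding
53 | SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding | Proposes SPIRAL, a semantic-aware progressive framework for LiDAR scene generation and understanding | semantic map
54 | On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation | Proposes GEM, a geometry-enhanced parameter-efficient fine-tuning method for 3D scene segmentation | scene understanding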

🔬 Pillar 6: Video Extraction (3 papers)

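# | Title | One-line summary | Tags | 🔗
55 | A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition | Proposes ProbRes, a jump-diffusion-based probabilistic residual search framework for open-world egocentric activity recognition | egocentric
56 | Fast Feature Matching of UAV Images via Matrix Band Reduction-based GPU Data Schedule | Proposes a GPU data scheduling algorithm based on matrix band reduction to accelerate UAV image feature matching | feature matching
57 | Event-based Egocentric Human Pose Estimation in Dynamic Environment | Proposes the D-EventEgo framework for event-camera-based egocentric human pose estimation in dynamic environments | egocentric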

🔬 Pillar 1: Robot Control (3 papers)

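# | Title | One-line summary | Tags | 🔗
58 | Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language | Proposes a vision-language tool selection framework based on low-dimensional attribute alignment for efficient, flexible tool selection | manipulation, multimodal
59 | ATI: Any Trajectory Instruction for Controllable Video Generation | Proposes a unified framework for trajectory instructions in controllable video generation | manipulation
60 | FaceEditTalker: Controllable Talking Head Generation with Facial Attribute Editing | Proposes FaceEditTalker to address controllable facial attribute editing | manipulation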

🔬 Pillar 5: Interaction & Reaction (1 paper)

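# | Title | One-line summary | Tags | 🔗
61 | Prototype Embedding Optimization for Human-Object Interaction Detection in Livestreaming | Proposes PeO-HOI, a prototype embedding optimization method that addresses object bias in HOI detection for livestreaming | human-object interaction, HOI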

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
62 | UniMoGen: Universal Motion Generation | UniMoGen: a universal, skeleton-agnostic diffusion model for motion generation | motion generation, character animation

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
63 | LatentMove: Towards Complex Human Movement Video Generation | LatentMove: a DiT framework for complex human movement video generation | human motion
