cs.CV (2025-07-25)

📊 33 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗6)
Pillar 2: RL & Architecture (8 🔗4)
Pillar 3: Perception & Semantics (7 🔗2)
Pillar 8: Physics-based Animation (2)
Pillar 4: Generative Motion (1)
Pillar 6: Video Extraction (1)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

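# | Title | One-sentence summary | Tags | 🔗
1 | LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models | Proposes LISA, which mitigates hallucination in multimodal large language models through layer-wise integration and suppression. | large language model, multimodal, visual grounding
2 | Object-centric Video Question Answering with Visual Grounding and Referring | Proposes a VideoLLM for object-centric video question answering based on visual grounding and referring. | large language model, multimodal, visual grounding
3 | ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions | Proposes the ChartM$^3$ benchmark for evaluating chart editing under multimodal instructions, and builds a training set to improve model performance. | large language model, multimodal
4 | A Survey of Multimodal Hallucination Evaluation and Detection | Surveys methods for multimodal hallucination evaluation and detection, covering image-to-text and text-to-image generation tasks. | large language model, multimodal
5 | DeepJIVE: Learning Joint and Individual Variation Explained from Multimodal Data Using Deep Learning | DeepJIVE: a deep-learning method for explaining joint and individual variation in multimodal data. | multimodal
6 | BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving | BEV-LLM: uses multimodal BEV maps for scene captioning in autonomous driving. | multimodal
7 | BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection | BridgeNet: a unified multimodal framework for bridging 2D and 3D industrial anomaly detection. | multimodal
8 | Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding | Probes the neural encoding of multimodal fusion in the brain under naturalistic conditions, leveraging the dominance of audiovisual streams. | multimodal
9 | MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment | MedIQA: a scalable foundation model for prompt-driven medical image quality assessment. | foundation model
10 | MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents | MMBench-GUI: a hierarchical multi-platform evaluation framework for GUI agents, aimed at improving automation efficiency. | visual grounding
11 | Multistream Network for LiDAR and Camera-based 3D Object Detection in Outdoor Scenes | Proposes the MuStD network, which fuses LiDAR and RGB data to improve 3D object detection accuracy in outdoor scenes. | multimodal
12 | Closing the Modality Gap for Mixed Modality Search | Proposes GR-CLIP to eliminate CLIP's modality gap in mixed-modality search. | multimodal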

🔬 Pillar 2: RL & Architecture (8 papers)

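# | Title | One-sentence summary | Tags | 🔗
13 | Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception | Proposes video self-distillation for single-image encoders to improve physically plausible perception. | world model, distillation, optical flow
14 | Efficient Learning for Product Attributes with Compact Multimodal Models | Proposes a DPO-based semi-supervised fine-tuning method that improves the efficiency of compact multimodal models for e-commerce product attribute prediction. | DPO, direct preference optimization, multimodal
15 | PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction | Proposes PRE-MAP, a personalized, reinforcement-learning-based eye-tracking multimodal LLM for high-resolution multi-attribute gaze point prediction. | reinforcement learning, multimodal
16 | SP-Mamba: Spatial-Perception State Space Model for Unsupervised Medical Anomaly Detection | SP-Mamba: a spatial-perception state space model for unsupervised medical anomaly detection. | Mamba, state space model
17 | PatchTraj: Unified Time-Frequency Representation Learning via Dynamic Patches for Trajectory Prediction | PatchTraj: trajectory prediction via dynamic patches and joint time-frequency representation learning. | representation learning, egocentric, spatiotemporal
18 | MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition | Proposes the Multi-Granularity Hierarchical Fusion Transformer (MGHFT) for cross-modal sticker emotion recognition. | contrastive learning, large language model, multimodal
19 | GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution | GPSMamba: a Mamba-based infrared image super-resolution method combining global phase and spectral guidance. | Mamba, SSM
20 | Back to the Features: DINO as a Foundation for Video World Models | DINO-world: a general video world model built on DINOv2 features for future-frame prediction. | world model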

🔬 Pillar 3: Perception & Semantics (7 papers)

# | Title | One-sentence summary | Tags | 🔗
21 | OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models | Proposes OVFact for evaluating and improving the open-vocabulary factuality of long caption models. | open-vocabulary, open vocabulary, visual grounding
22 | DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations | DINO-SLAM: RGB-D SLAM enhanced with DINO features, for neural implicit and explicit representations. | 3D gaussian splatting, 3DGS, gaussian splatting
23 | GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting | GS-Occ3D: scalable vision-only occupancy reconstruction using Gaussian splatting. | gaussian splatting, splatting
24 | Fast Learning of Non-Cooperative Spacecraft 3D Models through Primitive Initialization | Proposes a CNN-based fast initialization method for 3DGS, for reconstructing 3D models of non-cooperative spacecraft. | 3D gaussian splatting, 3DGS, gaussian splatting
25 | Gaussian Set Surface Reconstruction through Per-Gaussian Optimization | Proposes GSSR, which achieves high-precision Gaussian set surface reconstruction through per-Gaussian optimization and improves scene editing capability. | 3D gaussian splatting, 3DGS, gaussian splatting
26 | DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering | DASH: 4D hash encoding with self-supervised decomposition for real-time dynamic scene rendering. | gaussian splatting, splatting, scene reconstruction
27 | ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment | ScenePainter: semantically consistent perpetual 3D scene generation via concept relation alignment. | scene reconstruction, ConceptGraphs

🔬 Pillar 8: Physics-based Animation (2 papers)

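# | Title | One-sentence summary | Tags | 🔗
28 | Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues | Proposes BF-PIP, which uses temporal video and multimodal cues for zero-shot pedestrian intention prediction. | spatiotemporal, multimodal
29 | CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing | CircuitProbe: dissects spatiotemporal visual semantics in LVLMs through circuit tracing. | spatiotemporal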

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
30 | PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups | Proposes PINO to address the complexity of multi-person interaction generation. | motion generation, penetration, multi-person interaction

🔬 Pillar 6: Video Extraction (1 paper)

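# | Title | One-sentence summary | Tags | 🔗
31 | Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching | Proposes CSTF, a cross spatial-temporal fusion attention mechanism, to address the difficulty of feature description in cross-modal remote sensing image matching. | feature matching, multimodal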

🔬 Pillar 5: Interaction & Reaction (1 paper)

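# | Title | One-sentence summary | Tags | 🔗
32 | Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene | Proposes an event-driven framework for generating dynamic stories with multi-person interaction in 3D scenes. | human-scene interaction, large language model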

🔬 Pillar 1: Robot Control (1 paper)

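# | Title | One-sentence summary | Tags | 🔗
33 | Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation | Proposes Face2VoiceSync to address lightweight, face-voice-synchronized generation in the text-driven setting. | manipulation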
