cs.CV (2025-07-29)

📊 31 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗9) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 3: Perception & Semantics (5 🔗3) · Pillar 6: Video Extraction (1) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

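| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning | Introduces MMAT-1M, a large-scale multimodal agent-tuning reasoning dataset for improving the reasoning and tool-use abilities of multimodal large models. | large language model, multimodal, chain-of-thought | |
| 2 | Automated Label Placement on Maps via Large Language Models | Proposes an LLM-based method for automated map label placement, addressing the inefficiency of manual labeling. | large language model, foundation model | |
| 3 | ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval | ArtSeek: deep artwork understanding through multimodal in-context reasoning and late-interaction retrieval. | large language model, multimodal | |
| 4 | Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs | Aether Weaver: a framework for multimodal affective narrative co-generation driven by dynamic scene graphs. | large language model, multimodal | |
| 5 | MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces | MAGE: enhances multimodal alignment and generation by bridging visual and semantic spaces. | large language model, multimodal | |
| 6 | Meta CLIP 2: A Worldwide Scaling Recipe | Meta CLIP 2: an effective recipe for scaling CLIP training worldwide. | large language model, foundation model, multimodal | |
| 7 | Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment | Proposes an attention-based multimodal alignment network for long-term action quality assessment. | multimodal | |
| 8 | Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance | Proposes the Chain-of-Cooking model, which visualizes the cooking process through bidirectional CoT guidance. | chain-of-thought | |
| 9 | From Waveforms to Pixels: A Survey on Audio-Visual Segmentation | A survey of audio-visual segmentation: a comprehensive review of problems, methods, and future trends. | foundation model, multimodal | |
| 10 | AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock | A survey of deep learning applications in agriculture across crops, fisheries, and livestock. | foundation model, multimodal | |
| 11 | EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO | EMIT: improves MLLM performance on industrial anomaly detection through difficulty-aware GRPO. | large language model, multimodal | |
| 12 | Temporally Consistent Unsupervised Segmentation for Mobile Robot Perception | Proposes Frontier-Seg for temporally consistent unsupervised terrain segmentation in mobile-robot video streams. | foundation model | |
| 13 | CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding | CAPE: a CLIP-aware ensemble of complementary heatmap cues for embodied reference understanding. | multimodal | |
| 14 | MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning | Proposes the MSGCoOp framework, which improves few-shot generalization through multiple semantic-guided context optimization. | large language model | |
| 15 | AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion | Proposes AU-LLM, the first approach to use an LLM for micro-expression action unit detection, with significantly improved performance. | large language model | |
| 16 | Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking | Proposes SSTrack, a self-supervised tracking framework that improves tracking performance through decoupled spatio-temporal consistency learning. | TAMP | |
| 17 | Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval | Proposes the DAC framework, which uses CLIP and MLLMs to strengthen open-set 3D object retrieval. | large language model | |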

🔬 Pillar 2: RL & Architecture (6 papers)

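| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning | Proposes the S2E framework, which uses reinforcement learning to improve the interactivity and safety of navigation foundation models in real urban scenes. | reinforcement learning, 3DGS, foundation model | |
| 19 | Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images | Cardiac-CLIP: a vision-language foundation model for 3D cardiac CT images. | representation learning, masked autoencoder, MAE | |
| 20 | Multimodal Video Emotion Recognition with Reliable Reasoning Priors | Proposes a multimodal video emotion recognition framework based on reliable reasoning priors, improving performance under class imbalance. | contrastive learning, multimodal | |
| 21 | TARS: MinMax Token-Adaptive Preference Strategy for MLLM Hallucination Reduction | TARS: a MinMax token-adaptive preference strategy for reducing hallucination in MLLMs. | DPO, direct preference optimization, large language model | |
| 22 | SmartCLIP: Modular Vision-language Alignment with Identification Guarantees | Proposes SmartCLIP to address information misalignment between visual and textual representations. | contrastive learning, multimodal | |
| 23 | Cross-Architecture Distillation Made Simple with Redundancy Suppression | Proposes Redundancy Suppression Distillation (RSD), which simplifies cross-architecture knowledge distillation and improves efficiency. | distillation | |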

🔬 Pillar 3: Perception & Semantics (5 papers)

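| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos | Ov3R: an open-vocabulary semantic 3D reconstruction framework for RGB videos. | open-vocabulary, open vocabulary | |
| 25 | TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras | TESPEC: a temporally enhanced self-supervised pretraining framework for event cameras that improves event-data understanding. | depth estimation, monocular depth | |
| 26 | EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation | EIFNet: robust semantic segmentation through event-image fusion. | scene understanding | |
| 27 | PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction | Proposes PanoSplatt3R for unposed wide-baseline panorama reconstruction. | depth estimation | |
| 28 | Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection | Proposes SMFNet, a selective cross-modal fusion framework for RGB-D video salient object detection that makes effective use of motion and depth information. | optical flow | |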

🔬 Pillar 6: Video Extraction (1 paper)

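| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Impact of Underwater Image Enhancement on Feature Matching | Proposes an evaluation framework for underwater image enhancement to improve feature-matching stability in applications such as underwater SLAM. | feature matching | |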

🔬 Pillar 1: Robot Control (1 paper)

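| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking | PRISM: programmatic reasoning with image sequence manipulation for jailbreaking LVLMs. | manipulation | |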

🔬 Pillar 7: Motion Retargeting (1 paper)

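| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels | HunyuanWorld 1.0: a new framework for generating immersive, explorable, and interactive 3D worlds from text or images. | geometric consistency | |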
