cs.CV (2025-07-29)

📊 31 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗9) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 3: Perception & Semantics (5 🔗3) · Pillar 6: Video Extraction (1) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1)

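🔬 Pillar 9: Embodied Foundation Models (17 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning	Introduces MMAT-1M, a large-scale multimodal agent-tuning reasoning dataset for improving the reasoning and tool-use abilities of multimodal large models.	large language model multimodal chain-of-thought	✅
2	Automated Label Placement on Maps via Large Language Models	Proposes an LLM-based method for automatic map label placement, addressing the inefficiency of manual labeling.	large language model foundation model	✅
3	ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval	ArtSeek: deep artwork understanding through multimodal in-context reasoning and late-interaction retrieval.	large language model multimodal	✅
4	Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs	Aether Weaver: proposes a multimodal affective narrative co-generation framework driven by dynamic scene graphs.	large language model multimodal
5	MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces	MAGE: strengthens multimodal alignment and generation by bridging the visual and semantic spaces.	large language model multimodal	✅
6	Meta CLIP 2: A Worldwide Scaling Recipe	Meta CLIP 2: proposes an effective recipe for scaling CLIP training worldwide.	large language model foundation model multimodal
7	Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment	Proposes an attention-based multimodal alignment network for long-term action quality assessment.	multimodal
8	Chain-of-Cooking:Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance	Proposes the Chain-of-Cooking model, which visualizes the cooking process through bidirectional CoT guidance.	chain-of-thought
9	From Waveforms to Pixels: A Survey on Audio-Visual Segmentation	Survey of audio-visual segmentation: a comprehensive review of problems, methods, and future trends.	foundation model multimodal
10	AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock	Survey paper: applications of deep learning to crops, fisheries, and livestock in agriculture.	foundation model multimodal	✅
11	EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO	EMIT: improves MLLM performance on industrial anomaly detection through difficulty-aware GRPO.	large language model multimodal
12	Temporally Consistent Unsupervised Segmentation for Mobile Robot Perception	Proposes Frontier-Seg for temporally consistent unsupervised terrain segmentation in mobile-robot video streams.	foundation model
13	CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding	CAPE: a CLIP-aware ensemble of complementary heatmap cues for embodied reference understanding.	multimodal
14	MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning	Proposes the MSGCoOp framework, which improves few-shot generalization through multiple semantic-guided context optimization.	large language model	✅
15	AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion	Proposes AU-LLM, the first use of an LLM for micro-expression action unit detection, with significantly improved performance.	large language model	✅
16	Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking	Proposes SSTrack, a self-supervised tracking framework that improves tracking performance through decoupled spatio-temporal consistency learning.	TAMP	✅
17	Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval	Proposes the DAC framework, which uses CLIP and MLLMs to strengthen open-set 3D object retrieval.	large language model	✅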

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
18	From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning	Proposes the S2E framework, which uses reinforcement learning to improve the interactivity and safety of navigation foundation models in real urban scenes.	reinforcement learning 3DGS foundation model
19	Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images	Cardiac-CLIP: a vision-language foundation model for 3D cardiac CT images.	representation learning masked autoencoder MAE
20	Multimodal Video Emotion Recognition with Reliable Reasoning Priors	Proposes a multimodal video emotion recognition framework based on reliable reasoning priors, improving performance under class imbalance.	contrastive learning multimodal
21	TARS: MinMax Token-Adaptive Preference Strategy for MLLM Hallucination Reduction	TARS: a min-max token-adaptive preference strategy for reducing hallucination in MLLMs.	DPO direct preference optimization large language model
22	SmartCLIP: Modular Vision-language Alignment with Identification Guarantees	Proposes SmartCLIP to address inconsistent information in vision-text alignment.	contrastive learning multimodal	✅
23	Cross-Architecture Distillation Made Simple with Redundancy Suppression	Proposes Redundancy Suppression Distillation (RSD), which simplifies cross-architecture knowledge distillation and improves efficiency.	distillation

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos	Ov3R: an open-vocabulary semantic 3D reconstruction framework for RGB videos.	open-vocabulary open vocabulary
25	TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras	TESPEC: a temporally enhanced self-supervised pretraining framework for event cameras that improves understanding of event data.	depth estimation monocular depth	✅
26	EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation	EIFNet: robust semantic segmentation through event-image fusion.	scene understanding
27	PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction	Proposes PanoSplatt3R to address unposed wide-baseline panorama reconstruction.	depth estimation	✅
28	Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection	Proposes SMFNet, a selective cross-modal fusion framework for RGB-D video salient object detection that makes effective use of motion and depth information.	optical flow	✅

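🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
29	Impact of Underwater Image Enhancement on Feature Matching	Proposes an evaluation framework for underwater image enhancement, improving feature-matching stability in applications such as underwater SLAM.	feature matching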

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
30	PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking	PRISM: programmatic reasoning with image sequence manipulation to jailbreak LVLMs.	manipulation

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
31	HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels	HunyuanWorld 1.0: proposes a new framework for generating immersive, explorable, and interactive 3D worlds from text or images.	geometric consistency
