cs.CV (2025-07-03)

📊 30 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (9 🔗2)
Pillar 3: Perception & Semantics (9 🔗4)
Pillar 9: Embodied Foundation Models (7 🔗2)
Pillar 6: Video Extraction (2 🔗1)
Pillar 1: Robot Control (1)
Pillar 8: Physics-based Animation (1)
Pillar 4: Generative Motion (1)

🔬 Pillar 2: RL & Architecture (9 papers)

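| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 1 | Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation | Proposes GCoT, which injects grounding information to improve the data efficiency of MLLMs on specialized vision tasks. | distillation, large language model, multimodal | |
| 2 | AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models | AIGI-Holmes: explainable and generalizable detection of AI-generated images via multimodal large language models. | direct preference optimization, large language model, multimodal | |
| 3 | Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach | Proposes a dynamic contrastive dual-path learning network with confidence-driven gradient modulation for multimodal human activity recognition. | contrastive learning, multimodal | |
| 4 | FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model | FMOcc: 3D occupancy prediction based on TPV and flow matching, improving prediction accuracy in few-frame settings. | flow matching, SSM, state space model | |
| 5 | Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics | Proposes MANO, a linear attention mechanism based on multipole expansion, for vision and physics simulation tasks. | linear attention, MANO | |
| 6 | Learning few-step posterior samplers by unfolding and distillation of diffusion models | Learns few-step posterior samplers by unfolding and distilling diffusion models. | distillation | |
| 7 | Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy | Proposes temporally-aware supervised contrastive learning to address polyp counting in colonoscopy. | contrastive learning | |
| 8 | Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection | Proposes a weakly-supervised contrastive learning method with quantity prompts for moving infrared small target detection. | contrastive learning | |
| 9 | Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation | Proposes a self-distillation method to handle partially visible cinematic language in video-to-audio generation. | distillation | |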

🔬 Pillar 3: Perception & Semantics (9 papers)

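| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 10 | HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars | Proposes HyperGaussians, an extension of 3D Gaussian Splatting for high-fidelity animatable face avatars. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 11 | LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling | LocalDyGS: multi-view global dynamic scene modeling via adaptive local implicit feature decoupling. | 3D gaussian splatting, gaussian splatting, splatting | |
| 12 | LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans | LiteReality: reconstructs interactive, graphics-ready 3D scenes from RGB-D scans. | scene reconstruction, scene understanding | |
| 13 | LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion | Proposes LangScene-X, which reconstructs generalizable 3D language-embedded scenes via TriMap video diffusion. | scene understanding, open-vocabulary, open vocabulary | |
| 14 | SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment | Proposes SIU3R, a framework for simultaneous scene understanding and 3D reconstruction that does not require feature alignment. | scene understanding | |
| 15 | MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details | MoGe-2: an accurate monocular geometry estimation model that recovers 3D point clouds of a scene with metric scale and sharp details. | MoGe | |
| 16 | Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory | Point3R: streaming 3D reconstruction using explicit spatial pointer memory. | scene reconstruction | |
| 17 | From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images | Proposes a SegFormer-based semantic segmentation method for assessing earthquake damage severity from social media images. | depth estimation | |
| 18 | Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images | Flow-CDNet: a novel network for detecting both slow and fast changes in bitemporal images. | optical flow | |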

🔬 Pillar 9: Embodied Foundation Models (7 papers)

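| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 19 | LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models | Proposes LaCo for efficient layer-wise compression of visual tokens in multimodal large language models. | large language model, multimodal | |
| 20 | SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement | SurgVisAgent: a multimodal agentic model for versatile surgical visual enhancement. | large language model, multimodal, chain-of-thought | |
| 21 | From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding | Proposes the HIVE framework, which uses multimodal narrative understanding to automatically edit long videos into engaging short clips. | large language model, multimodal | |
| 22 | Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection | Proposes the VisCo attack, which jailbreaks multimodal large language models through image-driven context injection. | large language model, multimodal | |
| 23 | Prompt learning with bounding box constraints for medical image segmentation | Proposes a prompt learning method with bounding box constraints for medical image segmentation. | foundation model, multimodal | |
| 24 | Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization | Proposes a prompt disentanglement method based on language guidance and representation alignment to improve domain generalization. | large language model, foundation model | |
| 25 | Intelligent Histology for Tumor Neurosurgery | Intelligent histology: combines AI with stimulated Raman histology to transform real-time intraoperative analysis in tumor neurosurgery. | foundation model, multimodal | |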

🔬 Pillar 6: Video Extraction (2 papers)

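| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 26 | No time to train! Training-Free Reference-Based Instance Segmentation | Proposes a training-free, reference-based instance segmentation method that uses semantic priors for efficient segmentation. | feature matching, foundation model | |
| 27 | CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios | Proposes the CrowdTrack dataset to address multi-pedestrian tracking in complex scenes. | first-person view, foundation model | |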

🔬 Pillar 1: Robot Control (1 paper)

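| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 28 | DexVLG: Dexterous Vision-Language-Grasp Model at Scale | DexVLG: a large-scale dexterous-hand vision-language-grasp model that enables instruction-driven, part-level grasping. | dexterous hand, flow matching, vision-language-action | |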

🔬 Pillar 8: Physics-based Animation (1 paper)

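| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 29 | USAD: End-to-End Human Activity Recognition via Diffusion Model with Spatiotemporal Attention | Proposes USAD, which uses a diffusion model with spatiotemporal attention for end-to-end human activity recognition. | spatiotemporal | |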

🔬 Pillar 4: Generative Motion (1 paper)

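| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 30 | Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning | Proposes an interaction reconstruction method based on appearance and proxemics reasoning, addressing human interaction pose estimation in complex scenes. | penetration, foundation model | |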
