cs.CV (2025-11-07)

📊 27 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 3: Perception & Semantics (4) · Pillar 4: Generative Motion (3) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

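# | Title | One-line summary | Tags | 🔗
1 | Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments | MPRF: uses multimodal foundation models to tackle loop closure detection in severely unstructured environments | foundation model, multimodal
2 | Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical study on SAM2 Finetuning from Data Perspective | Studies how data characteristics affect SAM2 fine-tuning for ultrasound video segmentation, improving segmentation performance | foundation model
3 | VMDT: Decoding the Trustworthiness of Video Foundation Models | Proposes VMDT, the first unified platform for evaluating the trustworthiness of video foundation models, revealing shortcomings of existing models in safety, fairness, and other dimensions | foundation model
4 | $\mathbf{S^2LM}$: Towards Semantic Steganography via Large Language Models | Proposes S^2LM, which uses large language models for semantic image steganography, moving beyond traditional bit-level limits | large language model
5 | From Linear Probing to Joint-Weighted Token Hierarchy: A Foundation Model Bridging Global and Cellular Representations in Biomarker Detection | Proposes the JWTH model, which fuses global and cellular representations to improve AI-based pathology biomarker detection | foundation model
6 | A benchmark multimodal oro-dental dataset for large vision-language models | Builds a large-scale multimodal dental dataset to advance vision-language model applications in oral health | multimodal
7 | Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach | Role-SynthCLIP: a role-play-driven approach to diverse synthetic data that improves CLIP model performance | large language model, multimodal
8 | The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2 | Uses Copernicus satellite data for rapid, wide-area assessment of post-disaster building damage | foundation model
9 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | LiveStar: a live streaming assistant that achieves real-time online video understanding through adaptive streaming decoding | large language model
10 | Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings | Mitigates hallucinations in large vision-language models by refining textual embeddings | visual grounding
11 | GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder | Proposes GSE, a general sticker encoder for evaluating sticker visual semantic similarity, and builds the Triple-S benchmark dataset | multimodal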

🔬 Pillar 2: RL & Architecture (8 papers)

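# | Title | One-line summary | Tags | 🔗
12 | PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization | PreResQ-R1: fine-grained rank-and-score reinforcement learning for visual quality assessment via preference-response disentangled policy optimization | reinforcement learning, large language model, multimodal
13 | DeepEyesV2: Toward Agentic Multimodal Model | DeepEyesV2: an agentic multimodal model with improved tool-calling ability | reinforcement learning, multimodal
14 | Visual Spatial Tuning | Proposes the Visual Spatial Tuning (VST) framework to improve the spatial perception and reasoning of vision-language models (VLMs) | reinforcement learning, spatial relationship, vision-language-action
15 | Cross-domain EEG-based Emotion Recognition with Contrastive Learning | Proposes EmotionCLIP to address cross-domain EEG-based emotion recognition | contrastive learning, multimodal
16 | MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification | MUSE: a multi-scale dense self-distillation method for nucleus detection and classification | distillation, foundation model
17 | TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning | Proposes TimeSearch-R, which performs adaptive temporal search for long-form video understanding via self-verification reinforcement learning | reinforcement learning, Ego4D
18 | Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale | Proposes the Long Grounded Thoughts framework for synthesizing high-quality visual reasoning chain data at scale, improving vision-language model performance | offline RL, multimodal
19 | Another BRIXEL in the Wall: Towards Cheaper Dense Features | Proposes BRIXEL, which uses knowledge distillation to cut the cost of computing dense features and improve downstream task performance | distillation, foundation model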

🔬 Pillar 3: Perception & Semantics (4 papers)

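# | Title | One-line summary | Tags | 🔗
20 | CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting | CLM: removes the GPU memory bottleneck for 3D Gaussian Splatting, enabling large-scale scene rendering | 3D gaussian splatting, 3DGS, gaussian splatting
21 | 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos | 4D3R: a motion-aware neural reconstruction and rendering framework for novel view synthesis of dynamic scenes from monocular videos | 3D gaussian splatting, 3DGS, gaussian splatting
22 | Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges | Splatography: sparse multi-view dynamic Gaussian Splatting for filmmaking challenges | gaussian splatting, splatting
23 | No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation | Proposes PITTA, a pose-agnostic, instance-aware test-time adaptation framework for monocular depth estimation | depth estimation, monocular depth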

🔬 Pillar 4: Generative Motion (3 papers)

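# | Title | One-line summary | Tags | 🔗
24 | Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance | Pressure2Motion: a hierarchical human motion reconstruction algorithm guided by ground pressure and text | physically plausible, human motion
25 | Dense Motion Captioning | Proposes the Dense Motion Captioning task and the CompMo dataset, and builds the DEMO model for 3D human motion understanding and captioning | text-to-motion, motion generation, human motion
26 | Learning Fourier shapes to probe the geometric world of deep neural networks | Proposes a Fourier-shape-based framework for probing the geometric world of deep neural networks | physically plausible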

🔬 Pillar 1: Robot Control (1 paper)

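# | Title | One-line summary | Tags | 🔗
27 | DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning | Proposes DeepForgeSeal, which uses latent-space watermarking and adversarial reinforcement learning for deepfake detection | manipulation, reinforcement learning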
