cs.CV (2025-06-13)

📊 29 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (9 🔗2) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1 🔗1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

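| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning | Proposes a dynamic mixture of curriculum LoRA experts for continual multimodal instruction tuning | large language model, multimodal | |
| 2 | DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs | Proposes DaMO to address temporal reasoning in video language models | large language model, multimodal | |
| 3 | TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models | Proposes TAViS to address cross-modal alignment in audio-visual segmentation | foundation model, multimodal | |
| 4 | VGR: Visual Grounded Reasoning | Proposes VGR to address language bias in multimodal reasoning | large language model, multimodal, chain-of-thought | |
| 5 | Exploring the Effectiveness of Deep Features from Domain-Specific Foundation Models in Retinal Image Synthesis | Proposes a deep-feature-based loss function to improve retinal image synthesis | foundation model | |
| 6 | VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories? | Proposes VFaith to evaluate the visual reasoning ability of multimodal models | multimodal | |
| 7 | Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation | Proposes a framework that enhances multimodal consistency and coherence for text-image plan generation | multimodal | |
| 8 | Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs | Proposes the Manager plugin to aggregate unimodal expert insights in two-tower VLMs and MLLMs | large language model, multimodal | |
| 9 | CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection | Proposes CLIPFUSION to address multimodal fusion in anomaly detection | foundation model | |
| 10 | Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model | Proposes LLaVA-NeXT-Interleave to address multi-image reasoning | foundation model | |
| 11 | A$^2$LC: Active and Automated Label Correction for Semantic Segmentation | Proposes the A$^2$LC framework to address label correction in semantic segmentation | foundation model | |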

🔬 Pillar 2: RL & Architecture (9 papers)

| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 12 | MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution | Proposes MambaVSR to address non-local dependency modeling in video super-resolution | Mamba, SSM, state space model | |
| 13 | How Visual Representations Map to Language Feature Space in Multimodal LLMs | Proposes a frozen model with a linear adapter to address vision-language alignment | representation learning, large language model, multimodal | |
| 14 | InceptionMamba: Efficient Multi-Stage Feature Enhancement with Selective State Space Model for Microscopic Medical Image Segmentation | Proposes InceptionMamba to address efficiency in microscopic medical image segmentation | Mamba, state space model | |
| 15 | AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments | Proposes AgentSense to address the lack of diverse labeled data in smart homes | world model, embodied AI, large language model | |
| 16 | Stop learning it all to mitigate visual hallucination, Focus on the hallucination target | Proposes a preference learning method to mitigate visual hallucination in multimodal large language models | preference learning, large language model, multimodal | |
| 17 | Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation | Proposes DISCOVR to address echocardiographic video representation learning | representation learning, distillation | |
| 18 | DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | Proposes DAVID-XR1 to address explainability in AI-generated video detection | distillation, chain-of-thought | |
| 19 | EasyARC: Evaluating Vision Language Models on True Visual Reasoning | Proposes EasyARC to address the evaluation of multimodal visual reasoning | reinforcement learning, multimodal | |
| 20 | Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization | Proposes Auto-Connect to address skeleton connectivity in automatic rigging | direct preference optimization | |

🔬 Pillar 3: Perception & Semantics (3 papers)

| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | GraphGSOcc: Semantic-Geometric Graph Transformer with Dynamic-Static Decoupling for 3D Gaussian Splatting-based Occupancy Prediction | Proposes GraphGSOcc to address dynamic-static coupling in 3D semantic occupancy prediction | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 22 | Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale | Proposes Affogato to address open-vocabulary affordance grounding | open-vocabulary, open vocabulary, affordance | |
| 23 | OV-MAP : Open-Vocabulary Zero-Shot 3D Instance Segmentation Map for Robots | Proposes OV-MAP to address open-world 3D instance segmentation | open-vocabulary, open vocabulary | |

🔬 Pillar 7: Motion Retargeting (2 papers)

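| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | Dynamic Double Space Tower | Proposes a dynamic double space tower to address insufficient reasoning in visual question answering | spatial relationship, multimodal | |
| 25 | SphereDrag: Spherical Geometry-Aware Panoramic Image Editing | Proposes SphereDrag to address geometric issues in panoramic image editing | geometric consistency | |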

🔬 Pillar 8: Physics-based Animation (2 papers)

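| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation | Proposes SignAligner to address multimodal coordination in sign language generation | spatiotemporal, multimodal | |
| 27 | EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment | Proposes EyeSim-VQA to address adaptive restoration in video quality assessment | spatiotemporal | |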

🔬 Pillar 6: Video Extraction (1 paper)

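| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | EgoPrivacy: What Your First-Person Camera Says About You? | Proposes EgoPrivacy to evaluate the privacy risks of first-person video | egocentric, egocentric vision, first-person view | |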

🔬 Pillar 1: Robot Control (1 paper)

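| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving | Proposes an efficient triplane-based multi-camera tokenization method to improve autonomous driving performance | motion planning | |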
