cs.CV (2025-10-20)

📊 39 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 2: RL & Architecture (9) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗2) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

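# | Title | Key point | Tags
1 | MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues | Proposes MT-Video-Bench for evaluating the video understanding of multimodal LLMs in multi-turn dialogues | large language model, multimodal
2 | $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs | VisiPruner: decodes discontinuous cross-modal dynamics in multimodal LLMs to enable efficient pruning | large language model, multimodal
3 | Towards a Generalizable Fusion Architecture for Multimodal Object Detection | Proposes the FMCAF architecture to improve the generalization and robustness of multimodal object detection | multimodal
4 | Glyph: Scaling Context Windows via Visual-Text Compression | Glyph: extends the context window of large language models through visual-text compression | large language model, multimodal
5 | Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention | Proposes Xihe, built on Hierarchical Interleaved Block Attention (HIBA), for scalable zero-shot time series learning | foundation model, zero-shot transfer
6 | iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA | Proposes iDETEX, which empowers multimodal LLMs to perform intelligent, detailed, and explainable image quality assessment | large language model, multimodal
7 | SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference | SparseVILA: decouples visual sparsity to accelerate VLM inference | multimodal
8 | Elastic ViTs from Pretrained Models without Retraining | Proposes SnapViT, which obtains elastic compute from pretrained ViT models without retraining | foundation model
9 | ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input | ImaGGen: zero-shot generation of co-speech semantic gestures grounded in language and image input | multimodal
10 | Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization | Proposes a zero-shot video summarization framework with context-aware pseudo-label scoring that improves LLM performance on video summarization | large language model
11 | Monitoring Horses in Stalls: From Object to Event Detection | Proposes a system based on YOLOv11 and BoT-SORT for monitoring horse behavior in stalls, with automatic event detection | foundation model
12 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | Proposes a recurrent attention-based token selection method for efficient streaming Video-LLMs | large language model
13 | Exploring The Missing Semantics In Event Modality | Proposes Semantic-E2VID, which uses visual semantic knowledge to improve event-to-video reconstruction | foundation model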

🔬 Pillar 2: RL & Architecture (9 papers)

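# | Title | Key point | Tags
14 | UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action | UltraCUA: a foundation model for computer-use agents that combines GUI actions with high-level tools | reinforcement learning, foundation model
15 | Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model | Proposes the IC-MoE model, which improves medical image segmentation foundation models through an intelligent-communication mixture-of-experts network | contrastive learning, foundation model
16 | Closed-Loop Transfer for Weakly-supervised Affordance Grounding | Proposes LoopTrans, a closed-loop framework for weakly supervised affordance grounding that improves performance in complex interaction scenes | distillation, affordance, egocentric
17 | CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference | CausalMamba: scalable conditional state space models for neural causal inference | Mamba, state space model
18 | Token-Level Inference-Time Alignment for Vision-Language Models | Proposes TITA, a lightweight framework for token-level inference-time alignment of vision-language models | DPO, direct preference optimization, multimodal
19 | World-in-World: World Models in a Closed-Loop World | World-in-World: the first closed-loop benchmark platform for world models, used to evaluate the predictive perception of embodied agents | world model
20 | Online In-Context Distillation for Low-Resource Vision Language Models | Proposes an online in-context distillation method that improves low-resource vision-language models | distillation
21 | SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries | SparseWorld: a flexible and efficient 4D occupancy world model based on sparse, dynamic queries | world model
22 | GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image | GACO-CAD: generates CAD models from a single image through geometry augmentation and conciseness optimization | reinforcement learning, large language model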

🔬 Pillar 3: Perception & Semantics (8 papers)

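# | Title | Key point | Tags
23 | From Volume Rendering to 3D Gaussian Splatting: Theory and Applications | Surveys 3D Gaussian Splatting from volume rendering to applications, covering real-time rendering and high-quality reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
24 | Raindrop GS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions | Raindrop GS: a comprehensive benchmark for 3D Gaussian Splatting reconstruction under raindrop conditions | 3D gaussian splatting, 3DGS, gaussian splatting
25 | Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models | Proposes PnF, which uses multimodal LLMs to enhance existing motion forecasting models without fine-tuning | scene understanding, large language model, multimodal
26 | Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS | Proposes ItG-GS, a stronger initialization pipeline that significantly improves rendering quality for sparse-view 3DGS | 3D gaussian splatting, 3DGS, gaussian splatting
27 | PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception | PAGE-4D: VGGT-4D perception of dynamic scenes with disentangled pose and geometry | depth estimation, VGGT
28 | Towards 3D Objectness Learning in an Open World | Proposes OP3Det for generic 3D object detection in the open world without text prompts | open-vocabulary, open vocabulary, foundation model
29 | HouseTour: A Virtual Real Estate A(I)gent | HouseTour: uses diffusion models to generate spatially aware 3D camera trajectories and natural-language summaries for real-estate scenes | 3D gaussian splatting, gaussian splatting, splatting
30 | DeepDetect: Learning All-in-One Dense Keypoints | DeepDetect: an end-to-end dense keypoint detection method that combines the strengths of classical detectors | visual odometry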

🔬 Pillar 1: Robot Control (3 papers)

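# | Title | Key point | Tags
31 | GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation | GSPlane: concise and accurate planar reconstruction via structured representation | manipulation, gaussian splatting, splatting
32 | SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving | SafeCoop: a full-stack safety defense framework for natural-language-based collaborative driving | manipulation
33 | ConsistEdit: Highly Consistent and Precise Training-free Visual Editing | ConsistEdit: a training-free visual editing method with high consistency and precision | manipulation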

🔬 Pillar 6: Video Extraction (2 papers)

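# | Title | Key point | Tags
34 | ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy | ManzaiSet: a multimodal dataset for analyzing viewer responses to Japanese manzai comedy | HuMoR, multimodal
35 | Leveraging AV1 motion vectors for Fast and Dense Feature Matching | Uses AV1 motion vectors for fast, dense feature matching, improving SfM efficiency | feature matching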

🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | Key point | Tags
36 | ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues | ViBED-Net: detects engagement in video using face-aware and scene-aware spatiotemporal cues | spatiotemporal
37 | MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models | MUG-V 10B: an efficient training framework for large-scale video generation models | spatiotemporal

🔬 Pillar 4: Generative Motion (1 paper)

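# | Title | Key point | Tags
38 | Capturing Head Avatar with Hand Contacts from a Monocular Video | Proposes a method for reconstructing head avatars from monocular video that models the deformations caused by hand contact | penetration, spatial relationship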

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point | Tags
39 | ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling | ShapeCraft: uses LLM agents to generate structured, textured, and interactive 3D models | spatial relationship
