cs.CV(2025-12-21)

📊 18 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 3: Perception & Semantics (4) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

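| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 1 | $M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models | Proposes the M³-Verse benchmark for evaluating how well large multimodal models understand object changes in dynamic scenes. | multimodal | |
| 2 | Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models | Delta-LLaVA: proposes a Base-then-Specialize alignment method for token-efficient vision-language models. | large language model, multimodal | |
| 3 | IPCV: Information-Preserving Compression for MLLM Visual Encoders | IPCV: an information-preserving compression framework for MLLM visual encoders. | large language model, multimodal | |
| 4 | SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback | SimpleCall: a lightweight, label-free image restoration agent based on MLLM perceptual feedback. | large language model, multimodal | |
| 5 | OpenView: Empowering MLLMs with Out-of-view VQA | Proposes OpenView to address out-of-view understanding in multimodal large language models. | large language model, multimodal | |
| 6 | In-Context Audio Control of Video Diffusion Transformers | Proposes the ICAC framework, which uses masked 3D attention for audio-driven video diffusion Transformers, improving lip sync and video quality. | foundation model | |
| 7 | SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse | SmartSight: mitigates hallucination in video LLMs via temporal attention collapse while also improving video understanding. | large language model | |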

🔬 Pillar 3: Perception & Semantics (4 papers)

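| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 8 | EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images | EcoSplat: an efficiency-controllable, single-pass feed-forward 3D Gaussian Splatting reconstruction method. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 9 | Geometric-Photometric Event-based 3D Gaussian Ray Tracing | Proposes event-based geometric-photometric 3D Gaussian ray tracing, improving the accuracy and efficiency of event-camera 3D reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 10 | A Study of Finetuning Video Transformers for Multi-view Geometry Tasks | Finetunes video Transformers to solve multi-view geometry tasks, reaching state-of-the-art results. | depth estimation, optical flow, foundation model | |
| 11 | SplatBright: Generalizable Low-Light Scene Reconstruction from Sparse Views via Physically-Guided Gaussian Enhancement | SplatBright: generalizable low-light scene reconstruction from sparse views via physically guided Gaussian enhancement. | scene reconstruction | |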

🔬 Pillar 2: RL & Architecture (4 papers)

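| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 12 | InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search | InSight-o3: enhances multimodal foundation models through generalized visual search. | reinforcement learning, foundation model, multimodal | |
| 13 | brat: Aligned Multi-View Embeddings for Brain MRI Analysis | Proposes brat, an aligned multi-view embedding framework for brain MRI analysis. | representation learning, feature matching, foundation model | |
| 14 | Enhancing Medical Large Vision-Language Models via Alignment Distillation | Proposes the MEDALIGN framework, which improves the visual understanding of medical large vision-language models through alignment distillation. | representation learning, distillation | |
| 15 | Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts | Proposes UniRect, a unified framework that uses a Mamba model to solve image correction and rectangling. | Mamba | |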

🔬 Pillar 4: Generative Motion (1 paper)

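| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 16 | EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer | EchoMotion: unified human video and motion generation via a dual-modality diffusion Transformer. | motion generation, human motion | |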

🔬 Pillar 8: Physics-based Animation (1 paper)

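| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 17 | CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis | Proposes CrashChat, a multimodal large language model for multitask traffic crash video analysis. | spatiotemporal, large language model, multimodal | |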

🔬 Pillar 1: Robot Control (1 paper)

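| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 18 | VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference | VizDefender: reveals visualization tampering through proactive localization and intent inference. | manipulation, large language model, multimodal | |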
