cs.CV (2025-12-17)

📊 29 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗5) · Pillar 2: RL & Architecture (8 🔗2) · Pillar 3: Perception & Semantics (4) · Pillar 8: Physics-based Animation (3) · Pillar 1: Robot Control (2) · Pillar 6: Video Extraction (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

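# | Title | Key point | Tags | 🔗
1 | Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning | Proposes an explainable action form assessment method and dataset based on multimodal CoT reasoning, addressing the assessment of action standardization. | multimodal, chain-of-thought |
2 | Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning | Skyra: AI-generated video detection via grounded artifact reasoning. | large language model, multimodal |
3 | GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models | Proposes the GRAN-TED framework for generating robust, aligned, and nuanced text embeddings for diffusion models. | large language model, multimodal |
4 | EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration | EmoCaliber: improves the reliability of visual emotion comprehension through confidence verbalization and calibration. | large language model, multimodal |
5 | Step-GUI Technical Report | Proposes Step-GUI, which achieves efficient, secure, and general GUI automation through self-evolving training and the GUI-MCP protocol. | large language model, multimodal |
6 | DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models | DiffusionVL: converts any autoregressive model into a diffusion vision-language model, improving performance and inference speed. | multimodal |
7 | Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics | Proposes TIMAR for causal turn-level modeling and generation of interactive 3D conversational head dynamics. | multimodal |
8 | Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models | Compares the performance of specialized counting architectures and vision-language models on visual enumeration tasks. | multimodal |
9 | Uni-Parser Technical Report | Uni-Parser: a high-throughput document parsing engine for scientific literature and patents. | large language model |
10 | PMMD: A pose-guided multi-view multi-modal diffusion for person generation | Proposes the PMMD framework, which achieves high-quality pose-guided person generation with a multi-view multi-modal diffusion model. | multimodal |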

🔬 Pillar 2: RL & Architecture (8 papers)

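# | Title | Key point | Tags | 🔗
11 | In Pursuit of Pixel Supervision for Visual Pre-training | Pixio: visual pre-training with pixel supervision for simple, efficient, and strong representation learning. | masked autoencoder, MAE, visual pre-training |
12 | EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence | EagleVision: a dual-stage framework with BEV-based chain-of-thought that improves spatial intelligence. | reinforcement learning, multimodal, chain-of-thought |
13 | Preserving Marker Specificity with Lightweight Channel-Independent Representation Learning | Proposes lightweight channel-independent representation learning to improve marker specificity. | representation learning, foundation model |
14 | MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement | Proposes MMMamba, a cross-modal in-context fusion framework for pan-sharpening and zero-shot image enhancement. | Mamba, multimodal |
15 | Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry | Proposes the Grounded Self-Distillation framework to address 3D hallucinations in monocular depth estimation. | distillation, monocular depth, foundation model |
16 | IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion | Proposes IMKD, which improves radar-camera fusion 3D object detection through intensity-aware multi-level knowledge distillation. | distillation |
17 | MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors | MoonSeg3R: monocular online zero-shot 3D segmentation using reconstructive foundation priors. | distillation, foundation model |
18 | SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering | Proposes the SMART model, which addresses partially view-aligned clustering through semantic matching contrastive learning. | contrastive learning |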

🔬 Pillar 3: Perception & Semantics (4 papers)

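# | Title | Key point | Tags | 🔗
19 | Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting | Proposes a new architecture to address the pixel-alignment problem in 3D Gaussian primitive detection. | 3D gaussian splatting, 3DGS, gaussian splatting |
20 | MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance | Proposes MVGSR, which achieves multi-view consistent 3D Gaussian super-resolution via epipolar guidance. | 3D gaussian splatting, 3DGS, gaussian splatting |
21 | Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering | Proposes Gaussian Pixel Codec Avatars (GPiCA), a hybrid avatar representation for efficient rendering. | 3D gaussian splatting, gaussian splatting, splatting |
22 | Spatia: Video Generation with Updatable Spatial Memory | Spatia: video generation with updatable spatial memory, improving spatiotemporal consistency. | visual SLAM |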

🔬 Pillar 8: Physics-based Animation (3 papers)

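# | Title | Key point | Tags | 🔗
23 | ST-DETrack: Identity-Preserving Branch Tracking in Entangled Plant Canopies via Dual Spatiotemporal Evidence | ST-DETrack: uses dual spatiotemporal evidence for identity-preserving branch tracking in complex plant canopies. | spatiotemporal |
24 | IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning | Proposes IC-Effect, which achieves precise and efficient video effects editing via in-context learning. | spatiotemporal, instruction following |
25 | Asynchronous Event Stream Noise Filtering for High-frequency Structure Deformation Measurement | Proposes an asynchronous event stream noise filtering method based on event cameras and LED markers for high-frequency structural deformation measurement. | spatiotemporal |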

🔬 Pillar 1: Robot Control (2 papers)

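# | Title | Key point | Tags | 🔗
26 | Multi-View Foundation Models | Proposes multi-view foundation models that improve feature consistency in multi-view scenes. | manipulation, feature matching, foundation model |
27 | VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics | VAAS: a vision-attention anomaly scoring method for image manipulation detection in digital forensics. | manipulation |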

🔬 Pillar 6: Video Extraction (1 paper)

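# | Title | Key point | Tags | 🔗
28 | GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection | Proposes GateFusion, which improves active speaker detection through hierarchical gated cross-modal fusion. | Ego4D, multimodal |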

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | Key point | Tags | 🔗
29 | DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations | DeX-Portrait: disentangled and expressive portrait animation via explicit and latent motion representations. | classifier-free guidance |
