cs.CV (2026-01-12)

📊 28 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗2) · Pillar 2: RL & Architecture (7 🔗2) · Pillar 1: Robot Control (2) · Pillar 3: Perception & Semantics (2) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

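# | Title | One-line summary | Tags | 🔗
1 | VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding | VideoLoom: a video large language model for joint spatial-temporal understanding. | large language model, multimodal |
2 | Robust Multicentre Detection and Classification of Colorectal Liver Metastases on CT: Application of Foundation Models | Uses foundation models for robust detection and classification of colorectal liver metastases on multicentre CT images. | foundation model |
3 | A Multimodal Dataset of Student Oral Presentations with Sensors and Evaluation Data | SOPHIAS: a multimodal dataset for oral presentation assessment. | multimodal |
4 | SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model | Proposes SIRR-LMM, which uses a large multimodal model for single-image reflection removal, and builds a high-quality synthetic dataset. | multimodal |
5 | ShowUI-Aloha: Human-Taught GUI Agent | ShowUI-Aloha: a GUI agent framework based on human demonstration. | Aloha |
6 | DIVER: Dynamic Iterative Visual Evidence Reasoning for Multimodal Fake News Detection | Proposes DIVER, a dynamic iterative visual evidence reasoning framework for multimodal fake news detection. | multimodal |
7 | HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression | HiVid-Narrator: a hierarchical video narrative generation framework with scene-primed, ASR-anchored compression, aimed at e-commerce videos. | multimodal, chain-of-thought |
8 | Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training | Proposes DualPD, which improves inter-layer consistency in MLLMs without training and addresses the "seeing right but saying wrong" problem. | large language model, multimodal |
9 | A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model | Proposes VISA-Mark, a prefix-tuning-based visual semantic adaptive watermarking method that protects the content copyright of large vision-language models. | multimodal, visual grounding |
10 | VENUS: Visual Editing with Noise Inversion Using Scene Graphs | VENUS: a training-free visual image editing framework based on scene graphs and noise inversion. | large language model, multimodal |
11 | PARL: Position-Aware Relation Learning Network for Document Layout Analysis | Proposes PARL, a position-aware relation learning network that improves document layout analysis performance. | multimodal |
12 | BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation | BenchSeg: a large-scale dataset and benchmark for multi-view food video segmentation. | multimodal |
13 | PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion | PanoSAMic: panoramic image segmentation based on SAM feature encoding and dual-view fusion. | foundation model |
14 | Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models | Proposes Focal Guidance to address the control problem of semantic-weak layers in video diffusion models. | instruction following |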

🔬 Pillar 2: RL & Architecture (7 papers)

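# | Title | One-line summary | Tags | 🔗
15 | Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification | Proposes a test-time adaptive hierarchical co-enhanced denoising network to address noise robustness in multimodal classification. | representation learning, multimodal |
16 | Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding | Proposes the CINEMA framework, which mimics human cognitive processes to improve multi-image reasoning. | reinforcement learning, large language model, multimodal |
17 | SDHSI-Net: Learning Better Representations for Hyperspectral Images via Self-Distillation | SDHSI-Net: learns better representations for hyperspectral images via self-distillation. | distillation, HSI |
18 | Variational Contrastive Learning for Skeleton-based Action Recognition | Proposes a variational contrastive learning framework that improves skeleton-based action recognition in low-label settings. | representation learning, contrastive learning |
19 | Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training | Proposes Self-Transcendence, which accelerates Diffusion Transformer training using only internal feature supervision, with no external guidance. | representation learning, classifier-free guidance |
20 | Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model | Proposes SNRA and AP-GRPO to improve the spatial reasoning ability of vision-language models in 3D scene understanding. | reinforcement learning, scene understanding |
21 | Few-shot Class-Incremental Learning via Generative Co-Memory Regularization | Proposes a generative co-memory regularization method to address few-shot class-incremental learning. | masked autoencoder, MAE |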

🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-line summary | Tags | 🔗
22 | Motion Focus Recognition in Fast-Moving Egocentric Video | Proposes a real-time method for recognizing motion focus in fast-moving egocentric video. | locomotion, egocentric, vision-language-action |
23 | SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations | SecureCAI: injection-resilient LLM assistants for cybersecurity operations. | manipulation, direct preference optimization, large language model |

🔬 Pillar 3: Perception & Semantics (2 papers)

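# | Title | One-line summary | Tags | 🔗
24 | Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization | Mon3tr: monocular 3D telepresence using pre-built Gaussian avatars. | 3D gaussian splatting, 3DGS, gaussian splatting |
25 | OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image | OSCAR: an open-set CAD model retrieval method based on a language prompt and a single image. | scene understanding |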

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
26 | GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models | GeoMotionGPT: enhances large language models through geometry-aligned motion understanding. | MotionGPT, large language model |

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
27 | PALUM: Part-based Attention Learning for Unified Motion Retargeting | PALUM: a unified motion retargeting method based on part-based attention learning. | motion retargeting |

🔬 Pillar 8: Physics-based Animation (1 paper)

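# | Title | One-line summary | Tags | 🔗
28 | PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis | PulseMind: a multimodal medical model for real-world clinical diagnosis that addresses heterogeneous inputs and contextual understanding. | PULSE |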
