cs.CV (2026-01-12)

📊 28 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗2) · Pillar 2: RL & Architecture (7 🔗2) · Pillar 1: Robot Control (2) · Pillar 3: Perception & Semantics (2) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding	VideoLoom: a video large language model for joint spatial-temporal understanding	large language model multimodal
2	Robust Multicentre Detection and Classification of Colorectal Liver Metastases on CT: Application of Foundation Models	Uses foundation models for robust detection and classification of colorectal liver metastases on multicentre CT images	foundation model
3	A Multimodal Dataset of Student Oral Presentations with Sensors and Evaluation Data	SOPHIAS: a multimodal dataset for evaluating oral presentations	multimodal
4	SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model	Proposes SIRR-LMM, which uses a large multimodal model for single-image reflection removal, and builds a high-quality synthetic dataset.	multimodal
5	ShowUI-Aloha: Human-Taught GUI Agent	ShowUI-Aloha: a GUI agent framework based on human demonstration	Aloha
6	DIVER: Dynamic Iterative Visual Evidence Reasoning for Multimodal Fake News Detection	Proposes DIVER, a dynamic iterative visual evidence reasoning framework for multimodal fake news detection	multimodal
7	HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression	HiVid-Narrator: proposes a hierarchical video narrative generation framework with scene-primed, ASR-anchored compression for e-commerce videos.	multimodal chain-of-thought
8	Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training	Proposes DualPD, which improves inter-layer consistency in MLLMs without training, addressing the "seeing right but saying wrong" problem	large language model multimodal
9	A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model	Proposes VISA-Mark, a prefix-tuning-based visual semantic adaptive watermarking method for protecting the content copyright of large vision-language models.	multimodal visual grounding
10	VENUS: Visual Editing with Noise Inversion Using Scene Graphs	VENUS: a training-free image editing framework based on scene graphs and noise inversion	large language model multimodal
11	PARL: Position-Aware Relation Learning Network for Document Layout Analysis	Proposes PARL, a position-aware relation learning network for improving document layout analysis performance.	multimodal
12	BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation	BenchSeg: a large-scale dataset and benchmark for multi-view food video segmentation	multimodal	✅
13	PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion	PanoSAMic: panoramic image segmentation based on SAM feature encoding and dual-view fusion	foundation model	✅
14	Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models	Proposes Focal Guidance to address the control problem of semantic-weak layers in video diffusion models	instruction following

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
15	Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification	Proposes a test-time adaptive hierarchical co-enhanced denoising network to address noise robustness in multimodal classification.	representation learning multimodal
16	Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding	Proposes the CINEMA framework, which mimics human cognitive processes to improve multi-image reasoning	reinforcement learning large language model multimodal
17	SDHSI-Net: Learning Better Representations for Hyperspectral Images via Self-Distillation	SDHSI-Net: learning better representations of hyperspectral images via self-distillation	distillation HSI	✅
18	Variational Contrastive Learning for Skeleton-based Action Recognition	Proposes a variational contrastive learning framework that improves skeleton-based action recognition in low-label settings	representation learning contrastive learning
19	Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training	Proposes Self-Transcendence, which accelerates Diffusion Transformer training using only internal feature supervision, with no external guidance.	representation learning classifier-free guidance	✅
20	Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model	Proposes SNRA and AP-GRPO to improve the spatial reasoning of vision-language models in 3D scene understanding.	reinforcement learning scene understanding
21	Few-shot Class-Incremental Learning via Generative Co-Memory Regularization	Proposes generative co-memory regularization to address few-shot class-incremental learning	masked autoencoder MAE

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	Motion Focus Recognition in Fast-Moving Egocentric Video	Proposes a real-time method for recognizing motion focus in fast-moving egocentric video	locomotion egocentric vision-language-action
23	SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations	SecureCAI: an injection-resilient LLM assistant for cybersecurity operations	manipulation direct preference optimization large language model

🔬 Pillar 3: Perception & Semantics (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization	Mon3tr: monocular 3D telepresence using pre-built Gaussian avatars	3D gaussian splatting 3DGS gaussian splatting
25	OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image	OSCAR: an open-set CAD model retrieval method based on a language prompt and a single image	scene understanding

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
26	GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models	GeoMotionGPT: enhancing large language models with geometry-aligned motion understanding	MotionGPT large language model

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
27	PALUM: Part-based Attention Learning for Unified Motion Retargeting	PALUM: proposes a unified motion retargeting method based on part-based attention learning	motion retargeting

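🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis	PulseMind: a multimodal medical model for real-world clinical diagnosis that addresses heterogeneous inputs and contextual understanding.	PULSE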
