cs.CV (2026-02-03)

📊 35 papers in total | 🔗 7 with code

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (15 🔗2) · Pillar 2: RL & Architecture (9 🔗4) · Pillar 3: Perception & Semantics (4) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

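#	Title	One-line summary	Tags	🔗	⭐
1	QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization	QVLA: an action-sensitive channel-wise quantization framework for VLA models in embodied control	vision-language-action VLA large language model
2	MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment	Introduces the MM-SCALE dataset, which improves multimodal moral reasoning through scalar judgments and listwise alignment	multimodal
3	Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis	Proposes a quasi-multimodal pathophysiological feature learning framework for retinal disease diagnosis	multimodal
4	Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization	Proposes the C3PO framework, which mitigates hallucination in multimodal reasoning models through CoT compression and contrastive preference optimization	multimodal
5	Z3D: Zero-Shot 3D Visual Grounding from Images	Proposes Z3D for zero-shot 3D visual grounding using only multi-view images	visual grounding	✅
6	Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases	Proposes FOCUS, a framework built on a vision foundation model that automates the full diagnostic workflow for retinal diseases from 3D OCT	foundation model
7	FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion	FSOD-VFM: few-shot object detection using vision foundation models and graph diffusion	foundation model	✅
8	FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation	Proposes FinMTM, a multi-turn multimodal benchmark for financial reasoning and agent evaluation	multimodal
9	A generalizable large-scale foundation model for musculoskeletal radiographs	SKELEX: a generalizable large-scale foundation model for musculoskeletal radiographs	foundation model
10	VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering	Proposes the VOILA framework, which uses value of information to guide fidelity selection in multimodal question answering under resource constraints	multimodal
11	PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation	Proposes the PnP-U3D framework, which combines autoregressive and diffusion models to unify 3D understanding and generation	large language model multimodal
12	Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation	Proposes Refer-Agent, which applies reasoning and reflection to referring video object segmentation	large language model
13	SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM	Proposes the SlowFocus mechanism to improve fine-grained temporal understanding in video LLMs	large language model
14	Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning	Proposes the LogiCls framework for interpretable logical anomaly classification in industrial images through constraint decomposition and instruction fine-tuning	chain-of-thought
15	MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration	MUSE: a multi-agent framework for unconstrained story envisioning through closed-loop cognitive orchestration	multimodal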

🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
16	Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation	Proposes Video-OPD, which efficiently post-trains multimodal large language models for temporal video grounding via on-policy distillation	reinforcement learning distillation large language model
17	Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning	DualSpeed: speeds up multimodal large language model training through visual token pruning while addressing the training-inference mismatch	distillation large language model multimodal	✅
18	Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance	Proposes the DiSCo and Table-GLS frameworks for efficient LVLM table reasoning through structure-content disentanglement and structure-aware guidance	reinforcement learning multimodal
19	MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning	MedSAM-Agent: strengthens interactive medical image segmentation with multi-turn agentic reinforcement learning	reinforcement learning reward design large language model	✅
20	Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction	Socratic-Geo: synthetic data generation and geometric reasoning through multi-agent interaction	preference learning large language model multimodal
21	LIVE: Long-horizon Interactive Video World Modeling	LIVE: long-horizon interactive video world modeling with a cycle-consistency constraint	world model distillation
22	From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning	Proposes synergistic representation learning (SRL) to close the encoder-decoder representation gap in unsupervised video object-centric learning	representation learning	✅
23	A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures	Proposes EB-JEPA, a lightweight library for learning and applying energy-based joint-embedding predictive architectures	world model representation learning	✅
24	FARTrack: Fast Autoregressive Visual Tracking with High Performance	FARTrack: a fast autoregressive visual tracking framework that combines high performance with high efficiency	teacher-student distillation

🔬 Pillar 3: Perception & Semantics (4 papers)

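#	Title	One-line summary	Tags	🔗	⭐
25	Constrained Dynamic Gaussian Splatting	Proposes constrained dynamic Gaussian splatting to address excessive memory use in dynamic scene reconstruction	gaussian splatting splatting scene reconstruction
26	Hand3R: Online 4D Hand-Scene Reconstruction in the Wild	Hand3R: proposes the first online framework for joint 4D hand-scene reconstruction from monocular video	scene reconstruction hand reconstruction embodied AI
27	SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation	SharpTimeGS: sharp and stable dynamic Gaussian splatting via lifespan modulation	gaussian splatting splatting
28	Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models	EvoAug: automatically evolves powerful task-specific data augmentation strategies using generative models	NeRF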

🔬 Pillar 8: Physics-based Animation (2 papers)

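#	Title	One-line summary	Tags	🔗	⭐
29	EventFlash: Towards Efficient MLLMs for Event-Based Vision	EventFlash: an efficient multimodal large language model for event-based vision that accelerates inference through spatiotemporal sparsification	spatiotemporal large language model foundation model
30	Unifying Watermarking via Dimension-Aware Mapping	Proposes the dimension-aware mapping (DiM) framework, which unifies existing deep watermarking methods and supports multi-dimensional watermarking	spatiotemporal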

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
31	3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation	Proposes 3DiMo to address motion control in human video generation	SMPL human motion
32	FOVI: A biologically-inspired foveated interface for deep vision models	Proposes FOVI, a biologically inspired foveated interface for efficient deep vision models	egocentric egocentric vision	✅

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane	Proposes Spiral RoPE to overcome the limits of axial RoPE in modeling diagonal spatial relationships in vision Transformers	spatial relationship large language model
34	RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images	RAWDet-7: a multi-scenario benchmark dataset for object detection and description on quantized RAW images	spatial relationship

🔬 Pillar 1: Robot Control (1 paper)

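#	Title	One-line summary	Tags	🔗	⭐
35	Continuous Control of Editing Models via Adaptive-Origin Guidance	Proposes AdaOr, which adaptively adjusts editing models to give smooth strength control in text-guided image and video editing	manipulation classifier-free guidance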
