cs.CV (2026-02-03)

📊 35 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗2) · Pillar 2: RL & Architecture (9 🔗4) · Pillar 3: Perception & Semantics (4) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

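| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization | QVLA: proposes an action-sensitive channel-wise quantization framework for VLA models, targeting embodied control. | vision-language-action, VLA, large language model | |
| 2 | MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment | Proposes the MM-SCALE dataset, which improves multimodal moral reasoning through scalar judgment and listwise alignment. | multimodal | |
| 3 | Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis | Proposes a quasi-multimodal pathophysiological feature learning framework for retinal disease diagnosis. | multimodal | |
| 4 | Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization | Proposes the C3PO framework, which mitigates hallucination in multimodal reasoning models through CoT compression and contrastive preference optimization. | multimodal | |
| 5 | Z3D: Zero-Shot 3D Visual Grounding from Images | Proposes Z3D, which addresses zero-shot 3D visual grounding using only multi-view images. | visual grounding | |
| 6 | Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases | Proposes FOCUS, a Vision Foundation Model-based framework that automates the full diagnostic workflow for retinal diseases from 3D OCT. | foundation model | |
| 7 | FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion | FSOD-VFM: few-shot object detection using vision foundation models and graph diffusion. | foundation model | |
| 8 | FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation | Proposes FinMTM, a multi-turn multimodal benchmark for financial reasoning and agent evaluation. | multimodal | |
| 9 | A generalizable large-scale foundation model for musculoskeletal radiographs | SKELEX: a generalizable large-scale foundation model for musculoskeletal radiographs. | foundation model | |
| 10 | VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering | Proposes the VOILA framework, which uses value of information to guide fidelity selection in multimodal question answering, optimizing for resource-constrained settings. | multimodal | |
| 11 | PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation | Proposes the PnP-U3D framework, which combines autoregressive and diffusion models to unify 3D understanding and generation. | large language model, multimodal | |
| 12 | Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation | Proposes Refer-Agent to address reasoning and reflection in referring video object segmentation. | large language model | |
| 13 | SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM | Proposes the SlowFocus mechanism, which enhances video LLMs' understanding of fine-grained temporal information. | large language model | |
| 14 | Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning | Proposes the LogiCls framework, which achieves interpretable logical anomaly classification for industrial images through constraint decomposition and instruction fine-tuning. | chain-of-thought | |
| 15 | MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration | MUSE: a multi-agent framework with closed-loop cognitive orchestration for unconstrained story envisioning. | multimodal | |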

🔬 Pillar 2: RL & Architecture (9 papers)

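| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation | Proposes Video-OPD, which efficiently post-trains multimodal large language models for temporal video grounding via on-policy distillation. | reinforcement learning, distillation, large language model | |
| 17 | Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning | DualSpeed: accelerates training of multimodal large language models via visual token pruning while addressing the training-inference mismatch. | distillation, large language model, multimodal | |
| 18 | Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance | Proposes the DiSCo and Table-GLS frameworks, which efficiently address structure-content disentanglement and structure-aware guidance for LVLMs in table reasoning. | reinforcement learning, multimodal | |
| 19 | MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning | MedSAM-Agent: enhances interactive medical image segmentation with multi-turn agentic reinforcement learning. | reinforcement learning, reward design, large language model | |
| 20 | Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction | Socratic-Geo: generates synthetic data and performs geometric reasoning through multi-agent interaction. | preference learning, large language model, multimodal | |
| 21 | LIVE: Long-horizon Interactive Video World Modeling | LIVE: long-horizon interactive video world modeling via cycle-consistency constraints. | world model, distillation | |
| 22 | From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning | Proposes Synergistic Representation Learning (SRL), which addresses the encoder-decoder representation gap in unsupervised video object-centric learning. | representation learning | |
| 23 | A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures | Proposes EB-JEPA, a lightweight library for learning and applying energy-based joint-embedding predictive architectures. | world model, representation learning | |
| 24 | FARTrack: Fast Autoregressive Visual Tracking with High Performance | FARTrack: a fast autoregressive visual tracking framework that balances high performance and high efficiency. | teacher-student, distillation | |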

🔬 Pillar 3: Perception & Semantics (4 papers)

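| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | Constrained Dynamic Gaussian Splatting | Proposes Constrained Dynamic Gaussian Splatting, which addresses excessive memory usage in dynamic scene reconstruction. | gaussian splatting, splatting, scene reconstruction | |
| 26 | Hand3R: Online 4D Hand-Scene Reconstruction in the Wild | Hand3R: proposes the first framework for online joint 4D hand-scene reconstruction from monocular video. | scene reconstruction, hand reconstruction, embodied AI | |
| 27 | SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation | SharpTimeGS: sharp and stable dynamic Gaussian splatting via lifespan modulation. | gaussian splatting, splatting | |
| 28 | Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models | EvoAug: uses generative models to automatically evolve powerful task-specific data augmentation strategies. | NeRF | |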

🔬 Pillar 8: Physics-based Animation (2 papers)

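| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | EventFlash: Towards Efficient MLLMs for Event-Based Vision | EventFlash: an efficient multimodal large language model for event-based vision that accelerates inference via spatiotemporal sparsification. | spatiotemporal, large language model, foundation model | |
| 30 | Unifying Watermarking via Dimension-Aware Mapping | Proposes the Dimension-Aware Mapping (DiM) framework, which unifies existing deep watermarking methods and enables multi-dimensional watermarking. | spatiotemporal | |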

🔬 Pillar 6: Video Extraction (2 papers)

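| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation | Proposes 3DiMo to address motion control in human video generation. | SMPL, human motion | |
| 32 | FOVI: A biologically-inspired foveated interface for deep vision models | Proposes FOVI, a biologically inspired foveated interface for efficient deep vision models. | egocentric, egocentric vision | |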

🔬 Pillar 7: Motion Retargeting (2 papers)

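| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 33 | Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane | Proposes Spiral RoPE, which addresses the limitation of axial RoPE in modeling diagonal spatial relationships in vision Transformers. | spatial relationship, large language model | |
| 34 | RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images | RAWDet-7: a multi-scenario benchmark dataset for object detection and description on quantized RAW images. | spatial relationship | |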

🔬 Pillar 1: Robot Control (1 paper)

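| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 35 | Continuous Control of Editing Models via Adaptive-Origin Guidance | Proposes AdaOr, which adaptively adjusts editing models to enable smooth strength control in text-guided image/video editing. | manipulation, classifier-free guidance | |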
