cs.CV (2026-01-29)

📊 41 papers in total | 🔗 10 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (19 🔗5) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 3: Perception & Semantics (5 🔗3) · Pillar 1: Robot Control (5 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (19 papers)

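# | Title | One-sentence summary | Tags | 🔗
1 | Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | Vision-DeepResearch: improves multimodal large language models on complex visual tasks through multi-turn, multi-entity, multi-scale search. | large language model, foundation model, multimodal
2 | RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning | Proposes RSGround-R1 to address spatial reasoning in remote sensing visual grounding. | large language model, multimodal, visual grounding
3 | Thinker: A vision-language foundation model for embodied intelligence | Thinker: a vision-language foundation model for embodied intelligence that tackles robot perception and reasoning challenges. | foundation model, visual grounding, chain-of-thought
4 | UEval: A Benchmark for Unified Multimodal Generation | UEval: a benchmark for evaluating unified multimodal generation models. | large language model, multimodal
5 | MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods | MMFineReason: closes the multimodal reasoning gap through open data-centric methods. | multimodal, chain-of-thought
6 | CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models | Proposes CG-MLLM to address the low-resolution problem in 3D content generation. | large language model, multimodal
7 | MultiModal Fine-tuning with Synthetic Captions | Proposes a multimodal fine-tuning method based on synthetic captions generated by multimodal large language models, improving image classification performance. | large language model, multimodal
8 | Understanding Multimodal Complementarity for Single-Frame Action Anticipation | Proposes AAG+, a single-frame action anticipation framework that fuses multimodal information and performs on par with video-based methods. | multimodal
9 | VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models | VideoAesBench: a comprehensive benchmark for evaluating the video aesthetics perception capabilities of large multimodal models. | multimodal
10 | When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning | Proposes the Dispersive and Anchoring Geometric Regularizer to address pathological geometric structure in multimodal learning. | multimodal
11 | Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking | Proposes a hypernetwork-based adaptive aggregation Transformer for predicting the need for coronary calcium debulking. | multimodal
12 | Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation | UniMRG: enhances the understanding capability of unified multimodal models through multi-representation generation. | multimodal
13 | Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations | Pseudotime analysis of the representation space indicates that pathology pretrained models encode disease progression. | foundation model
14 | ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing | Proposes the ChartE$^{3}$ benchmark for comprehensive evaluation and capability improvement in end-to-end chart editing. | large language model, multimodal
15 | LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models | LAMP: learns universal adversarial perturbations for multi-image tasks via pre-trained models. | large language model, multimodal
16 | Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation | Proposes DGNav to address the granularity rigidity of topological maps in vision-language navigation, improving navigation performance. | VLN
17 | OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models | Proposes OCRVerse for holistic OCR in end-to-end vision-language models, handling text and visual elements in a unified way. | multimodal
18 | Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention | Spava: accelerates long-video understanding via sequence-parallelism-aware approximate attention. | multimodal
19 | MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations | MPF-Net: exposes high-fidelity AI-generated video forgeries via hierarchical manifold deviation and micro-temporal fluctuations. | foundation model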

🔬 Pillar 2: RL & Architecture (8 papers)

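# | Title | One-sentence summary | Tags | 🔗
20 | Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving | Drive-JEPA: an end-to-end autonomous driving framework combining video JEPA with multimodal trajectory distillation. | world model, distillation, scene understanding
21 | Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery | Proposes HMRMamba, which uses geometry awareness and motion guidance for more accurate video human mesh recovery. | Mamba, SSM, state space model
22 | Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification | Proposes Multimodal Visual Surrogate Compression (MVSC) to improve Alzheimer's disease classification accuracy. | representation learning, foundation model, multimodal
23 | CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection | Proposes CAF-Mamba, a Mamba-based cross-modal adaptive attention fusion framework for multimodal depression detection. | Mamba, multimodal
24 | Improving Classifier-Free Guidance of Flow Matching via Manifold Projection | Proposes a classifier-free guidance method for flow matching based on manifold projection, improving generation quality and controllability. | flow matching, classifier-free guidance
25 | PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization | PathReasoner-R1: instills structured reasoning into pathology vision-language models via knowledge-guided policy optimization. | reinforcement learning, distillation, chain-of-thought
26 | Learning Transient Convective Heat Transfer with Geometry Aware World Models | Proposes a geometry-aware world model for learning transient convective heat transfer. | world model
27 | WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models | WorldBench: a disentangled video benchmark for diagnosing the physics understanding of world models. | world model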

🔬 Pillar 3: Perception & Semantics (5 papers)

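# | Title | One-sentence summary | Tags | 🔗
28 | MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources | MetricAnything: scales metric depth pretraining using noisy heterogeneous data sources. | depth estimation, monocular depth, metric depth
29 | Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery | Proposes the SDCI framework to address geometric localization and semantic prediction challenges in open-vocabulary semantic segmentation of remote sensing imagery. | open-vocabulary, open vocabulary, foundation model
30 | From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding | Reveals geometric degradation of NeRF in dense scenes and proposes an explicit geometry reconstruction method based on voxel rasterization. | NeRF, neural radiance field, scene understanding
31 | PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction | PLANING: a loosely coupled triangle-Gaussian framework for streaming 3D reconstruction. | gaussian splatting, splatting, embodied AI
32 | Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference | Proposes a lightweight, high-fidelity, low-bitrate 3D face compression method for video conferencing. | 3DGS, NeRF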

🔬 Pillar 1: Robot Control (5 papers)

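# | Title | One-sentence summary | Tags | 🔗
33 | Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization | Proposes RT-DeepLoc, a reconstruction-error-based framework for weakly supervised multimodal deepfake temporal localization. | manipulation, masked autoencoder, MAE
34 | EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers | EditYourself: audio-driven generation and editing of talking head videos based on diffusion Transformers. | manipulation, human motion, spatiotemporal
35 | TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention | Proposes TraceRouter to address the safety of large foundation models. | manipulation, foundation model
36 | DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning | DreamActor-M2: a universal character image animation framework based on spatiotemporal in-context learning. | humanoid, spatiotemporal
37 | Causal World Modeling for Robot Control | LingBot-VA: a robot control framework based on causal world modeling that improves long-horizon manipulation and generalization. | manipulation, world model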

🔬 Pillar 7: Motion Retargeting (2 papers)

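# | Title | One-sentence summary | Tags | 🔗
38 | Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning | Proposes a pyramidal Shapley-Taylor learning framework for fine-grained motion-language retrieval. | human motion
39 | HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence | HiFi-Mesh: high-fidelity, efficient 3D mesh generation via compact autoregressive dependence. | geometric consistency, spatiotemporal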

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
40 | PI-Light: Physics-Inspired Diffusion for Full-Image Relighting | Proposes PI-Light, which uses a physics-inspired diffusion model for full-image relighting. | physically plausible

🔬 Pillar 8: Physics-based Animation (1 paper)

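# | Title | One-sentence summary | Tags | 🔗
41 | Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion | Proposes the PaFu-KV cache policy, which uses salience estimation to improve the efficiency and quality of autoregressive video diffusion models. | spatiotemporal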
