cs.CV (2026-01-29)

📊 41 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (19 🔗5) | Pillar 2: RL & Architecture (8 🔗1) | Pillar 3: Perception & Semantics (5 🔗3) | Pillar 1: Robot Control (5 🔗1) | Pillar 7: Motion Retargeting (2) | Pillar 4: Generative Motion (1) | Pillar 8: Physics-based Animation (1)

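🔬 Pillar 9: Embodied Foundation Models (19 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models	Vision-DeepResearch: improves multimodal large language models on complex visual tasks through multi-turn, multi-entity, multi-scale search.	large language model, foundation model, multimodal	✅
2	RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning	Proposes RSGround-R1 to address spatial reasoning in remote sensing visual grounding.	large language model, multimodal, visual grounding
3	Thinker: A vision-language foundation model for embodied intelligence	Thinker: a vision-language foundation model for embodied intelligence that tackles robot perception and reasoning challenges.	foundation model, visual grounding, chain-of-thought
4	UEval: A Benchmark for Unified Multimodal Generation	UEval: a benchmark for evaluating unified multimodal generation models.	large language model, multimodal
5	MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods	MMFineReason: closes the multimodal reasoning gap through open, data-centric methods.	multimodal, chain-of-thought
6	CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models	Proposes CG-MLLM to address the low-resolution problem in 3D content generation.	large language model, multimodal
7	MultiModal Fine-tuning with Synthetic Captions	Proposes a multimodal fine-tuning method that uses synthetic captions generated by multimodal large language models to improve image classification performance.	large language model, multimodal	✅
8	Understanding Multimodal Complementarity for Single-Frame Action Anticipation	Proposes AAG+, a single-frame action anticipation framework that fuses multimodal information and performs on par with video-based methods.	multimodal
9	VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models	VideoAesBench: a comprehensive benchmark for evaluating the video aesthetics perception capabilities of large multimodal models.	multimodal
10	When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning	Proposes the Dispersive and Anchoring Geometric Regularizer to address pathological geometric structure in multimodal learning.	multimodal
11	Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking	Proposes a hypernetwork-based adaptive aggregation Transformer for predicting the need for coronary calcium debulking.	multimodal	✅
12	Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation	UniMRG: enhances the understanding capability of unified multimodal models through multi-representation generation.	multimodal
13	Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations	Pseudotime analysis in representation space shows that pathology pre-trained models encode disease progression.	foundation model
14	ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing	Proposes the ChartE$^{3}$ benchmark for comprehensive evaluation and capability improvement of end-to-end chart editing.	large language model, multimodal
15	LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models	LAMP: learns universal adversarial perturbations for multi-image tasks via pre-trained models.	large language model, multimodal
16	Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation	Proposes DGNav to address the granularity rigidity of topological maps in vision-language navigation, improving navigation performance.	VLN	✅
17	OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models	Proposes OCRVerse, which achieves holistic OCR in end-to-end vision-language models, handling text and visual elements in a unified way.	multimodal
18	Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention	Spava: accelerates long-video understanding via sequence-parallelism-aware approximate attention.	multimodal	✅
19	MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations	MPF-Net: exposes high-fidelity AI-generated video forgeries via hierarchical manifold deviation and micro-temporal fluctuations.	foundation model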

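🔬 Pillar 2: RL & Architecture (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
20	Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving	Drive-JEPA: an end-to-end autonomous driving framework that combines video JEPA with multimodal trajectory distillation.	world model, distillation, scene understanding
21	Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery	Proposes HMRMamba, which uses geometry awareness and motion guidance for more accurate video human mesh recovery.	Mamba, SSM, state space model
22	Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification	Proposes Multimodal Visual Surrogate Compression (MVSC) to improve Alzheimer's disease classification accuracy.	representation learning, foundation model, multimodal
23	CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection	Proposes CAF-Mamba, a Mamba-based cross-modal adaptive attention fusion framework for multimodal depression detection.	Mamba, multimodal
24	Improving Classifier-Free Guidance of Flow Matching via Manifold Projection	Proposes a classifier-free guidance method for flow matching based on manifold projection, improving generation quality and controllability.	flow matching, classifier-free guidance
25	PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization	PathReasoner-R1: instills structured reasoning into pathology vision-language models through knowledge-guided policy optimization.	reinforcement learning, distillation, chain-of-thought	✅
26	Learning Transient Convective Heat Transfer with Geometry Aware World Models	Proposes a geometry-aware world model for learning transient convective heat transfer.	world model
27	WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models	WorldBench: a disentangled video benchmark for diagnosing the physical understanding of world models.	world model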

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
28	MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources	MetricAnything: scales metric depth pretraining using noisy, heterogeneous data sources.	depth estimation, monocular depth, metric depth	✅
29	Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery	Proposes the SDCI framework to address geometric localization and semantic prediction challenges in open-vocabulary semantic segmentation of remote sensing imagery.	open-vocabulary, open vocabulary, foundation model	✅
30	From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding	Reveals geometric degradation of NeRF in dense scenes and proposes an explicit geometry reconstruction method based on voxel rasterization.	NeRF, neural radiance field, scene understanding
31	PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction	PLANING: a loosely coupled triangle-Gaussian framework for streaming 3D reconstruction.	gaussian splatting, splatting, embodied AI	✅
32	Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference	Proposes a lightweight, high-fidelity, low-bitrate 3D face compression method for video conferencing.	3DGS, NeRF

🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
33	Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization	Proposes RT-DeepLoc, a reconstruction-error-based framework for weakly supervised multimodal deepfake temporal localization.	manipulation, masked autoencoder, MAE
34	EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers	EditYourself: audio-driven generation and editing of talking head videos based on diffusion Transformers.	manipulation, human motion, spatiotemporal
35	TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention	Proposes TraceRouter to address safety in large foundation models.	manipulation, foundation model
36	DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning	DreamActor-M2: a universal character image animation framework based on spatiotemporal in-context learning.	humanoid, spatiotemporal	✅
37	Causal World Modeling for Robot Control	LingBot-VA: a robot control framework based on causal world models that improves long-horizon manipulation and generalization.	manipulation, world model

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
38	Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning	Proposes a pyramidal Shapley-Taylor learning framework for fine-grained motion-language retrieval.	human motion
39	HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence	HiFi-Mesh: high-fidelity, efficient 3D mesh generation via compact autoregressive dependence.	geometric consistency, spatiotemporal

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
40	PI-Light: Physics-Inspired Diffusion for Full-Image Relighting	Proposes PI-Light, which uses a physics-inspired diffusion model for full-image relighting.	physically plausible

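🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
41	Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion	Proposes the PaFu-KV cache policy, which uses salience estimation to improve the efficiency and quality of autoregressive video diffusion models.	spatiotemporal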
