cs.CV (2026-02-11)

📊 28 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (10 🔗2) · Pillar 9: Embodied Foundation Models (8 🔗3) · Pillar 1: Robot Control (3) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology	DermFM-Zero: a vision-language foundation model for zero-shot clinical collaboration in dermatology	contrastive learning foundation model multimodal
2	MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning	Proposes MetaphorStar, which uses end-to-end visual reinforcement learning to tackle image metaphor understanding and reasoning	reinforcement learning large language model multimodal
3	3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars	3DXTalker: expressive 3D talking-avatar generation that unifies identity, lip sync, emotion, and spatial dynamics	flow matching motion generation
4	HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images	Proposes HII-DPO, which eliminates hallucination in vision-language models using hallucination-inducing counterfactual images	DPO multimodal
5	Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling	Proposes DiNa-LRM, a diffusion-native latent reward model that improves the efficiency of preference optimization for diffusion models	flow matching preference learning multimodal
6	FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference	FastFlow: accelerates generative flow matching models with bandit inference	flow matching distillation	✅
7	LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation	LaSSM: 3D instance segmentation based on local aggregation and state space models	SSM state space model	✅
8	Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data	Proposes a spectral-spatial contrastive learning framework for regression on hyperspectral data, improving model performance	representation learning contrastive learning
9	Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning	Proposes a self-supervised image super-resolution quality assessment method based on content-free, multi-model oriented representation learning	representation learning contrastive learning
10	Dual-End Consistency Model	Proposes the Dual-End Consistency Model (DE-CM), which addresses unstable training and inflexible sampling in consistency models to enable efficient image generation	flow matching distillation

🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
11	RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation	RSHallu: hallucination evaluation and domain-tailored mitigation for remote-sensing multimodal large language models	large language model multimodal visual grounding
12	C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning	Proposes C^2RoPE, which addresses the limitations of RoPE positional encoding in 3D large multimodal model reasoning	large language model multimodal	✅
13	PhyCritic: Multimodal Critic Models for Physical AI	Proposes PhyCritic to improve the evaluation and alignment of multimodal models on physical AI tasks	multimodal
14	Ecological mapping with geospatial foundation models	Studies ecological mapping with geospatial foundation models; TerraMind performs strongly	foundation model
15	VideoSTF: Stress-Testing Output Repetition in Video Large Language Models	VideoSTF: a benchmark framework for evaluating output repetition in video large language models	large language model	✅
16	Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance	Proposes a text-guided framework for weakly supervised multimodal video anomaly detection that improves anomaly feature representation	multimodal
17	Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting	Proposes the Chain-of-Look spatial reasoning framework to tackle dense surgical instrument counting	large language model multimodal
18	TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning	Proposes TwiFF, a large-scale dataset and model for dynamic visual reasoning	multimodal chain-of-thought	✅

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion	HairWeaver: few-shot photorealistic hair motion synthesis based on diffusion models	sim-to-real sim2real motion synthesis
20	Towards Learning a Generalizable 3D Scene Representation from 2D Observations	Proposes a generalizable neural radiance field method for 3D reconstruction of a robot's global workspace	humanoid humanoid robot manipulation
21	Chatting with Images for Introspective Visual Thinking	Proposes ViLaVT, which strengthens introspective visual reasoning in vision-language models through interactive dialogue with images	manipulation reinforcement learning

🔬 Pillar 3: Perception & Semantics (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models	AugVLA-3D: a vision-language-action model with depth-driven feature augmentation	depth estimation VGGT vision-language-action
23	PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation	PuriLight: a lightweight shuffle-and-purification framework for monocular depth estimation	depth estimation monocular depth	✅
24	Interpretable Vision Transformers in Monocular Depth Estimation via SVDA	Proposes an SVDA-based Transformer for monocular depth estimation with an interpretable self-attention mechanism	depth estimation monocular depth

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
25	MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps	MapVerse: a large-scale benchmark dataset for evaluating geospatial question answering on real-world maps	spatial relationship large language model multimodal
26	SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos	SurfPhase: a method for reconstructing the 3D interfacial dynamics of two-phase flows from sparse videos	geometric consistency

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
27	Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation	Proposes the MP-HOI framework, which uses multimodal priors to generate high-quality text-driven 3D human-object interaction motion	motion generation human-object interaction HOI

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories	Proposes DeepImageSearch, which uses an agent paradigm to address context-aware image retrieval in visual histories	spatiotemporal multimodal
