cs.CV(2026-02-23)

📊 36 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗7) · Pillar 2: RL & Architecture (13 🔗2) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

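# | Title | One-sentence summary | Tags | 🔗
1 | Test-Time Computing for Referring Multimodal Large Language Models | Proposes ControlMLLM++, which uses test-time computing to enable region-level visual reasoning in referring MLLMs. | large language model, multimodal
2 | Universal Pose Pretraining for Generalizable Vision-Language-Action Policies | Proposes Pose-VLA, which decouples perception from action alignment in vision-language-action models to improve generalization. | vision-language-action, VLA
3 | MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models | MICON-Bench: benchmarks and enhances multi-image context image generation in unified multimodal models. | large language model, multimodal
4 | Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device | Proposes Mobile-O, a compact model for unified multimodal understanding and generation on mobile devices. | multimodal
5 | Do Large Language Models Understand Data Visualization Rules? | Evaluates how well large language models understand data visualization rules and explores their potential as rule validators. | large language model
6 | StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues | StructXLIP: enhances vision-language models with multimodal structural cues to improve cross-modal retrieval. | multimodal
7 | Do Large Language Models Understand Data Visualization Principles? | Evaluates how well large language models understand data visualization principles and explores their use in chart validation and repair. | large language model
8 | Closing the gap in multimodal medical representation alignment | Proposes a modality-agnostic framework that closes the modality gap in medical multimodal representation alignment. | multimodal
9 | CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning | Proposes cross-level collaborative representation (CLCR) to address semantic misalignment and error propagation in multimodal learning. | multimodal
10 | Vinedresser3D: Agentic Text-guided 3D Editing | Vinedresser3D: proposes an agent-based, text-guided 3D editing framework for high-quality, precise modification of 3D assets. | large language model, multimodal
11 | ApET: Approximation-Error Guided Token Compression for Efficient VLMs | ApET: improves the efficiency of vision-language models through approximation-error-guided token compression. | multimodal
12 | Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness | Proposes plug-and-play modules that improve vision-language model reasoning on rare objects. | foundation model
13 | CountEx: Fine-Grained Counting via Exemplars and Exclusion | CountEx: fine-grained counting via exemplars and exclusion, addressing the tendency of existing methods to confuse objects. | multimodal
14 | PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention | Proposes PA-Attack, which strengthens gray-box attacks on LVLM vision encoders through prototype guidance and attention mechanisms. | multimodal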

🔬 Pillar 2: RL & Architecture (13 papers)

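# | Title | One-sentence summary | Tags | 🔗
15 | M3S-Net: Multimodal Feature Fusion Network Based on Multi-scale Data for Ultra-short-term PV Power Forecasting | M3S-Net: a multimodal fusion network based on multi-scale data for ultra-short-term PV power forecasting. | Mamba, penetration, spatiotemporal
16 | Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis | Proposes a multimodal dataset distillation method based on prototype-guided data synthesis that improves cross-architecture generalization. | distillation, multimodal
17 | Generative 6D Pose Estimation via Conditional Flow Matching | Proposes a conditional flow matching method for 6D pose estimation. | flow matching, 6D pose estimation, feature matching
18 | Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation | Proposes a dual-teacher distillation framework that improves feature representation for multispectral remote sensing imagery. | representation learning, distillation, foundation model
19 | DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation | DerMAE: improves skin lesion classification using conditioned latent diffusion and MAE distillation. | MAE, distillation
20 | RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection | Proposes RL-RIG to address spatial reasoning in image generation. | reinforcement learning, spatial relationship, chain-of-thought
21 | TextShield-R1: Reinforced Reasoning for Tampered Text Detection | Proposes TextShield-R1, a reinforcement-learning-based multimodal large language model for tampered text detection and reasoning. | reinforcement learning, large language model, multimodal
22 | Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery | Proposes the SSR²-GCD framework, which performs multimodal representation learning via semi-supervised rate reduction for generalized category discovery. | representation learning
23 | HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies | Proposes the HOCA-Bench benchmark to evaluate the predictive world modeling ability of video LLMs on ontological-causal anomalies. | world model
24 | Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection | Fore-Mamba3D: Mamba-based foreground-enhanced encoding for 3D object detection. | Mamba
25 | Laplacian Multi-scale Flow Matching for Generative Modeling | LapFlow: proposes a Laplacian multi-scale flow matching method that improves image generation quality and efficiency. | flow matching
26 | UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment | Proposes UrbanAlign to address the alignment of vision-language models with human preferences. | reinforcement learning, PULSE
27 | Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy | Prefer-DAS: learns domain-adaptive segmentation of electron microscopy images from local preferences and sparse prompts. | direct preference optimization, contrastive learning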

🔬 Pillar 3: Perception & Semantics (8 papers)

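# | Title | One-sentence summary | Tags | 🔗
28 | RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing | Proposes RAP to address importance score prediction in 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting
29 | Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting | Proposes the Augmented Radiance Field, which improves 3D Gaussian Splatting rendering quality by explicitly modeling specular highlights. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | Open-vocabulary 3D scene perception in industrial environments | Proposes a training-free open-vocabulary 3D scene perception method for industrial environments. | open-vocabulary, open vocabulary, foundation model
31 | VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments | Proposes VGGT-MPR, which uses a visual geometry Transformer to enhance multimodal place recognition and improve localization accuracy in autonomous driving. | VGGT, multimodal
32 | SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis | SemanticNVS: uses semantic information to enhance scene understanding in generative novel view synthesis. | scene understanding
33 | One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image | One2Scene: generates geometrically consistent, explorable 3D scenes from a single image. | depth estimation, gaussian splatting, splatting
34 | DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces | DICArt: proposes a discrete-diffusion-based method for category-level articulated object pose estimation. | 6D pose estimation, embodied AI
35 | TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding | TraceVision: proposes a trajectory-aware vision-language model for human-like spatial understanding. | scene understanding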

🔬 Pillar 5: Interaction & Reaction (1 paper)

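# | Title | One-sentence summary | Tags | 🔗
36 | TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures | TeHOR: proposes a text-guided framework for textured 3D human and object reconstruction, addressing the challenge of modeling non-contact interactions. | human-object interaction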
