cs.CV (2026-02-23)

📊 36 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗7) Pillar 2: RL & Architecture (13 🔗2) Pillar 3: Perception & Semantics (8 🔗2) Pillar 5: Interaction & Reaction (1)

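🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Test-Time Computing for Referring Multimodal Large Language Models	Proposes ControlMLLM++, which uses test-time computation to enable region-level visual reasoning in referring MLLMs.	large language model multimodal	✅
2	Universal Pose Pretraining for Generalizable Vision-Language-Action Policies	Proposes Pose-VLA, which decouples perception from action alignment in vision-language-action models to improve generalization.	vision-language-action VLA
3	MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models	MICON-Bench: benchmarking and enhancing multi-image context image generation in unified multimodal models.	large language model multimodal	✅
4	Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device	Proposes Mobile-O, a compact model for unified multimodal understanding and generation on mobile devices.	multimodal	✅
5	Do Large Language Models Understand Data Visualization Rules?	Evaluates how well large language models understand data visualization rules and explores their potential as rule validators.	large language model
6	StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues	StructXLIP: enhances vision-language models with multimodal structural cues to improve cross-modal retrieval performance.	multimodal	✅
7	Do Large Language Models Understand Data Visualization Principles?	Evaluates how well large language models understand data visualization principles and explores their use in chart validation and repair.	large language model
8	Closing the gap in multimodal medical representation alignment	Proposes a modality-agnostic framework that closes the modality gap in medical multimodal representation alignment.	multimodal
9	CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning	Proposes the cross-level collaborative representation (CLCR) method to address semantic misalignment and error propagation in multimodal learning.	multimodal
10	Vinedresser3D: Agentic Text-guided 3D Editing	Vinedresser3D: proposes an agent-based, text-guided 3D editing framework for high-quality, precise 3D asset modification.	large language model multimodal
11	ApET: Approximation-Error Guided Token Compression for Efficient VLMs	ApET: improves vision-language model efficiency through approximation-error-guided token compression.	multimodal	✅
12	Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness	Proposes plug-and-play modules that improve vision-language model reasoning on rare objects.	foundation model
13	CountEx: Fine-Grained Counting via Exemplars and Exclusion	CountEx: fine-grained counting via exemplars and exclusion, addressing the tendency of existing methods to confuse objects.	multimodal	✅
14	PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention	Proposes PA-Attack, which strengthens gray-box attacks on LVLM vision encoders through prototype guidance and attention mechanisms.	multimodal	✅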

🔬 Pillar 2: RL & Architecture (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
15	M3S-Net: Multimodal Feature Fusion Network Based on Multi-scale Data for Ultra-short-term PV Power Forecasting	M3S-Net: a multimodal fusion network based on multi-scale data for ultra-short-term photovoltaic (PV) power forecasting.	Mamba penetration spatiotemporal	✅
16	Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis	Proposes a multimodal dataset distillation method based on prototype-guided data synthesis, improving cross-architecture generalization.	distillation multimodal
17	Generative 6D Pose Estimation via Conditional Flow Matching	Proposes a conditional flow matching method for 6D pose estimation.	flow matching 6D pose estimation feature matching	✅
18	Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation	Proposes a dual-teacher distillation framework that improves feature representation for multispectral remote sensing imagery.	representation learning distillation foundation model
19	DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation	DerMAE: improves skin lesion classification using conditioned latent diffusion and MAE distillation.	MAE distillation
20	RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection	Proposes RL-RIG to address spatial reasoning in image generation.	reinforcement learning spatial relationship chain-of-thought
21	TextShield-R1: Reinforced Reasoning for Tampered Text Detection	Proposes TextShield-R1, a reinforcement-learning-based multimodal large language model for tampered text detection and reasoning.	reinforcement learning large language model multimodal
22	Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery	Proposes the SSR²-GCD framework, which performs multimodal representation learning via semi-supervised rate reduction for generalized category discovery.	representation learning
23	HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies	Proposes the HOCA-Bench benchmark to evaluate the predictive world modeling ability of video LLMs on ontological-causal anomalies.	world model
24	Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection	Fore-Mamba3D: Mamba-based foreground-enhanced encoding for 3D object detection.	Mamba
25	Laplacian Multi-scale Flow Matching for Generative Modeling	LapFlow: proposes a Laplacian multi-scale flow matching method that improves image generation quality and efficiency.	flow matching
26	UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment	Proposes UrbanAlign to address the alignment of vision-language models with human preferences.	reinforcement learning PULSE
27	Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy	Prefer-DAS: learns domain-adaptive segmentation of electron microscopy images from local preferences and sparse prompts.	direct preference optimization contrastive learning

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
28	RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing	Proposes RAP for importance score prediction in 3D Gaussian Splatting.	3D gaussian splatting 3DGS gaussian splatting	✅
29	Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting	Proposes the Augmented Radiance Field, which improves 3D Gaussian Splatting rendering quality by explicitly modeling specular highlights.	3D gaussian splatting 3DGS gaussian splatting	✅
30	Open-vocabulary 3D scene perception in industrial environments	Proposes a training-free open-vocabulary 3D scene perception method for industrial environments.	open-vocabulary open vocabulary foundation model
31	VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments	Proposes VGGT-MPR, which uses a visual geometry Transformer (VGGT) to enhance multimodal place recognition and improve localization accuracy in autonomous driving.	VGGT multimodal
32	SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis	SemanticNVS: uses semantic information to improve scene understanding in generative novel view synthesis.	scene understanding
33	One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image	One2Scene: generates geometrically consistent, explorable 3D scenes from a single image.	depth estimation gaussian splatting splatting
34	DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces	DICArt: proposes a discrete-diffusion-based method for category-level articulated object pose estimation.	6D pose estimation embodied AI
35	TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding	TraceVision: proposes a trajectory-aware vision-language model for human-like spatial understanding.	scene understanding

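🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
36	TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures	TeHOR: proposes a text-guided framework for reconstructing textured 3D humans and objects, addressing the challenge of modeling non-contact interactions.	human-object interaction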
