cs.CV (2026-04-27)

📊 31 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗3) · Pillar 2: RL & Architecture (6) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 1: Robot Control (2 🔗1) · Pillar 7: Motion Retargeting (2)

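🔬 Pillar 9: Embodied Foundation Models (16 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies	Proposes CF-VLA, which improves the efficiency of vision-language-action policies through a two-stage coarse-to-fine action generation method.	embodied AI vision-language-action VLA	✅
2	QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering	Proposes QEVA, a reference-free evaluation metric for narrative video summarization based on multimodal question answering.	large language model multimodal
3	Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis	Proposes the HPDP framework, which uses hierarchical prototypes and domain priors to improve MIL analysis of multimodal histopathology images.	large language model multimodal
4	Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation	Tuna-2: pixel embeddings outperform vision encoders for multimodal understanding and generation.	multimodal
5	Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction	A large-scale benchmark of pretrained pathology models for breast cancer survival prediction.	foundation model
6	Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation	Proposes the Positive-and-Negative Decoding framework to mitigate object hallucination in vision-language models.	visual grounding
7	EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT	EXACT: an explainable, anomaly-aware vision foundation model for 3D chest CT analysis.	foundation model
8	Robust Grounding with MLLMs against Occlusion and Small Objects via Language-guided Semantic Cues	Proposes language-guided semantic cues to improve the robustness of MLLMs in scenes with occlusion and small objects.	large language model multimodal
9	NeuroClaw Technical Report	NeuroClaw: a domain-specific multi-agent research assistant for executable and reproducible neuroimaging research.	multimodal	✅
10	Meta-CoT: Enhancing Granularity and Generalization in Image Editing	Meta-CoT: enhances granularity and generalization in image editing.	chain-of-thought	✅
11	Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift	Proposes MG-MTTA for test-time adaptation of vision-language models under modality-specific shift.	multimodal
12	Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data	Zero-to-CAD: synthesizes interpretable CAD programs at million scale without real data.	large language model
13	Don't Pause! Every prediction matters in a streaming video	Proposes SPOT-Bench to evaluate the real-time capability of streaming video understanding models, and AsynKV to improve performance.	TAMP
14	SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs	Proposes SMoES, which improves the performance and efficiency of MoE-VLMs through modality-guided expert specialization.	multimodal
15	LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models	LearnPruner: rethinks attention-based token pruning in vision-language models.	large language model
16	GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction	Proposes GoClick, a lightweight GUI element grounding model for autonomous GUI interaction on resource-constrained devices.	visual grounding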

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery	DeepTaxon: an interpretable retrieval-augmented multimodal framework for unified species identification and discovery.	reinforcement learning multimodal chain-of-thought
18	World-R1: Reinforcing 3D Constraints for Text-to-Video Generation	World-R1: a text-to-video generation framework that strengthens 3D constraints through reinforcement learning.	reinforcement learning geometric consistency foundation model
19	Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification	Proposes MVSL, a multi-view synergistic learning framework for low-resource biomedical image classification.	representation learning contrastive learning large language model
20	Self-Supervised Representation Learning via Hyperspherical Density Shaping	Proposes HyDeS, a self-supervised representation learning method based on hyperspherical density shaping.	representation learning
21	See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection	ForeSight: uses low-level visual cues and feedback to improve the reasoning ability of VLMs.	reinforcement learning multimodal
22	POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation	Proposes the POCA framework, which uses Pareto-optimal curriculum alignment to address the trade-off between accuracy and consistency in visual text generation.	reinforcement learning instruction following

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
23	Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images	Proposes TSMNet, which fuses text supervision with visual representations for open-vocabulary semantic segmentation of multimodal remote sensing images.	open-vocabulary open vocabulary multimodal	✅
24	Light 'em Up: Enabling Few-Shot Low-Light 3D Gaussian Splatting with Multi-Scale Explicit Retinex Illumination Decoupling	Proposes MERID-GS, which enables few-shot 3D Gaussian Splatting in low-light conditions through multi-scale Retinex decoupling.	3D gaussian splatting gaussian splatting splatting	✅
25	Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures	LAGRNet: monocular depth estimation with a neural network that has learnable algebraic group and ring structures.	depth estimation monocular depth OMOMO
26	Multivariate Gaussian NeRF for Wide Field-of-View Ultrasound Reconstruction	Proposes Ultra-Wide-NeRF for wide field-of-view ultrasound reconstruction, addressing artifacts and aliasing.	NeRF
27	Computer Vision-Based Early Detection of Container Loss at Sea	Proposes a computer-vision-based system for early detection of container ship instability, to reduce container loss at sea.	optical flow IMoS

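🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
28	DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation	DYMAPIA: a deepfake video detection framework that fuses multi-domain information to improve detection accuracy and efficiency.	manipulation optical flow
29	Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics	Proposes a new method for evaluating CLIP's comprehension of 360-degree textual and visual semantics.	manipulation	✅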

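🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
30	Unconstrained Multi-view Human Pose Estimation with Algebraic Priors	Proposes an unconstrained multi-view human pose estimation framework based on algebraic priors that requires no camera calibration.	human motion
31	PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms	PointTransformerX: efficient and portable 3D point cloud processing without sparse algorithms.	spatial relationship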
