cs.CV (2026-05-13)

📊 38 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗2) · Pillar 3: Perception & Semantics (7) · Pillar 4: Generative Motion (5 🔗1) · Pillar 2: RL & Architecture (5 🔗2) · Pillar 1: Robot Control (3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 6: Video Extraction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

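| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 1 | Learning to See What You Need: Gaze Attention for Multimodal Large Language Models | Proposes the Gaze Attention mechanism to improve the efficiency and performance of visual attention in multimodal large language models | large language model, multimodal | |
| 2 | VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence | VoxCor: a training-free volumetric feature method for multimodal voxel correspondence | foundation model, multimodal | |
| 3 | ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence | ViDR: a framework for generating multimodal deep research reports grounded in source visual evidence | large language model, multimodal | |
| 4 | Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics | Proposes SIAA to address the vulnerability of frozen foundation models in synthetic image forensics | foundation model | |
| 5 | LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters | LoREnc: protects foundation models and LoRA adapters with low-rank encryption to prevent model leakage | foundation model | |
| 6 | Dual-Pathway Circuits of Object Hallucination in Vision-Language Models | Proposes a dual-pathway circuit analysis to address object hallucination in vision-language models | multimodal, visual grounding | |
| 7 | Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling | Proposes Edit-Compass and EditReward-Compass for unified evaluation of image editing models and reward models | multimodal, instruction following | |
| 8 | Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering | Proposes the GSEC framework, which reduces the bias and variance of image clustering through generative semantic guidance and bi-layer ensemble learning | large language model, multimodal | |
| 9 | Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context | Proposes MMProLong, which improves vision-language model performance on tasks such as long-document understanding through efficient long-context continued pretraining | multimodal | |
| 10 | Weakly Supervised Segmentation as Semantic-Based Regularization | Proposes a weakly supervised segmentation method built on semantic-based regularization that improves pseudo-label quality and segmentation accuracy | foundation model | |
| 11 | SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification | Proposes the SpurAudio benchmark for studying shortcut learning in few-shot audio classification | foundation model | |
| 12 | FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition | Proposes FIKA-Bench for evaluating a model's knowledge acquisition ability in fine-grained recognition | multimodal | |
| 13 | Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation | Uses early semantic information in image editing models to achieve zero-shot referring image segmentation | language conditioned | |
| 14 | CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation | Proposes CRePE for video generation controlled by a unified camera model, improving geometric awareness | foundation model | |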

🔬 Pillar 3: Perception & Semantics (7 papers)

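| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 15 | GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting | GuardMarkGS: a unified ownership-tracing and edit-deterrence framework for 3D Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 16 | HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization | Proposes HarmoGS, which achieves robust 3D Gaussian Splatting in complex scenes through conflict-aware gradient harmonization | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 17 | OCH3R: Object-Centric Holistic 3D Reconstruction | OCH3R: an object-centric holistic 3D reconstruction framework for monocular RGB images | depth estimation, monocular depth, metric depth | |
| 18 | Z-Order Transformer for Feed-Forward Gaussian Splatting | Proposes a feed-forward Gaussian Splatting method based on a Z-Order Transformer to accelerate high-quality novel view synthesis | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 19 | Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs | Proposes SurgMLLM, which unifies reasoning and segmentation in surgical scene understanding through multimodal large language models | scene understanding, large language model, multimodal | |
| 20 | RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering | RoSplat: a robust feed-forward pixel-wise Gaussian Splatting method that handles varying input views and high-resolution rendering | 3D gaussian splatting, gaussian splatting, splatting | |
| 21 | PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World | Proposes PanoWorld, which improves the spatial understanding of MLLMs on 360° panoramic images through spherical spatial cross-attention | scene understanding, multimodal | |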

🔬 Pillar 4: Generative Motion (5 papers)

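| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 22 | Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation | Proposes the CMC framework to resolve multi-condition conflicts and inconsistent representations in trajectory-controlled human motion generation | text-to-motion, motion generation, human motion | |
| 23 | Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation | Proposes a stylized text-to-motion generation method based on hypernetwork-driven low-rank adaptation | motion diffusion model, motion diffusion, text-to-motion | |
| 24 | ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin | Proposes ArcVQ-VAE, which improves the quality of discrete representations in image modeling through a spherical vector quantization framework | VQ-VAE | |
| 25 | HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation | HetScene: a heterogeneity-aware diffusion model for dense indoor scene generation | physically plausible, embodied AI | |
| 26 | AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects | AssemblyBench: a synthetic dataset for physics-aware assembly of complex industrial objects, together with the AssemblyDyno model | physically plausible, multimodal | |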

🔬 Pillar 2: RL & Architecture (5 papers)

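| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 27 | Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting | Proposes SCOUP, which decouples language representation learning from 3D Gaussian optimization for efficient 3D language Gaussian Splatting | representation learning, 3D gaussian splatting, gaussian splatting | |
| 28 | STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition | Proposes the STAR framework, which addresses semantic-temporal misalignment in few-shot action recognition through semantic-temporal adaptive representation learning | Mamba, representation learning, large language model | |
| 29 | BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability | BrainAnytime: anatomy-aware cross-modal pretraining for brain image analysis with arbitrary modality availability | masked autoencoder, distillation, foundation model | |
| 30 | AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation | AnyFlow: an any-step video diffusion model based on flow map distillation that addresses the performance degradation of consistency-distilled models under multi-step sampling | distillation | |
| 31 | GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models | Proposes GRIP-VLM, which uses reinforcement learning for group-relative importance pruning to improve the efficiency of vision-language models | reinforcement learning, multimodal | |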

🔬 Pillar 1: Robot Control (3 papers)

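| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 32 | Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes | Proposes Real2Sim to address the reality gap in autonomous driving scene generation | real2sim, policy learning, gaussian splatting | |
| 33 | Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection | Proposes a lightweight face presentation attack detection method based on optical flow augmentation and knowledge distillation | manipulation, distillation, optical flow | |
| 34 | CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy | CoGE: a sim-to-real online geometric estimation framework for monocular colonoscopy | sim-to-real, depth estimation, scene reconstruction | |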

🔬 Pillar 8: Physics-based Animation (2 papers)

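| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 35 | Weakly-Supervised Spatiotemporal Anomaly Detection | Proposes a weakly supervised spatiotemporal anomaly detection method trained using only video-level labels | spatiotemporal | |
| 36 | DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution | DiffST: a spatiotemporal-aware diffusion model for real-world space-time video super-resolution | spatiotemporal | |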

🔬 Pillar 6: Video Extraction (1 paper)

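| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 37 | EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing | EgoForce: robust online egocentric motion reconstruction via diffusion forcing | egocentric, motion reconstruction | |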

🔬 Pillar 7: Motion Retargeting (1 paper)

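| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 38 | Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation | Proposes Seg-Agent, which enables training-free language-guided segmentation through test-time multimodal reasoning | spatial relationship, large language model, multimodal | |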
