cs.CV (2026-03-31)

📊 38 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗2) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 2: RL & Architecture (5) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (2 🔗1) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

# | Title | One-line Summary | Tags | 🔗
1 | Adversarial Prompt Injection Attack on Multimodal Large Language Models | Proposes an imperceptible visual prompt injection attack against multimodal large language models | large language model, multimodal, instruction following
2 | Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism | Proposes FlexMem, a visual memory mechanism that scales long-video understanding in multimodal large language models | large language model, multimodal
3 | Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization | ViTAS: improves multimodal radiology report summarization via selective attention to high-importance regions | multimodal
4 | Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis | Proposes Unify-Agent to tackle long-tail concept generation in world-knowledge-grounded image synthesis | multimodal
5 | Few-shot Writer Adaptation via Multimodal In-Context Learning | Proposes a few-shot handwriting writer adaptation method based on multimodal in-context learning | multimodal
6 | Multimodal Models Meet Presentation Attack Detection on ID Documents | Studies presentation attack detection on ID documents using multimodal models | multimodal
7 | Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions | Proposes an adaptive context compression framework to mitigate LLM performance degradation in long-running interactions | large language model
8 | EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos | EC-Bench: a benchmark for enumeration and counting over ultra-long videos that challenges multimodal large language models | large language model, multimodal
9 | CutClaw: Agentic Hours-Long Video Editing via Music Synchronization | CutClaw: a music-synchronized multi-agent framework for automatic editing of hours-long videos | multimodal
10 | MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network | Proposes the MELT network, improving composed image retrieval by balancing modification frequency and rarity | multimodal
11 | Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding | Omni-NegCLIP enhances CLIP's understanding of negation via front-layer contrastive fine-tuning | multimodal
12 | Diffusion Mental Averages | Proposes Diffusion Mental Averages (DMA), generating sharp, realistic "mental average" images of concepts with diffusion models | multimodal
13 | LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning | LatentPilot: scene-aware vision-and-language navigation that plans ahead via latent visual reasoning | VLN

🔬 Pillar 3: Perception & Semantics (10 papers)

# | Title | One-line Summary | Tags | 🔗
14 | LightHarmony3D: Harmonizing Illumination and Shadows for Object Insertion in 3D Gaussian Splatting | LightHarmony3D: object insertion with harmonized illumination and shadows in 3D Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting
15 | AA-Splat: Anti-Aliased Feed-forward Gaussian Splatting | Proposes AA-Splat to fix rendering artifacts in feed-forward 3DGS novel view synthesis caused by sampling-rate changes | 3D gaussian splatting, 3DGS, gaussian splatting
16 | Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting | Proposes SplatHLoc, hierarchical visual relocalization with feature Gaussian Splatting for improved robustness | gaussian splatting, splatting, feature matching
17 | MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting | MotionScale: reconstructs appearance, geometry, and motion of dynamic scenes with scalable 4D Gaussian Splatting | gaussian splatting, splatting
18 | ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation | Proposes ConInfer, context-aware inference for training-free open-vocabulary remote sensing segmentation | open-vocabulary, open vocabulary
19 | StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision | Proposes StereoVGGT, a training-free visual geometry Transformer that significantly improves stereo matching | depth estimation, monocular depth, VGGT
20 | Hallucination-aware intermediate representation edit in large vision-language models | Proposes a hallucination-aware intermediate representation editing framework that effectively mitigates hallucinations in large vision-language models | scene understanding, multimodal
21 | GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis | Proposes GRVS: a generalizable and recurrent approach to monocular dynamic view synthesis | gaussian splatting, splatting
22 | Extend3D: Town-Scale 3D Generation | Extend3D: a training-free pipeline for town-scale 3D scene generation from a single image | monocular depth
23 | M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding | M2H-MX: a multi-task dense visual perception model for real-time monocular spatial understanding | semantic map

🔬 Pillar 2: RL & Architecture (5 papers)

# | Title | One-line Summary | Tags | 🔗
24 | Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement | TriDerm: assesses multimodal chronic wound embeddings with expert triplet comparisons, improving similar-case retrieval for rare skin conditions | representation learning, large language model, foundation model
25 | SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes | SceneTeract: a framework for agentic functional affordances and VLM grounding in 3D scenes | distillation, scene understanding, affordance
26 | Scaling Video Pretraining for Surgical Foundation Models | SurgRec: scalable surgical video pretraining for building foundation models in the surgical domain | JEPA, MAE, foundation model
27 | Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge | Proposes the QUAD framework for quantizing and deploying multi-LoRA adaptive generative vision models on edge devices | distillation, foundation model
28 | Square Superpixel Generation and Representation Learning via Granular Ball Computing | Proposes square superpixel generation and representation learning via granular ball computing, improving vision task performance | representation learning

🔬 Pillar 1: Robot Control (4 papers)

# | Title | One-line Summary | Tags | 🔗
29 | MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters | Proposes MaskAdapt to tackle flexible motion adaptation for physics-based characters | humanoid, humanoid control, motion adaptation
30 | FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing | Proposes FED-Bench, a cross-granular benchmark for disentangled evaluation of facial expression editing | manipulation, instruction following
31 | Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions | Proposes a tool-guided vision-language model framework addressing systematic VLM biases in visual illusion recognition | manipulation
32 | A2BFR: Attribute-Aware Blind Face Restoration | Proposes the A$^2$BFR framework for attribute-controllable, high-fidelity blind face restoration | manipulation

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line Summary | Tags | 🔗
33 | Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors | Proposes DynMask, complexity-aware masked motion generation via motion spectral descriptors | text-to-motion, motion synthesis, motion generation
34 | Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition | Proposes an emotion diffusion classifier with adaptive margin discrepancy training for more robust facial expression recognition | motion diffusion

🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-line Summary | Tags | 🔗
35 | PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models | PRISM: a multi-view retail video dataset for embodied vision-language models | egocentric, chain-of-thought
36 | Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras | Proposes a novelty filtering method that improves cross-modal retrieval on edge cameras | egocentric

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line Summary | Tags | 🔗
37 | Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection | Leverages synthetic data to enhance egocentric hand-object interaction detection | HOI, egocentric

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line Summary | Tags | 🔗
38 | SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition | SkeletonContext: zero-shot skeleton-based action recognition via skeleton-side context prompt learning | motion representation
