cs.CV (2025-10-28)

📊 25 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (4) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 3: Perception & Semantics (2 🔗2) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 1 | Perception, Understanding and Reasoning: A Multimodal Benchmark for Video Fake News Detection | Proposes the POVFNDB benchmark for fine-grained evaluation of multimodal LLMs' perception, understanding, and reasoning in video fake news detection. | large language model, multimodal, chain-of-thought | |
| 2 | FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning | FT-ARM: an agentic, self-reflective multimodal LLM for pressure ulcer severity classification. | large language model, multimodal | |
| 3 | Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs | Latent Sketchpad: improves multimodal LLM reasoning by eliciting sketch-based visual thoughts. | large language model, multimodal | |
| 4 | MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition | Proposes MCIHN, a hybrid network built on multi-path cross-modal interaction, to improve multimodal emotion recognition. | multimodal | |
| 5 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Proposes Ming-Flash-Omni, a sparse, unified architecture for multimodal perception and generation. | multimodal | |
| 6 | Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks | Proposes Mars-Bench, a benchmark for evaluating foundation models on Mars science tasks. | foundation model | |
| 7 | SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs | Proposes SCOPE, a saliency- and coverage-oriented visual token pruning method for multimodal LLMs. | large language model, multimodal | |
| 8 | AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts | Proposes AutoPrompt, which uses an LLM to generate adversarial prompts for black-box red-teaming of text-to-image models. | large language model, zero-shot transfer | |
| 9 | Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance | ProMoE: improves diffusion Transformers on image generation through explicit routing guidance. | large language model | |
| 10 | OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents | OSWorld-MCP: a new benchmark for evaluating MCP tool invocation in computer-use agents. | multimodal | |
| 11 | Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2 | Proposes UAP-SAM2, a cross-prompt universal adversarial attack against SAM2. | foundation model | |

🔬 Pillar 2: RL & Architecture (4 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 12 | DeshadowMamba: Deshadowing as 1D Sequential Similarity | DeshadowMamba: casts shadow removal as 1D sequential similarity for more precise deshadowing. | Mamba, state space model, contrastive learning | |
| 13 | UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations | Proposes UHKD, a unified framework for heterogeneous knowledge distillation via frequency-domain representations. | teacher-student, distillation | |
| 14 | The Generation Phases of Flow Matching: a Denoising Perspective | Analyzes the generation phases of flow matching from a denoising perspective, revealing its underlying mechanism. | flow matching | |
| 15 | Fast and accurate neural reflectance transformation imaging through knowledge distillation | Proposes DisK-NeuralRTI, which accelerates high-accuracy neural reflectance transformation imaging via knowledge distillation. | distillation | |

🔬 Pillar 1: Robot Control (3 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 16 | World Simulation with Video Foundation Models for Physical AI | NVIDIA releases Cosmos-Predict2.5 for physical-AI world simulation, achieving high-quality video generation and instruction alignment. | sim2real, reinforcement learning, foundation model | |
| 17 | Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning | Proposes Modality-Aware SAM, which harmonizes multimodal learning via gradient modulation to improve generalization. | manipulation, multimodal | |
| 18 | OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation | Proposes OmniText to address multiple challenges in controllable text-image manipulation. | manipulation, latent optimization | |

🔬 Pillar 6: Video Extraction (2 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 19 | TeleEgo: Benchmarking Egocentric AI Assistants in the Wild | TeleEgo: a long-horizon, streaming multimodal benchmark for egocentric AI assistants in real-world settings. | egocentric | |
| 20 | Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras | Kineo: a calibration-free metric motion capture method using sparse RGB cameras. | markerless, motion capture | |

🔬 Pillar 3: Perception & Semantics (2 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 21 | Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation | Presents the first event-camera microsaccade dataset and achieves high-accuracy recognition with spiking neural networks. | optical flow, spatiotemporal | |
| 22 | Generative View Stitching | Proposes Generative View Stitching (GVS) to resolve collisions and inconsistencies in camera-guided video generation. | affordance | |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 23 | DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery | DogMo: a large-scale multi-view RGB-D dataset for quadruped motion recovery. | motion recovery | |

🔬 Pillar 8: Physics-based Animation (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 24 | Rethinking Visual Intelligence: Insights from Video Pretraining | Video pretraining empowers visual intelligence: an in-depth analysis and application of video diffusion models. | spatiotemporal, large language model, foundation model | |

🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 25 | Group Relative Attention Guidance for Image Editing | Proposes Group Relative Attention Guidance for fine-grained, controllable image editing with Diffusion-in-Transformer models. | classifier-free guidance | |
