cs.CV (2025-07-06)

📊 19 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗1) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 1: Robot Control (2) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

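# | Title | Key Point | Tags | 🔗
1 | Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models | Proposes multimodal large language models based on multi-agent deep research for multimedia content verification. | large language model, multimodal |
2 | OmniVec2 -- A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | OmniVec2: a novel Transformer network for large-scale multimodal and multitask learning. | multimodal |
3 | SFOOD: A Multimodal Benchmark for Comprehensive Food Attribute Analysis Beyond RGB with Spectral Insights | SFOOD: builds a large-scale multimodal benchmark for food attribute analysis that adds spectral information to go beyond the limits of RGB. | multimodal |
4 | ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition | Proposes the ViTaL dataset and ViTaL-Net for multimodal recognition of multi-pathological ovarian tumors. | multimodal |
5 | ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts | ZERO: an industry-oriented vision foundation model with multi-modal prompts that achieves zero-shot generalization. | foundation model |
6 | CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step | CoT-Diff: strengthens spatial layout alignment in text-to-image generation through chain-of-thought reasoning. | large language model, multimodal |
7 | Computed Tomography Visual Question Answering with Cross-modal Feature Graphing | Proposes a CT visual question answering framework based on cross-modal feature graphs to improve diagnostic accuracy. | large language model, multimodal |
8 | SeqTex: Generate Mesh Textures in Video Sequence | SeqTex: proposes a method for generating mesh textures in video sequences, enabling end-to-end UV texture mapping. | foundation model |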

🔬 Pillar 2: RL & Architecture (6 papers)

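# | Title | Key Point | Tags | 🔗
9 | Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning | Enhances spatial reasoning in vision-language models through chain-of-thought prompting and reinforcement learning. | reinforcement learning, chain-of-thought |
10 | MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture | Proposes MVNet, which combines Mamba and Transformer to improve the accuracy and efficiency of hyperspectral remote sensing image classification. | Mamba, SSM, state space model |
11 | RegistrationMamba: A Mamba-based Registration Framework Integrating Multi-Expert Feature Learning for Cross-Modal Remote Sensing Images | Proposes RegistrationMamba, which integrates multi-expert feature learning to improve the registration accuracy of cross-modal remote sensing images. | Mamba, SSM, state space model |
12 | MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection | MambaFusion: proposes an efficient height-fidelity dense global fusion method for multi-modal 3D object detection. | Mamba, SSM, linear attention |
13 | MambaVideo for Discrete Video Tokenization with Channel-Split Quantization | Proposes a Mamba-based discrete video tokenization method with channel-split quantization that significantly improves video generation quality. | Mamba |
14 | MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation | MPQ-DMv2: flexible residual mixed-precision quantization with temporal distillation for low-bit diffusion models. | distillation |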

🔬 Pillar 1: Robot Control (2 papers)

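# | Title | Key Point | Tags | 🔗
15 | DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | DreamVLA: a vision-language-action model that incorporates comprehensive world knowledge to improve generalization and reasoning in robot manipulation. | manipulation, mutual attention, vision-language-action |
16 | Grounded Gesture Generation: Language, Motion, and Space | Proposes an embodied gesture generation framework built on a multimodal dataset and a physics engine, addressing awareness of the spatial environment. | locomotion, motion generation, multimodal |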

🔬 Pillar 3: Perception & Semantics (2 papers)

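# | Title | Key Point | Tags | 🔗
17 | Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop | Proposes an open-vocabulary 3D object detection method that requires neither human annotation nor gradient-based training. | open-vocabulary, open vocabulary, foundation model |
18 | A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields | Proposes a regularized NeRF training method based on view-consistent sampling, improving novel view synthesis quality in real-world scenes. | depth estimation, NeRF, neural radiance field |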

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key Point | Tags | 🔗
19 | MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization | MVL-Loc: uses a vision-language model for generalizable multi-scene camera relocalization. | spatial relationship, multimodal |
