cs.CV (2025-09-26)

📊 30 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗1) · Pillar 2: RL & Architecture (10 🔗5) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 8: Physics-based Animation (1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

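# | Title | Key point | Tags | 🔗
1 | Explaining multimodal LLMs via intra-modal token interactions | Improves the interpretability of multimodal LLMs through intra-modal token interactions | large language model, multimodal |
2 | JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation | JanusVLN: decouples semantic and spatial information with dual implicit memory to improve vision-language navigation performance | VLN, large language model, multimodal |
3 | Introducing Multimodal Paradigm for Learning Sleep Staging PSG via General-Purpose Model | Proposes a new sleep-staging paradigm based on a general-purpose multimodal model, improving the accuracy and robustness of PSG analysis | multimodal |
4 | Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results | Uses GPT-4o with optimized prompt engineering to tackle multimodal disinformation detection | multimodal |
5 | MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning | Proposes MILR, a test-time latent reasoning method that improves multimodal image generation quality | multimodal |
6 | DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images | Proposes DeHate, a Stable Diffusion-based multimodal method for mitigating hate speech in images | multimodal |
7 | DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation | DynaNav: a dynamic feature and layer selection framework for efficient visual navigation | embodied AI, foundation model |
8 | FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning | FishAI 2.0: marine fish image classification with multimodal few-shot learning | large language model, multimodal |
9 | LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision | Proposes Labeling Copilot, a deep research agent for automated data annotation in computer vision | foundation model, multimodal |
10 | UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning | Proposes the UML-CoT framework, which uses UML for structured reasoning and planning in robotic room-cleaning tasks | large language model, chain-of-thought |
11 | Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation | EAGLE: a lightweight black-box framework that explains autoregressive token generation in multimodal large language models | large language model, multimodal |
12 | Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models | Proposes Adaptive Global Context Injection (AGCI) to address spatial bias in large vision-language models | large language model, multimodal |
13 | CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process | CircuitSense: a circuit system benchmark bridging visual comprehension and symbolic reasoning in engineering design | large language model |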

🔬 Pillar 2: RL & Architecture (10 papers)

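# | Title | Key point | Tags | 🔗
14 | On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations | Proposes RobustVLA, which strengthens the robustness of vision-language-action models under multimodal perturbations | flow matching, vision-language-action, VLA |
15 | Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging | Proposes a transfer-learning-based multimodal slice interaction network for precise IGTV segmentation in lung cancer PET/CT images | Mamba, multimodal |
16 | TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses | Proposes TRUST, which performs test-time refinement with uncertainty-guided SSM traversals to improve robustness under distribution shift | Mamba, SSM, state space model |
17 | SPARK: Synergistic Policy And Reward Co-Evolving Framework | SPARK: a reinforcement learning framework for LLMs/LVLMs in which policy and reward co-evolve synergistically | reinforcement learning, RLHF, large language model |
18 | PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning | Proposes PSTTS, a plug-and-play module that reduces computational redundancy in spatio-temporal representation learning on event data | Mamba, representation learning |
19 | VideoScore2: Think before You Score in Generative Video Evaluation | VideoScore2: a multi-dimensional, interpretable evaluation framework for generated video that improves evaluation accuracy and controllability | reinforcement learning, chain-of-thought |
20 | CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning | Proposes CapRL, which uses reinforcement learning to improve the density and utility of image captions | reinforcement learning |
21 | NIFTY: a Non-Local Image Flow Matching for Texture Synthesis | NIFTY: a non-local image flow matching method for texture synthesis | flow matching |
22 | Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models | Proposes a rule-based reinforcement learning method that improves generalization in document image classification | reinforcement learning |
23 | ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models | Proposes ERGO, which improves vision-language model performance through efficient high-resolution visual understanding | reinforcement learning, multimodal |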

🔬 Pillar 3: Perception & Semantics (5 papers)

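# | Title | Key point | Tags | 🔗
24 | Learning Unified Representation of 3D Gaussian Splatting | Proposes a unified representation for 3D Gaussian splats based on continuous submanifold fields | 3D gaussian splatting, 3DGS, gaussian splatting |
25 | Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics | Proposes a lightweight structured multimodal reasoning framework for clinical scene understanding in robotics | scene understanding, multimodal, chain-of-thought |
26 | GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting | GaussianVision: vision-language alignment from compressed image representations using 2D Gaussian splatting | gaussian splatting, splatting, multimodal |
27 | EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model | EfficientDepth: a fast, detail-preserving monocular depth estimation model | depth estimation, monocular depth, geometric consistency |
28 | CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach | Proposes CCNeXt, an efficient self-supervised stereo depth estimation method suited to compute-constrained settings | depth estimation, stereo depth |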

🔬 Pillar 8: Physics-based Animation (1 paper)

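# | Title | Key point | Tags | 🔗
29 | Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs | Proposes the DeeptraceReward benchmark, which uses multimodal LLMs to learn the traces of fakeness that humans perceive in AI-generated videos | spatiotemporal, multimodal, TAMP |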

🔬 Pillar 1: Robot Control (1 paper)

# | Title | Key point | Tags | 🔗
30 | MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning | MesaTask: a task-driven tabletop scene generation framework based on 3D spatial reasoning | manipulation, DPO, physically plausible |
