cs.CV (2025-10-07)

📊 33 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗4) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (2 🔗1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

# | Title | One-Line Summary | Tags | 🔗
1 | Detection and Measurement of Hailstones with Multimodal Large Language Models | Detects and measures hailstones with multimodal large language models to improve assessment of severe weather events. | large language model, multimodal
2 | Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation | Proposes MeDiM, an MLLM-based discrete diffusion model for unified medical image and text generation. | large language model, foundation model, multimodal
3 | Seeing the Big Picture: Evaluating Multimodal LLMs' Ability to Interpret and Grade Handwritten Student Work | Evaluates multimodal LLMs' ability to interpret and grade handwritten student work. | large language model, multimodal
4 | From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding | Proposes KeyScore, a caption-aware multimodal keyframe scoring method for improved video-language understanding. | multimodal
5 | Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding | Lumina-DiMOO: an omni diffusion large language model for multimodal generation and understanding. | large language model
6 | Multimodal Feature Prototype Learning for Interpretable and Discriminative Cancer Survival Prediction | FeatProto: multimodal feature prototype learning for interpretable and discriminative cancer survival prediction. | multimodal
7 | BioAutoML-NAS: An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data | BioAutoML-NAS: an AutoML framework for multimodal insect classification via neural architecture search. | multimodal
8 | FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders | FoleyGRAM: video-to-audio generation using GRAM-aligned multimodal encoders. | multimodal
9 | SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets | Proposes SD-MVSum, script-driven multimodal video summarization via cross-modal attention. | multimodal
10 | Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis | Survey: applications, taxonomy, and performance analysis of diffusion models for low-light image enhancement. | foundation model, multimodal
11 | Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping | Proposes a scalable deep-fusion framework combining spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping. | multimodal
12 | StereoSync: Spatially-Aware Stereo Audio Generation from Video | StereoSync: a spatially-aware stereo audio generation model for video soundtracks. | foundation model
13 | ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations | Proposes ChainMPQ, which mitigates relation hallucinations via interleaved text-image reasoning chains. | multimodal
14 | Redefining Generalization in Visual Domains: A Two-Axis Framework for Fake Image Detection with FusionDetect | Proposes FusionDetect, fusing CLIP and DINOv2 features to improve generalization in fake image detection. | foundation model

🔬 Pillar 2: RL & Architecture (8 papers)

# | Title | One-Line Summary | Tags | 🔗
15 | HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection | HOI-R1: explores the potential of multimodal large language models for human-object interaction detection. | reinforcement learning, human-object interaction, HOI
16 | Improving Chain-of-Thought Efficiency for Autoregressive Image Generation | Proposes the ShortCoTI framework to improve chain-of-thought efficiency in autoregressive image generation and reduce redundant computation. | reinforcement learning, large language model, foundation model
17 | Towards Robust and Realible Multimodal Misinformation Recognition with Incomplete Modality | Proposes MMLNet to make misinformation recognition robust to missing modalities in multimodal information spread. | contrastive learning, multimodal
18 | GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments | GAZE: a governance-aware pre-annotation pipeline for zero-shot world-model environments. | world model, scene understanding, multimodal
19 | Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics | Midway Network: representation learning for recognition and motion from latent dynamics. | latent dynamics, optical flow, motion latent
20 | When Thinking Drifts: Evidential Grounding for Robust Video Reasoning | Proposes the Visual Evidence Reward (VER) framework to address thinking drift in video reasoning. | reinforcement learning, multimodal, chain-of-thought
21 | VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization | Proposes VideoMiner, which uses a tree structure and reinforcement-learning optimization to tackle keyframe grounding and understanding in hour-long videos. | reinforcement learning, spatiotemporal, large language model
22 | Deforming Videos to Masks: Flow Matching for Referring Video Segmentation | Proposes FlowRVS to address language guidance in video object segmentation. | flow matching

🔬 Pillar 3: Perception & Semantics (8 papers)

# | Title | One-Line Summary | Tags | 🔗
23 | Human3R: Everyone Everywhere All at Once | Human3R: a unified framework for 4D human-scene reconstruction from monocular video, reconstructing multiple people, the scene, and camera trajectories in real time. | depth estimation, scene reconstruction, contact-aware
24 | EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark | EgoNight: the first benchmark for nighttime egocentric vision understanding, targeting VQA in low-light scenes. | depth estimation, egocentric, egocentric vision
25 | Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow | Flow4Agent: long-form video understanding using motion priors from optical flow, improving MLLM performance. | optical flow, large language model, multimodal
26 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Proposes a multimodal automated video editing approach for cutting multi-camera recordings of classical concerts. | scene understanding, multimodal
27 | ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars | ArchitectHead: the first 3D Gaussian head avatar framework supporting continuous level-of-detail control. | 3D gaussian splatting, 3DGS, gaussian splatting
28 | Teleportraits: Training-Free People Insertion into Any Scene | Teleportraits: a training-free person insertion method for compositing people into any scene. | affordance, classifier-free guidance, affordance-aware
29 | Human Action Recognition from Point Clouds over Time | Proposes a 3D human action recognition method based on point cloud sequences and sparse convolutional networks. | depth estimation, monocular depth
30 | Dropping the D: RGB-D SLAM Without the Depth Sensor | DropD-SLAM: monocular RGB SLAM without a depth sensor, achieving RGB-D-level accuracy. | metric depth

🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-Line Summary | Tags | 🔗
31 | Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images | Proposes a diffusion-based method for bimanual 3D hand motion and articulation forecasting, improving prediction accuracy in everyday images. | bi-manual, multimodal
32 | HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video | HoloScene: reconstructs interactive, simulation-ready 3D scenes from a single video. | manipulation, scene understanding

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-Line Summary | Tags | 🔗
33 | Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation | Text2Interact: a high-fidelity, diverse framework for text-driven two-person interaction generation. | two-person interaction, spatiotemporal
