cs.CV (2025-10-07)
📊 33 papers total | 🔗 9 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (14 🔗4)
Pillar 2: RL & Architecture (8 🔗3)
Pillar 3: Spatial Perception & Semantics (8 🔗1)
Pillar 1: Robot Control (2 🔗1)
Pillar 5: Interaction & Reaction (1)
🔬 Pillar 9: Embodied Foundation Models (14 papers)
🔬 Pillar 2: RL & Architecture (8 papers)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection | HOI-R1: explores the potential of multimodal large language models for human-object interaction (HOI) detection. | reinforcement learning human-object interaction HOI | ✅ | |
| 16 | Improving Chain-of-Thought Efficiency for Autoregressive Image Generation | Proposes the ShortCoTI framework to improve chain-of-thought efficiency in autoregressive image generation and cut redundant computation. | reinforcement learning large language model foundation model | | |
| 17 | Towards Robust and Reliable Multimodal Misinformation Recognition with Incomplete Modality | Proposes MMLNet to make misinformation recognition robust to missing modalities in multimodal content. | contrastive learning multimodal | ✅ | |
| 18 | GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments | GAZE: a governance-aware pre-annotation pipeline for zero-shot world model environments. | world model scene understanding multimodal | | |
| 19 | Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics | Midway Network: learns representations for recognition and motion from latent dynamics. | latent dynamics optical flow motion latent | | |
| 20 | When Thinking Drifts: Evidential Grounding for Robust Video Reasoning | Proposes the Visual Evidence Reward (VER) framework to address thought drift in video reasoning. | reinforcement learning multimodal chain-of-thought | | |
| 21 | VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization | Proposes VideoMiner, which uses a tree structure and reinforcement-learning optimization to tackle key-frame extraction and understanding in hour-long videos. | reinforcement learning spatiotemporal large language model | ✅ | |
| 22 | Deforming Videos to Masks: Flow Matching for Referring Video Segmentation | Proposes FlowRVS to address language-guided video object segmentation. | flow matching | | |
🔬 Pillar 3: Spatial Perception & Semantics (8 papers)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | Human3R: Everyone Everywhere All at Once | Human3R: a unified framework for 4D human-scene reconstruction from monocular video, enabling real-time reconstruction of multiple people, the scene, and camera trajectories. | depth estimation scene reconstruction contact-aware | ✅ | |
| 24 | EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark | EgoNight: the first benchmark for nighttime egocentric vision understanding, targeting VQA in low-light scenes. | depth estimation egocentric egocentric vision | | |
| 25 | Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow | Flow4Agent: long-form video understanding using motion priors from optical flow to boost MLLM performance. | optical flow large language model multimodal | | |
| 26 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Proposes a multimodal automated video editing approach for cutting multi-camera recordings of classical concerts. | scene understanding multimodal | | |
| 27 | ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars | ArchitectHead: the first 3D Gaussian head avatar framework with continuous level-of-detail control. | 3D gaussian splatting 3DGS gaussian splatting | | |
| 28 | Teleportraits: Training-Free People Insertion into Any Scene | Teleportraits: a training-free person-insertion method for compositing people into arbitrary scenes. | affordance classifier-free guidance affordance-aware | | |
| 29 | Human Action Recognition from Point Clouds over Time | Proposes a 3D human action recognition method based on point cloud sequences and sparse convolutional networks. | depth estimation monocular depth | | |
| 30 | Dropping the D: RGB-D SLAM Without the Depth Sensor | DropD-SLAM: monocular RGB SLAM without a depth sensor, achieving RGB-D-level accuracy. | metric depth | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 31 | Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images | Proposes a diffusion-based method for forecasting bimanual 3D hand motion and articulation, improving prediction accuracy in everyday images. | bi-manual multimodal | | |
| 32 | HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video | HoloScene: reconstructs interactive, simulation-ready 3D scenes from a single video. | manipulation scene understanding | ✅ | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 33 | Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation | Text2Interact: a framework for high-fidelity, diverse text-driven two-person interaction generation. | two-person interaction spatiotemporal | | |