cs.CV (2025-04-24)

📊 31 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗2) · Pillar 2: RL & Architecture (9 🔗2) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (2) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

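| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Proposes TimeSoccer, an end-to-end multimodal large language model for soccer match commentary generation. | large language model, multimodal, TAMP |
| 2 | TRACE: Textual Relevance Augmentation and Contextual Encoding for Multimodal Hate Detection | Proposes the TRACE framework for detecting malicious content on social media. | multimodal, visual grounding |
| 3 | Token Sequence Compression for Efficient Multimodal Computing | Proposes a visual token compression method based on cluster-level token aggregation to improve the efficiency of multimodal computing. | multimodal |
| 4 | Plasma State Monitoring and Disruption Characterization using Multimodal VAEs | Proposes a multimodal-VAE-based method for plasma state monitoring and disruption characterization. | multimodal |
| 5 | Hierarchical and Multimodal Data for Daily Activity Understanding | DARai: a hierarchical multimodal dataset for daily activity understanding that supports counterfactual activity analysis. | multimodal |
| 6 | Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models | Proposes parameter-efficient fine-tuning methods for geospatial foundation models that are smarter rather than harder. | foundation model |
| 7 | Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Proposes the VCBENCH benchmark to evaluate LVLMs on multimodal mathematical reasoning with explicit visual dependency. | multimodal |
| 8 | MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding | Proposes the MASR framework, which improves agent-based video understanding through self-reflective reasoning with multimodal hierarchical attention focusing. | multimodal |
| 9 | FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model | FashionM3: a multi-round, multitask fashion assistant based on a unified vision-language model. | multimodal |
| 10 | Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models | Proposes Token-Shuffle, improving the efficiency and quality of autoregressive models for high-resolution image generation. | large language model, multimodal |
| 11 | DiMeR: Disentangled Mesh Reconstruction Model | DiMeR: proposes a disentangled mesh reconstruction model for 3D reconstruction from sparse views. | foundation model |
| 12 | FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Proposes FRAG, a frame selection augmented generation framework for long video and long document understanding. | multimodal |
| 13 | TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Proposes TimeChat-Online, which addresses redundancy in online video streams through differential token dropping. | large language model |
| 14 | VEU-Bench: Towards Comprehensive Understanding of Video Editing | Proposes VEU-Bench for evaluating and improving video large language models' understanding of video editing. | large language model |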

🔬 Pillar 2: RL & Architecture (9 papers)

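| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 15 | Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs | Proposes the UniME framework, which uses multimodal LLMs to learn universal embeddings and improve cross-modal retrieval performance. | representation learning, distillation, large language model |
| 16 | DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model | DPMambaIR: all-in-one image restoration based on a degradation-aware prompt state space model. | Mamba, SSM, state space model |
| 17 | A Genealogy of Foundation Models in Remote Sensing | Surveys the development of foundation models in remote sensing, exploring multi-sensor fusion and future directions. | representation learning, foundation model |
| 18 | PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition | Proposes PhysioSync to address multimodal synchronization in EEG-based emotion recognition. | dreamer, contrastive learning, multimodal |
| 19 | DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition | Proposes DRC, which enhances personalized image generation through disentangled representation composition and mitigates guidance collapse. | representation learning, large language model, multimodal |
| 20 | Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation | Mamba-Sea: a Mamba-based framework with global-to-local sequence augmentation for generalizable medical image segmentation. | Mamba, state space model |
| 21 | StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies | StereoMamba: real-time, robust stereo disparity estimation for robot-assisted minimally invasive surgery. | Mamba, MAE |
| 22 | STCL: Curriculum learning Strategies for deep learning image steganography models | Proposes the STCL curriculum learning strategy, improving the performance and training efficiency of deep learning image steganography models. | curriculum learning |
| 23 | Masked strategies for images with small objects | Proposes a self-supervised learning method based on masking strategies for images with small objects, improving segmentation and classification performance. | masked autoencoder, MAE |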

🔬 Pillar 3: Perception & Semantics (3 papers)

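| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 24 | Casual3DHDR: Deblurring High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos | Proposes Casual3DHDR for reconstructing high dynamic range scenes. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 25 | The Fourth Monocular Depth Estimation Challenge | The fourth Monocular Depth Estimation Challenge focused on zero-shot generalization and improved depth estimation accuracy in natural and indoor environments. | depth estimation, monocular depth, Depth Anything |
| 26 | Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images | Proposes an occlusion-aware self-supervised monocular depth estimation method for weak-texture endoscopic images. | depth estimation, monocular depth |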

🔬 Pillar 6: Video Extraction (2 papers)

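| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 27 | EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor | EgoCHARM: resource-efficient hierarchical activity recognition using a head-mounted IMU sensor. | egocentric |
| 28 | Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation | Proposes an abstract perspective change framework based on mental imagery simulation to improve perspective-aware reasoning in vision-language models. | egocentric, foundation model |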

🔬 Pillar 1: Robot Control (2 papers)

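| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 29 | Towards Generalized and Training-Free Text-Guided Semantic Manipulation | Proposes GTF, a generalized, training-free method for text-guided semantic manipulation. | manipulation |
| 30 | Step1X-Edit: A Practical Framework for General Image Editing | Step1X-Edit: a practical framework for general image editing, with performance comparable to GPT-4o and Gemini2 Flash. | manipulation, multimodal |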

🔬 Pillar 5: Interaction & Reaction (1 paper)

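| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 31 | PICO: Reconstructing 3D People In Contact with Objects | PICO: reconstructs 3D people in contact with objects, resolving the depth ambiguity of HOI in natural images. | human-object interaction, HOI, SMPL |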
