EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
Authors: Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao, Jaehong Yoon, Hyunji Lee, Gedas Bertasius, Mohit Bansal
Categories: cs.CV, cs.AI, cs.CL
Published: 2026-05-11
Comments: The first two authors contributed equally. Project website: https://egomemreason.github.io/
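💡 One-Sentence Summary
Introduces the EgoMemReason benchmark to address the challenge of memory-driven reasoning in long-horizon egocentric video understanding.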
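🎯 Matched Areas: Pillar 6: Video Extraction and Matching (Video Extraction); Pillar 9: Embodied Foundation Models
Keywords: egocentric video, long-horizon temporal reasoning, multimodal large language models, memory-driven reasoning, embodied AI, video understanding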
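📋 Key Points
- Existing long-video benchmarks focus mainly on perception and recognition, and do not systematically evaluate the ability to integrate evidence and reason across multiple days.
- EgoMemReason is a 500-question benchmark organized around three memory types (entity, event, and behavior) and built on week-long egocentric, life-logging style video.
- Experiments show that current multimodal large language models perform poorly on long-horizon reasoning: the best of the 17 evaluated methods reaches only 39.6% overall accuracy, which points to a large gap in long-range memory handling.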
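📝 Abstract (Summary)
Next-generation visual assistants, such as smart glasses and embodied agents, need to reason over a day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed, which makes memory a core challenge: models must accumulate information, recall prior states, track temporal order, and abstract recurring patterns. Existing benchmarks mostly target perception and recognition and do not evaluate reasoning across multiple days. To address this gap, the paper introduces EgoMemReason, a benchmark that systematically evaluates memory-driven reasoning over week-long egocentric video. It covers three memory types: entity memory (how object states evolve), event memory (recalling and ordering activities separated by hours or days), and behavior memory (abstracting recurring patterns from sparse, repeated observations). The benchmark contains 500 questions, with an average of 5.1 evidence video segments per question and 25.9 hours of memory backtracking. An evaluation of 17 methods spanning multimodal large language models (MLLMs) and agentic frameworks shows that the best model reaches only 39.6% overall accuracy, indicating that long-horizon memory reasoning remains a central bottleneck for current multimodal systems.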
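🔬 Method Details
Problem definition: The paper targets the memory bottleneck in long-horizon (week-long) egocentric video understanding. When relevant information is sparse and spread over hours or days, existing models struggle to integrate evidence, track how object states evolve, and abstract long-term behavior patterns.
Core idea: Build a multi-dimensional memory reasoning benchmark that breaks long-horizon video understanding into three tasks (entity memory, event memory, and behavior memory), so that models must go beyond perception and perform temporal reasoning over evidence from different days.
Framework: EgoMemReason is a test set of 500 questions over week-long video. Answering them requires retrieving information, backtracking through time, and combining evidence from multiple segments, which probes how well a model retains and reasons over memory in long contexts.
Key contribution: The benchmark systematically frames week-long egocentric video understanding around three memory types and reports how far back the evidence reaches (25.9 hours of memory backtracking on average). The analysis finds that performance degrades as evidence spans longer temporal horizons, which gives long-horizon multimodal systems a concrete evaluation target.
Key design: The questions cover six core challenges, ranging from object state tracking to abstraction of recurring behavior patterns. Each question draws on an average of 5.1 evidence video segments, so answering requires linking multiple pieces of evidence rather than locating a single moment.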
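To make the two reported statistics concrete, the following is a minimal sketch of how a question and its evidence could be represented and how averages like "segments per question" and "hours of memory backtracking" could be computed. The field names (`memory_type`, `question_time_h`, `evidence`) and the definition of backtracking as the time from the question back to its earliest evidence segment are assumptions for illustration, not the benchmark's published schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    ENTITY = "entity"      # how object states evolve across days
    EVENT = "event"        # recalling and ordering separated activities
    BEHAVIOR = "behavior"  # recurring patterns from sparse observations


@dataclass(frozen=True)
class Segment:
    start_h: float  # hours since the start of the recording (assumed unit)
    end_h: float


@dataclass(frozen=True)
class Question:
    qid: str
    memory_type: MemoryType
    question_time_h: float          # when the question is asked (assumed field)
    evidence: tuple[Segment, ...]   # segments needed to answer


def backtracking_hours(q: Question) -> float:
    """Hours from the question time back to the earliest evidence (assumed definition)."""
    earliest = min(seg.start_h for seg in q.evidence)
    return q.question_time_h - earliest


def dataset_stats(questions: list[Question]) -> tuple[float, float]:
    """Return (average evidence segments per question, average backtracking hours)."""
    n = len(questions)
    avg_segments = sum(len(q.evidence) for q in questions) / n
    avg_backtrack = sum(backtracking_hours(q) for q in questions) / n
    return avg_segments, avg_backtrack
```

Under this assumed definition, a question asked at hour 30 whose earliest evidence starts at hour 2 has 28 hours of backtracking, regardless of how many segments lie in between.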
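📊 Experimental Highlights
The evaluation covers 17 methods spanning multimodal large language models (MLLMs) and agentic frameworks. Even the best model reaches only 39.6% overall accuracy on EgoMemReason. Performance degrades as the evidence spans longer temporal horizons, and the three memory types fail for distinct reasons, which together indicate that long-horizon memory reasoning is far from solved.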
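The kind of breakdown described above (accuracy per memory type and by evidence span) can be sketched as a simple grouped accuracy computation. The record fields, the 24-hour bucket boundary, and the toy records below are all assumptions for illustration; they are not the paper's evaluation code or results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Compute accuracy per group, where key(record) returns the group label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def span_bucket(rec, day_hours=24.0):
    """Bucket a record by evidence span; the 24-hour cutoff is an assumed choice."""
    return "within_one_day" if rec["span_h"] <= day_hours else "multi_day"


# Toy records, invented for illustration only (not results from the paper).
records = [
    {"memory_type": "entity", "span_h": 3.0, "correct": True},
    {"memory_type": "entity", "span_h": 40.0, "correct": False},
    {"memory_type": "event", "span_h": 12.0, "correct": True},
    {"memory_type": "behavior", "span_h": 90.0, "correct": False},
]

overall = sum(r["correct"] for r in records) / len(records)
by_type = accuracy_by_group(records, key=lambda r: r["memory_type"])
by_span = accuracy_by_group(records, key=span_bucket)
```

Reporting the per-type and per-span numbers alongside the overall score is what lets an analysis separate "fails on behavior memory" from "fails when evidence is days apart", which a single aggregate accuracy would hide.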
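🎯 Application Scenarios
The work is directly relevant to smart glasses, wearable devices, and embodied agents. Better understanding of long-horizon life-logging video could enable more capable personal assistants, for example recalling where a misplaced item was left, summarizing long-term habits, or helping people with cognitive impairments recall their daily activities.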
📄 Abstract (Original)
Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.