A Multimodal Memory Benchmark for Agents
in Human–Human Interactions
Evaluating memory recall, reasoning, and application across dyadic and multi-party multimodal conversations.

Figure 1: Comparison between Human–Assistant Interaction and Human–Human Interaction.
Large language model agents are increasingly deployed in human–human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human–assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants.
However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human–human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application.
Experiments with advanced agents reveal substantial limitations in constructing, retaining, and using memories across modalities, participants, and sessions, leaving considerable room for improvement in next-generation LLM agents.
LLM agents are increasingly deployed as observers in human–human interaction settings. These emerging deployment environments introduce three fundamental challenges.
Human–human conversations are inherently multimodal, naturally interleaving text with visual content such as shared photographs and screen captures.
Natural language exhibits complex phenomena—such as anaphora and discourse deixis—that require agents to resolve references against an evolving conversational memory.
Interactions often involve multiple participants who jointly shape the dialogue, contributing information asynchronously and at times presenting conflicting perspectives.
Unlike traditional human–assistant settings, where a single user directly interacts with an agent, human–human scenarios require agents to passively capture critical conversational information for subsequent querying. This capability underpins growing real-world applications, including clinical documentation systems that generate patient-centered notes from clinician–patient dialogues, AI-powered medical board meeting assistants processing multimodal inputs, and general meeting summarization systems. Robust multimodal memory is therefore essential.
However, existing memory benchmarks largely focus on single-user, text-only human–assistant interactions. Although recent efforts have begun exploring human–human conversations, they remain limited in scope: LoCoMo incorporates vision but is restricted to dyadic interactions and lacks a comprehensive evaluation framework, whereas others support multi-party settings but remain exclusively text-based. No existing benchmark adequately captures the full spectrum of human–human interactions—spanning both dyadic and multi-party settings—while enabling multimodal memory evaluation.
Three core advances set H2HMem apart from existing memory benchmarks.
Introduce H2HMem, a benchmark for evaluating multimodal memory in realistic human–human observer scenarios, covering both dyadic and multi-party interactions.
Construct a large-scale multimodal, multi-session dataset through a privacy-preserving human-in-the-loop pipeline that captures the evolving nature of real-world communication.
Propose a comprehensive evaluation taxonomy spanning recall, reasoning, and application, revealing key limitations of current MLLMs in cross-modal memory alignment and structured reasoning.
A human-in-the-loop generation pipeline for constructing multimodal, multi-session, and multi-participant interactions under an online conversational setting.

Figure 2: Dataset construction pipeline of H2HMem.
Define a structured schema for participant profiles including personality, background, and communication style. Conditioned on this schema, employ DeepSeek-V3 to generate structured participant profiles for both dyadic (2 profiles) and multi-party (4–6 profiles) dialogues.
Summarize eleven common conversational topics. Given participant profiles, prompt the LLM to sample topics and generate multiple session-level outlines, each describing a session’s local events. These sessions are temporally ordered, forming a coherent multi-session scenario. The LLM also generates image retrieval keywords for visual content collection.
Retrieve images through online search, supplementing with text-to-image generation and manual creation or editing based on the keywords. Six annotators then filter and refine the images to align them with the outlines, checking visual content match, image quality (at least 224×224 px), and topical appropriateness. Image refinement took approximately 80 person-hours.
Dialogues are generated using DeepSeek-V3, conditioned on participant profiles, session outlines, and images. Since DeepSeek-V3 cannot process images directly, detailed captions are first generated via GPT-4o. The model generates dialogues and refers to images by numeric identifiers, which are then replaced with the actual images.
Use DeepSeek-V3 to generate a diverse set of questions targeting different memory capabilities (recall, reasoning, application). Visual information is replaced with captions during generation. Generated QA pairs are refined by human annotators to ensure clarity, correctness, and appropriate difficulty. QA validation took approximately 40 person-hours.
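The identifier-substitution step in dialogue generation can be sketched as follows. This is a minimal illustration, not the authors' code: the `[IMG:n]` placeholder syntax, the `resolve_image_refs` helper, and the file paths are all hypothetical, since the description above only states that numeric identifiers are later replaced with actual images.

```python
import re

# Hypothetical placeholder syntax; the source only says "numeric identifiers".
IMAGE_REF = re.compile(r"\[IMG:(\d+)\]")


def resolve_image_refs(utterance: str, images: dict[int, str]) -> tuple[str, list[str]]:
    """Strip image placeholders from an utterance and return the attached image paths.

    Raises KeyError if the text refers to an identifier with no matching image,
    so dangling references are caught instead of silently dropped.
    """
    attached = [images[int(m.group(1))] for m in IMAGE_REF.finditer(utterance)]
    text = IMAGE_REF.sub("", utterance)
    text = " ".join(text.split())  # collapse whitespace left behind by removed tags
    return text, attached


images = {1: "images/recipe.jpg", 2: "images/menu.jpg"}  # illustrative paths
text, attached = resolve_image_refs("Here is the menu [IMG:2] from last night.", images)
```

Failing loudly on an unknown identifier is a design choice for this sketch: a generated dialogue that mentions an image the pipeline never collected is easier to repair when it is caught at this step.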
H2HMem focuses on online conversational environments, where interactions occur via temporally ordered messages, allowing asynchronous participation (as in social media or messaging platforms). This setting offers three key advantages: strong ecological validity, structured information flow, and support for diverse topics and participants yielding richer conversational dynamics.
Dyadic: 20 dialogues with 2 participants each
Average 14.2 sessions per dialogue
Average 18.7 rounds per session
Longer time horizons, evolving relationships
Multi-party: 5 dialogues with 4–6 participants each
Average 5.0 sessions per dialogue
Average 70.5 rounds per session
Denser interactions, conflicting perspectives
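The averages above imply the approximate size of each subset. The short calculation below is a back-of-the-envelope estimate only: its inputs are the rounded averages reported here, so the totals are approximations rather than exact dataset counts.

```python
# Reported statistics (rounded averages from the dataset description).
subsets = {
    "dyadic": {"dialogues": 20, "sessions_per_dialogue": 14.2, "rounds_per_session": 18.7},
    "multi_party": {"dialogues": 5, "sessions_per_dialogue": 5.0, "rounds_per_session": 70.5},
}


def estimate(stats: dict[str, float]) -> tuple[float, float]:
    """Return (estimated sessions, estimated rounds) for one subset."""
    sessions = stats["dialogues"] * stats["sessions_per_dialogue"]
    rounds = sessions * stats["rounds_per_session"]
    return sessions, rounds


totals = {name: estimate(stats) for name, stats in subsets.items()}
for name, (sessions, rounds) in totals.items():
    print(f"{name}: ~{sessions:.0f} sessions, ~{rounds:.0f} rounds")
```

Under these assumptions the dyadic subset has about 284 sessions and roughly 5,300 rounds, while the multi-party subset has about 25 sessions and roughly 1,760 rounds, which matches the description of dyadic dialogues as longer-horizon and multi-party dialogues as denser per session.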
A hierarchical taxonomy of nine task types organized into three categories, providing a comprehensive framework for memory evaluation.

Figure 3: Question type distribution (a) and definition with examples (b) for each task type.
Memory Recall: Evaluates whether models can retrieve explicitly presented multimodal information.
Retrieve information from a single modality (text or image).
Retrieve aligned content across modalities (text↔image).
Retrieve currently correct information after updates across sessions.
Memory Reasoning: Evaluates higher-level inference over multimodal information across time and participants.
Order events across sessions using timestamps and utterance positions.
Infer causal relations between textual and visual content across sessions.
Resolve references and track entity evolution across sessions and speakers.
Memory Application: Evaluates how models apply and update memory during inference.
Adapt to new scenarios at inference time using accumulated memory.
Detect whether a new statement contradicts existing memory.
Refuse to answer when information is absent or cannot be inferred.
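One way to hold this three-by-three taxonomy in code is a nested mapping from category to task. The sketch below is illustrative only: the task keys (such as `cross_modal_retrieval`) are labels invented for this example and are not the benchmark's official task names or abbreviations; the descriptions are copied from the definitions above.

```python
from enum import Enum


class Category(Enum):
    RECALL = "recall"
    REASONING = "reasoning"
    APPLICATION = "application"


# Task keys are hypothetical labels; descriptions come from the taxonomy text.
TAXONOMY: dict[Category, dict[str, str]] = {
    Category.RECALL: {
        "single_modality_retrieval": "Retrieve information from a single modality (text or image).",
        "cross_modal_retrieval": "Retrieve aligned content across modalities.",
        "knowledge_update": "Retrieve currently correct information after updates across sessions.",
    },
    Category.REASONING: {
        "temporal_ordering": "Order events across sessions using timestamps and utterance positions.",
        "causal_inference": "Infer causal relations between textual and visual content across sessions.",
        "reference_tracking": "Resolve references and track entity evolution across sessions and speakers.",
    },
    Category.APPLICATION: {
        "test_time_adaptation": "Adapt to new scenarios at inference time using accumulated memory.",
        "conflict_detection": "Detect whether a new statement contradicts existing memory.",
        "abstention": "Refuse to answer when information is absent or cannot be inferred.",
    },
}


def category_of(task: str) -> Category:
    """Look up which of the three categories a task key belongs to."""
    for category, tasks in TAXONOMY.items():
        if task in tasks:
            return category
    raise KeyError(f"unknown task: {task}")
```

Keeping the category on the outside makes it straightforward to aggregate per-category scores, which is how the results are discussed.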
Comprehensive evaluation of text-based and multimodal memory methods on H2HMem, revealing key bottlenecks and interaction-structure effects.
A consistent gap exists between UPR and CRR. MuRAG drops from 0.6346 to 0.5326 in LLM-as-Judge scores when crossing modalities.
A large recall–precision gap is observed. A-Mem achieves 0.4215 recall but only 0.2206 precision, indicating difficulty filtering noisy multi-participant information.
Reasoning tasks (MCR, RET) show the lowest scores. Near-zero BLEU-1 scores indicate that models rarely reproduce the precise factual phrasing needed to connect distributed evidence.
Conflict Detection remains particularly difficult, with near-zero lexical precision and recall, highlighting current methods' inability to resolve contradictions in human–human interactions.
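To make the recall–precision gap concrete, the sketch below computes token-level precision and recall with clipped unigram counts. It is an assumption that the benchmark's lexical metrics resemble this common formulation; the exact tokenization and scoring are not specified here, and the example sentences are invented. Clipped unigram precision is also the core of BLEU-1 before its brevity penalty.

```python
from collections import Counter


def token_overlap(prediction: str, reference: str) -> tuple[float, float]:
    """Return (precision, recall) over whitespace tokens with clipped counts."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0, 0.0
    # Multiset intersection: each reference token can be matched at most once.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return overlap / len(pred), overlap / len(ref)


# Invented example: a verbose answer that contains the reference plus
# competing statements from other speakers.
reference = "the meeting moved to friday"
verbose = "alice said monday but bob later said the meeting moved to friday at noon"
precision, recall = token_overlap(verbose, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here every reference token is covered, so recall is 1.0, but only 5 of the 14 predicted tokens match, so precision is about 0.357. An answer that carries along unfiltered material from several participants shows the same pattern of higher recall than precision.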
Dyadic dialogues span longer time horizons with more sessions (avg. 14.2 sessions), whereas multi-party dialogues contain denser interactions within fewer sessions (avg. 70.5 rounds/session and 5.0 sessions). This leads to complementary performance patterns:
Consistency-oriented tasks (KR, CD) are substantially harder in multi-party settings due to contradictory signals from multiple speakers. NaiveRAG's KR drops from 0.4896 (dyadic) to 0.2500 (multi-party).
Context-concentrated tasks (CRR, TTL) achieve comparable or higher performance in multi-party settings. Parameter scaling alone does not eliminate this gap, indicating current memory mechanisms remain insufficiently robust.
Manual analysis of 100 failed cross-modal and reasoning instances from three multimodal methods, categorized into four archetypes.
| Error Type | Full (MM) | MuRAG | NGM |
|---|---|---|---|
| Modal Misalignment | 48% | 44% | 46% |
| Speaker-related Errors | 37% | 35% | 32% |
| Temporal Confusion | 15% | 16% | 9% |
| Other / Hallucination | 5% | 5% | 6% |
Tracking multimodal human–human interactions imposes substantial computational burdens. A clear trade-off exists between storage and inference latency.
| Method | Storage (s/session) | Retrieval (s/query) | Answer (s/query) |
|---|---|---|---|
| Full (Text) | 0.0015 | 0.16 | 17.99 |
| NaiveRAG | 0.69 | 1.37 | 10.06 |
| A-Mem | 351.08 | 0.02 | 4.57 |
| Full (MM) | 0.0009 | 0.36 | 26.09 |
| MuRAG | 9.86 | 1.47 | 12.64 |
| NGM | 6.53 | 0.77 | 4.33 |
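
The trade-off in the efficiency table can be read off by adding the columns. The sketch below uses the table's numbers directly; `queries_per_session` is a hypothetical parameter introduced only to illustrate how the one-time storage cost might be amortized, and is not a quantity reported above.

```python
# (storage s/session, retrieval s/query, answer s/query), copied from the table.
costs = {
    "Full (Text)": (0.0015, 0.16, 17.99),
    "NaiveRAG": (0.69, 1.37, 10.06),
    "A-Mem": (351.08, 0.02, 4.57),
    "Full (MM)": (0.0009, 0.36, 26.09),
    "MuRAG": (9.86, 1.47, 12.64),
    "NGM": (6.53, 0.77, 4.33),
}


def per_query_latency(method: str) -> float:
    """Seconds spent at question time: retrieval plus answer generation."""
    _, retrieval, answer = costs[method]
    return retrieval + answer


def amortized_cost(method: str, queries_per_session: float) -> float:
    """Per-query cost when storage time is spread over a session's queries.

    queries_per_session is an illustrative assumption, not a reported value.
    """
    storage, _, _ = costs[method]
    return storage / queries_per_session + per_query_latency(method)


ranked = sorted(costs, key=per_query_latency)
for method in ranked:
    print(f"{method}: {per_query_latency(method):.2f} s/query")
```

By question-time latency alone, A-Mem (4.59 s) and NGM (5.10 s) are fastest and the full-context baselines are slowest (18.15 s for text, 26.45 s for multimodal). Once storage is amortized over a small number of queries per session, A-Mem's 351.08 s of per-session storage dominates its cost, which is the trade-off the table describes.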

Figure 4: Case studies of multimodal conversational reasoning. (a) Identifying ingredients in Lu Zhixing's recipe. (b) Inferring Lin Chang'an's conclusion based on a shared menu.
H2HMem provides a unified framework for evaluating multimodal memory in LLM agents within human–human interactions, assessing memory recall, reasoning, and application.
Experiments show that current methods can retrieve relevant information but remain weak at integrating it. They can recall fragments—images, facts, statements—but fail to align visual evidence with text, attribute information to the correct speaker across sessions, or resolve contradictions from multiple sources.
In multimodal human–human interactions, remembering fragments is not enough: agents must reconstruct coherent multimodal memory from distributed human communications.
Limitations: The dataset is synthetically generated with human-in-the-loop refinement, is of modest scale (25 dialogues, 2,236 QA pairs), and is limited to English. Only a subset of memory methods and MLLM backbones is evaluated. Despite these limitations, H2HMem provides a useful foundation for future research.