H2HMem

A Multimodal Memory Benchmark for Agents
in Human–Human Interactions

Evaluating memory recall, reasoning, and application across dyadic and multi-party multimodal conversations.

Figure 1: Comparison between Human–Assistant Interaction and Human–Human Interaction.

Abstract

Large language model agents are increasingly deployed in human–human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human–assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants.

However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human–human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application.

Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, leaving considerable room for improvement in next-generation LLM agents.

Introduction

LLM agents are increasingly deployed as observers in human–human interaction settings. These emerging deployment environments introduce three fundamental challenges.

Multimodal Nature

Human–human conversations are inherently multimodal, naturally interleaving text with visual content such as shared photographs and screen captures.

Complex Discourse

Natural language exhibits complex phenomena—such as anaphora and discourse deixis—that require agents to resolve references against an evolving conversational memory.

Multiple Participants

Interactions often involve multiple participants who jointly shape the dialogue, contributing information asynchronously and at times presenting conflicting perspectives.

Unlike traditional human–assistant settings, where a single user directly interacts with an agent, human–human scenarios require agents to passively capture critical conversational information for subsequent querying. This capability underpins growing real-world applications, including clinical documentation systems that generate patient-centered notes from clinician–patient dialogues, AI-powered medical board meeting assistants processing multimodal inputs, and general meeting summarization systems. Robust multimodal memory is therefore essential.

However, existing memory benchmarks largely focus on single-user, text-only human–assistant interactions. Although recent efforts have begun exploring human–human conversations, they remain limited in scope: LoCoMo incorporates vision but is restricted to dyadic interactions and lacks a comprehensive evaluation framework, whereas others support multi-party settings but remain exclusively text-based. No existing benchmark adequately captures the full spectrum of human–human interactions—spanning both dyadic and multi-party settings—while enabling multimodal memory evaluation.

Key Contributions

Three core advances that set H2HMem apart from existing memory benchmarks.

1. Novel Benchmark

Introduce H2HMem, a benchmark for evaluating multimodal memory in realistic human–human observer scenarios, covering both dyadic and multi-party interactions.

2. Privacy-Preserving Pipeline

Construct a large-scale multimodal, multi-session dataset through a privacy-preserving human-in-the-loop pipeline that captures the evolving nature of real-world communication.

3. Comprehensive Evaluation

Propose a comprehensive evaluation taxonomy spanning recall, reasoning, and application, revealing key limitations of current MLLMs in cross-modal memory alignment and structured reasoning.

H2HMem Benchmark

A human-in-the-loop generation pipeline for constructing multimodal, multi-session, and multi-participant interactions under an online conversational setting.

25 Dialogues
2,236 QA Pairs
9 Task Types

Dataset Construction Pipeline

Figure 2: Dataset construction pipeline of H2HMem.

Pipeline Stages in Detail

1. Participant Profile Generation

Define a structured schema for participant profiles including personality, background, and communication style. Conditioned on this schema, employ DeepSeek-V3 to generate structured participant profiles for both dyadic (2 profiles) and multi-party (4–6 profiles) dialogues.
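A minimal sketch of what such a profile schema might look like. The class, the field names, and the `validate_group` helper are illustrative assumptions; the source names only personality, background, and communication style as schema components, plus the group sizes.

```python
from dataclasses import dataclass


@dataclass
class ParticipantProfile:
    # Hypothetical fields; the benchmark's actual schema is not given in the text.
    name: str
    personality: str
    background: str
    communication_style: str


def validate_group(profiles: list[ParticipantProfile], multi_party: bool) -> bool:
    """Check the group size: 2 profiles for dyadic, 4-6 for multi-party dialogues."""
    n = len(profiles)
    return 4 <= n <= 6 if multi_party else n == 2
```

A generator conditioned on this schema could then reject any LLM output that does not parse into the expected number of profiles.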

2. Scenario Construction

Summarize eleven common conversational topics. Given participant profiles, prompt the LLM to sample topics and generate multiple session-level outlines, each describing a session’s local events. These sessions are temporally ordered, forming a coherent multi-session scenario. The LLM also generates image retrieval keywords for visual content collection.

3. Image Collection & Human Refinement

Retrieve images through online search, supplemented by text-to-image generation and manual creation or editing based on the keywords. Six annotators then filter and refine the images so that they align with the session outlines, checking visual content match, image quality (at least 224×224 px), and topical appropriateness. Image refinement took approximately 80 person-hours.

4. Image Captioning & Dialogue Generation

Dialogues are generated using DeepSeek-V3, conditioned on participant profiles, session outlines, and images. Since DeepSeek-V3 cannot process images directly, detailed captions are generated via GPT-4o. The agent generates dialogues and refers to images using numeric identifiers, which are then replaced with actual images.
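The identifier-replacement step can be sketched as follows. The placeholder format `[IMG_3]` and the function name `attach_images` are assumptions made for illustration; the source only says that numeric identifiers are later replaced with the actual images.

```python
import re

# Hypothetical placeholder syntax: the real pipeline's identifier format is not specified.
PLACEHOLDER = re.compile(r"\[IMG_(\d+)\]")


def attach_images(utterance: str, images: dict[int, str]) -> list[str]:
    """Split an utterance into an ordered list of text spans and image paths."""
    parts: list[str] = []
    pos = 0
    for match in PLACEHOLDER.finditer(utterance):
        text = utterance[pos:match.start()].strip()
        if text:
            parts.append(text)
        image_id = int(match.group(1))
        # Keep the raw placeholder if the identifier has no matching image.
        parts.append(images.get(image_id, match.group(0)))
        pos = match.end()
    tail = utterance[pos:].strip()
    if tail:
        parts.append(tail)
    return parts
```

Keeping unknown identifiers as raw placeholders makes dangling references easy to spot during human review.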

5. Question-Answer Pair Construction

Use DeepSeek-V3 to generate a diverse set of questions targeting different memory capabilities (recall, reasoning, application). Visual information is replaced with captions during generation. Human annotators then refine the generated QA pairs to ensure clarity, correctness, and appropriate difficulty. QA validation took approximately 40 person-hours.

Online Conversational Setting

H2HMem focuses on online conversational environments, where interactions occur via temporally ordered messages, allowing asynchronous participation (as in social media or messaging platforms). This setting offers three key advantages: strong ecological validity, structured information flow, and support for diverse topics and participants yielding richer conversational dynamics.
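One way to picture this setting is as a stream of timestamped messages that an observing agent consumes in order. The `Message` record and the `observe` helper below are hypothetical names used only for illustration, not the benchmark's data format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Message:
    # Hypothetical record layout for one message in an online conversation.
    session_id: int
    timestamp: datetime
    speaker: str
    text: str
    image_path: Optional[str] = None  # set when the message shares an image


def observe(messages: list[Message]) -> list[Message]:
    """Return messages in the temporal order an observing agent would see them."""
    return sorted(messages, key=lambda m: (m.timestamp, m.session_id))
```

Because participation is asynchronous, ordering by timestamp rather than by speaker turn is what keeps the information flow structured.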

Interaction Types

Dyadic Interactions

20 dialogues with 2 participants each

Average 14.2 sessions per dialogue

Average 18.7 rounds per session

Longer time horizons, evolving relationships

Multi-Party Interactions

5 dialogues with 4–6 participants each

Average 5.0 sessions per dialogue

Average 70.5 rounds per session

Denser interactions, conflicting perspectives

Task Taxonomy

A hierarchical taxonomy of nine task types organized into three categories, providing a comprehensive framework for memory evaluation.

Figure 3: Question type distribution (a) and definition with examples (b) for each task type.

Memory Recall

Evaluates whether models can retrieve explicitly presented multimodal information.

UPR: Unimodal Precise Recall

Retrieve information from a single modality (text or image).

CRR: Cross-modal Related Retrieval

Retrieve aligned content across modalities (text↔image).

KR: Knowledge Resolution

Retrieve currently correct information after updates across sessions.

Memory Reasoning

Evaluates higher-level inference over multimodal information across time and participants.

TR: Temporal Reasoning

Order events across sessions using timestamps and utterance positions.

MCR: Multimodal Causal Reasoning

Infer causal relations between textual and visual content across sessions.

RET: Reference & Evolution Tracking

Resolve references and track entity evolution across sessions and speakers.

Memory Application

Evaluates how models apply and update memory during inference.

TTL: Test-Time Learning

Adapt to new scenarios at inference time using accumulated memory.

CD: Conflict Detection

Detect whether a new statement contradicts existing memory.

AR: Answer Refusal

Refuse to answer when information is absent or cannot be inferred.
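The taxonomy above can be written down as a small lookup structure, which is convenient when grouping per-task scores by category. The variable and function names here are illustrative assumptions; the categories, abbreviations, and task names come from the list above.

```python
TAXONOMY = {
    "Memory Recall": {
        "UPR": "Unimodal Precise Recall",
        "CRR": "Cross-modal Related Retrieval",
        "KR": "Knowledge Resolution",
    },
    "Memory Reasoning": {
        "TR": "Temporal Reasoning",
        "MCR": "Multimodal Causal Reasoning",
        "RET": "Reference & Evolution Tracking",
    },
    "Memory Application": {
        "TTL": "Test-Time Learning",
        "CD": "Conflict Detection",
        "AR": "Answer Refusal",
    },
}


def category_of(task: str) -> str:
    """Return the category a task abbreviation belongs to."""
    for category, tasks in TAXONOMY.items():
        if task in tasks:
            return category
    raise KeyError(f"unknown task type: {task}")
```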

Experiments

Comprehensive evaluation of text-based and multimodal memory methods on H2HMem, revealing key bottlenecks and interaction-structure effects.

Experimental Setup

Text-based Methods: Full Memory (Text), NaiveRAG, A-Mem
Multimodal Methods: Full Memory (MM), MuRAG, NGM
Backbone Models: Qwen2.5-VL (3B & 7B Instruct), GPT-4.1-Nano
Evaluation Metric: LLM-as-Judge (GPT-4o-mini, κ=0.84), plus Precision/Recall/F1/BLEU-1
Retrieval: Dense retriever with default top-K=5
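For readers unfamiliar with dense retrieval, the sketch below shows the general idea of selecting the top-K memory entries by cosine similarity. It is a generic illustration, not the retriever used in the experiments: the embedding model is not specified in the text, so vectors are passed in directly, and the function names are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], memory: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Return the ids of the k memory entries most similar to the query vector."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [entry_id for entry_id, _ in ranked[:k]]
```

The default `k=5` mirrors the top-K setting listed above.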

Key Findings

1. Cross-modal Alignment Remains Challenging

A consistent gap exists between UPR and CRR. MuRAG drops from 0.6346 to 0.5326 in LLM-as-Judge scores when crossing modalities.

2. Weak Distractor Filtering Despite Successful Retrieval

A large recall–precision gap is observed. A-Mem achieves 0.4215 recall but only 0.2206 precision, indicating difficulty filtering noisy multi-participant information.
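To see how such a gap can arise, consider a token-overlap formulation of precision and recall. This is a common definition for QA evaluation and is assumed here; the text does not state exactly how the benchmark computes these metrics.

```python
from collections import Counter


def token_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 (assumed metric definition)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)


# The answer contains every reference token but adds unrelated details,
# so recall is perfect while precision is low.
p, r, f1 = token_prf("she bought a blue ceramic mug and a red scarf", "blue ceramic mug")
```

An answer padded with distractor content from other participants shows the same pattern: the needed tokens are present, but they are diluted.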

3. Limited Causal Reasoning & Referential Conventions

Reasoning tasks (MCR, RET) show the lowest scores. Near-zero BLEU-1 indicates models rarely reproduce precise factual phrasing needed to connect distributed evidence.

4. Poor Robustness to Conflicting Information

Conflict Detection remains particularly difficult with near-zero lexical precision and recall, highlighting the inability to resolve contradictions in human–human interactions.

Dyadic vs. Multi-party Impact

Dyadic dialogues span longer time horizons with more sessions (avg. 14.2 sessions), whereas multi-party dialogues contain denser interactions within fewer sessions (avg. 70.5 rounds/session and 5.0 sessions). This leads to complementary performance patterns:

Consistency-oriented tasks (KR, CD) are substantially harder in multi-party settings due to contradictory signals from multiple speakers. NaiveRAG's KR drops from 0.4896 (dyadic) to 0.2500 (multi-party).

Context-concentrated tasks (CRR, TTL) achieve comparable or higher performance in multi-party settings. Parameter scaling alone does not eliminate this gap, indicating current memory mechanisms remain insufficiently robust.
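As a quick worked calculation, the reported score drops can be expressed in relative terms. The helper name is an assumption; the input numbers are the LLM-as-Judge scores quoted above (MuRAG from UPR to CRR, and NaiveRAG's KR from dyadic to multi-party).

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the original score that is lost, e.g. 0.25 means a 25% drop."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# MuRAG, UPR -> CRR: roughly a 16% relative drop.
murag_cross_modal = relative_drop(0.6346, 0.5326)

# NaiveRAG KR, dyadic -> multi-party: roughly a 49% relative drop.
naive_rag_kr = relative_drop(0.4896, 0.2500)
```

In relative terms, the multi-party penalty on Knowledge Resolution is about three times the cross-modal penalty on recall.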

LLM-as-Judge Performance (Weighted Average)

Method        Recall  Reasoning  Application
Full (Text)   0.35    0.30       0.49
NaiveRAG      0.47    0.38       0.58
A-Mem         0.60    0.43       0.64
Full (MM)     0.39    0.33       0.52
MuRAG         0.56    0.41       0.63
NGM           0.48    0.40       0.64
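A category-level weighted average is presumably computed by weighting each task type's score by its number of questions; this weighting scheme is an assumption, since the text does not spell it out. The per-task scores and question counts in the example are hypothetical placeholders, not benchmark results.

```python
def weighted_average(scores: dict[str, float], counts: dict[str, int]) -> float:
    """Average per-task scores, weighting each task by its question count."""
    total = sum(counts[task] for task in scores)
    if total == 0:
        raise ValueError("no questions to average over")
    return sum(scores[task] * counts[task] for task in scores) / total


# Hypothetical values, for illustration only.
example_scores = {"UPR": 0.6, "CRR": 0.5, "KR": 0.4}
example_counts = {"UPR": 100, "CRR": 50, "KR": 50}
recall_score = weighted_average(example_scores, example_counts)
```

With this weighting, task types with more questions pull the category score toward their own result.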

Error Archetype Distribution

Manual analysis of 100 failed cross-modal and reasoning instances from three multimodal methods, categorized into four archetypes.

Error Type              Full (MM)  MuRAG  NGM
Modal Misalignment      48%        44%    46%
Speaker-related Errors  37%        35%    32%
Temporal Confusion      15%        16%    9%
Other / Hallucination   5%         5%     6%

Efficiency Trade-offs

Tracking multimodal human–human interactions imposes substantial computational burdens. A clear trade-off exists between storage and inference latency.

Method        Storage (s/sess)  Retrieval (s/q)  Answer (s/q)
Full (Text)   0.0015            0.16             17.99
NaiveRAG      0.69              1.37             10.06
A-Mem         351.08            0.02             4.57
Full (MM)     0.0009            0.36             26.09
MuRAG         9.86              1.47             12.64
NGM           6.53              0.77             4.33
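The trade-off can be made concrete with a rough cost model that adds storage time per session to retrieval and answer time per query. The linear model and the function name are assumptions. The figures are read from the table above, taking the Full (Text) row as 0.0015, 0.16, and 17.99.

```python
# method: (storage s/session, retrieval s/query, answer s/query)
COSTS = {
    "Full (Text)": (0.0015, 0.16, 17.99),
    "NaiveRAG": (0.69, 1.37, 10.06),
    "A-Mem": (351.08, 0.02, 4.57),
    "Full (MM)": (0.0009, 0.36, 26.09),
    "MuRAG": (9.86, 1.47, 12.64),
    "NGM": (6.53, 0.77, 4.33),
}


def total_seconds(method: str, sessions: int, queries: int) -> float:
    """Estimated total time, assuming costs simply add up linearly."""
    storage, retrieval, answer = COSTS[method]
    return sessions * storage + queries * (retrieval + answer)


few_queries = {m: total_seconds(m, sessions=1, queries=10) for m in ("Full (Text)", "A-Mem")}
many_queries = {m: total_seconds(m, sessions=1, queries=100) for m in ("Full (Text)", "A-Mem")}
```

Under this model, for one session and 10 queries, Full (Text) costs about 182 s against about 397 s for A-Mem. At 100 queries the order flips, to about 1,815 s against about 810 s, so A-Mem's heavy storage cost pays off only when many queries are asked per stored session.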

Case Studies

Figure 4: Case studies of multimodal conversational reasoning. (a) Identifying ingredients in Lu Zhixing's recipe. (b) Inferring Lin Chang'an's conclusion based on a shared menu.

Conclusion

H2HMem provides a unified framework for evaluating multimodal memory in LLM agents within human–human interactions, assessing memory recall, reasoning, and application.

Experiments show that current methods can retrieve relevant information but remain weak at integrating it. They can recall fragments—images, facts, statements—but fail to align visual evidence with text, attribute information to the correct speaker across sessions, or resolve contradictions from multiple sources.

In multimodal human–human interactions, remembering fragments is not enough: agents must reconstruct coherent multimodal memory from distributed human communications.

Limitations: The dataset is synthetically generated with a human in the loop, is of modest scale (25 dialogues, 2,236 QA pairs), and is limited to English. Only a subset of memory methods and MLLM backbones is evaluated. Despite these limitations, H2HMem provides a useful foundation for future research.