WorldMemArena — Evaluating Multimodal Agent Memory

TL;DR

A benchmark that asks how memory is used, not just how much is remembered.

Existing benchmarks collapse memory into a single end-of-task accuracy score and reduce visual observations to captions, hiding where failures actually occur. WorldMemArena reframes memory as an observable lifecycle across realistic agent interaction.

Action–World Loop

Each step couples observation, action, environment feedback, and memory — so memory is evaluated as a process, not a snapshot.

Lifecycle Diagnosis

Four stages — write, maintain, retrieve, use — annotated with gold memory points, updates, distractors, and evidence chains.

Two Regimes

Lifelong Evolution tracks evolving personal & task states. Agentic Execution grounds memory in real trajectories.

Head-to-Head

First unified comparison of long-context, RAG, external memory, and harness-based memory agents under one protocol.

Framework

Memory as an Action–World Interaction Loop.

At each step the agent observes a partially visible world, takes an action, receives feedback, and updates memory to support future decisions.
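The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyWorld`, `ToyAgent`, and `run_episode` are hypothetical names invented here and are not part of the WorldMemArena code or protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorld:
    """Hypothetical partially visible world: one slot is visible per step."""
    slots: list
    cursor: int = 0

    def observe(self):
        # The agent sees only the slot under the cursor, never the full state.
        return {"position": self.cursor, "content": self.slots[self.cursor]}

    def step(self, action):
        # Apply the action and return feedback describing what happened.
        if action == "next" and self.cursor < len(self.slots) - 1:
            self.cursor += 1
            return "moved"
        return "blocked"


@dataclass
class ToyAgent:
    memory: dict = field(default_factory=dict)

    def act(self, observation):
        # Trivial policy for illustration: always try to move forward.
        return "next"

    def update_memory(self, observation, action, feedback):
        # Record what was seen, done, and returned at each position.
        self.memory[observation["position"]] = {
            "content": observation["content"],
            "action": action,
            "feedback": feedback,
        }


def run_episode(world, agent, max_steps):
    # One iteration = observe -> act -> feedback -> memory update.
    for _ in range(max_steps):
        obs = world.observe()
        action = agent.act(obs)
        feedback = world.step(action)
        agent.update_memory(obs, action, feedback)
    return agent.memory


memory = run_episode(ToyWorld(slots=["key", "door", "exit"]), ToyAgent(), max_steps=3)
```

Because memory is updated inside the loop rather than once at the end, each step leaves an inspectable record, which is what makes a process-level evaluation possible.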

①

Observe → Write

Identify future-useful evidence from observations, actions, screenshots, and feedback.

②

Update & Consolidate

Revise stale facts, remove outdated entries, preserve a consistent world state over time.

③

Retrieve for Decision

Surface the right evidence — by decision relevance, not just semantic similarity.

④

Use & Act

Faithfully use retrieved evidence to answer queries or take environment actions.
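One way to picture the four stages is as separate methods on a small store, each returning something that can be logged and scored on its own. The sketch below assumes a key-value memory and a caller that already knows which keys a decision needs; `MemoryEntry` and `ToyLifecycleMemory` are hypothetical names, not the benchmark's API.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    key: str      # what the fact is about, e.g. "meeting_room"
    value: str    # current value of the fact
    session: int  # session in which the fact was last written


class ToyLifecycleMemory:
    """Hypothetical store exposing the four stages as separate methods."""

    def __init__(self):
        self.entries = {}

    def write(self, key, value, session):
        # Stage 1: store new future-useful evidence; report whether it was added.
        if key not in self.entries:
            self.entries[key] = MemoryEntry(key, value, session)
            return True
        return False

    def maintain(self, key, value, session):
        # Stage 2: revise a stale fact in place instead of keeping a conflicting copy.
        entry = self.entries.get(key)
        if entry is not None and session >= entry.session:
            self.entries[key] = MemoryEntry(key, value, session)
            return True
        return False

    def retrieve(self, needed_keys):
        # Stage 3: select entries by what the decision needs.
        return [self.entries[k] for k in needed_keys if k in self.entries]

    def use(self, needed_keys):
        # Stage 4: answer strictly from the retrieved evidence.
        return {e.key: e.value for e in self.retrieve(needed_keys)}


mem = ToyLifecycleMemory()
mem.write("meeting_room", "A101", session=1)
mem.write("deadline", "Friday", session=2)
mem.maintain("meeting_room", "B202", session=3)
answer = mem.use(["meeting_room"])
```

Keeping the stages apart means a wrong final answer can be traced to a missed write, a skipped revision, a retrieval miss, or unfaithful use.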

Benchmark

400 multi-session tasks across Lifelong Evolution & Agentic Execution.

Lifelong Evolution focuses on evolving personal & project state. Agentic Execution places memory inside real GUI, embodied, and visual-agent trajectories — where evidence is distributed across actions, screenshots, and feedback.

Regime A

Lifelong Evolution

Hidden world states evolve across sessions; the system must consolidate, revise, and re-use long-term memory.

Professional verticals · academic, software, health, finance, education, startup
Holistic life course · evolving personal arcs, side arcs, interference

38 samples · 684 sessions · ~1.9k images

Regime B

Agentic Execution

Real agent trajectories where memory must be built from observations, tool feedback, and screenshots, not narration.

GUI Arena · excel, file mgmt, image edit, web, word docs
Embodied · ALFRED, navigation
VisualAgentBench · CSS, Minecraft, mobile, OmniGibson, WebArena-lite

362 samples · ~7.8k sessions · ~13.7k images
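As a quick sanity check on the composition figures above, the per-regime counts can be tabulated. The sample counts are exact as stated; the values written as "~" on this page are treated as rounded approximations, so the derived averages are approximate too.

```python
# Counts taken from the regime summaries; values given as "~" are approximate.
regimes = {
    "Lifelong Evolution": {"samples": 38, "sessions": 684, "images": 1900},
    "Agentic Execution": {"samples": 362, "sessions": 7800, "images": 13700},
}

# The two regimes should add up to the stated 400 tasks.
total_samples = sum(r["samples"] for r in regimes.values())

# Average sessions per sample in each regime.
sessions_per_sample = {
    name: r["sessions"] / r["samples"] for name, r in regimes.items()
}

for name, value in sessions_per_sample.items():
    print(f"{name}: about {value:.1f} sessions per sample")
print(f"total samples: {total_samples}")
```

Both regimes average roughly twenty sessions per sample, so the multi-session pressure is comparable even though Agentic Execution contributes most of the samples.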

WorldMemArena Sample Showcase

Browse real samples by domain → subcategory → sample, with the agent's complete multi-session trajectory, gold memory points, and checkpoint QA.

Data construction pipeline: segment sessions → extract & revise gold memory points → build staged checkpoint QA.
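The three pipeline steps can be illustrated with a toy event log. This is a sketch under assumptions, not the authors' pipeline: the event format, the function names, and the rule that a later fact on the same key revises the earlier one are all hypothetical.

```python
from collections import defaultdict


def segment_sessions(events):
    """Group a flat event log into sessions, ordered by session id."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session"]].append(event)
    return dict(sorted(sessions.items()))


def extract_and_revise(sessions):
    """Collect memory points; later facts on the same key revise earlier ones."""
    history = defaultdict(list)
    for session_id, events in sessions.items():
        for event in events:
            if "fact" in event:  # events without a fact are not memory points
                key, value = event["fact"]
                history[key].append((session_id, value))
    return dict(history)


def build_checkpoint_qa(history, checkpoint):
    """Ask for each key's value as of the given checkpoint session."""
    qa = []
    for key, revisions in history.items():
        visible = [value for sid, value in revisions if sid <= checkpoint]
        if visible:
            qa.append({"question": f"What is the current {key}?",
                       "answer": visible[-1]})
    return qa


# Hypothetical toy log: the budget is revised in session 2.
events = [
    {"session": 1, "fact": ("budget", "10k")},
    {"session": 1, "note": "small talk"},
    {"session": 2, "fact": ("budget", "12k")},
    {"session": 3, "fact": ("owner", "Dana")},
]
history = extract_and_revise(segment_sessions(events))
qa_at_1 = build_checkpoint_qa(history, checkpoint=1)
qa_at_3 = build_checkpoint_qa(history, checkpoint=3)
```

Staging the QA by checkpoint means the same question can have different gold answers at different points in the trajectory, which is what exposes missed updates.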

Key Findings

Better storage is not the same as better memory.

A unified comparison of long-context, manually designed memory systems, and harness-based agents surfaces four consistent patterns.

01

Storage ≠ Use

High memory recall does not translate to QA correctness. Retrieval, not capacity, is the dominant bottleneck.
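The gap between storing and using can be made concrete by scoring the stages separately. The sketch below uses a hypothetical record format and made-up toy values purely for illustration; they are not results from the benchmark.

```python
def memory_recall(records):
    # Fraction of gold memory points that were stored at all.
    gold = sum(len(r["gold"]) for r in records)
    stored = sum(len(r["gold"] & r["stored"]) for r in records)
    return stored / gold


def retrieval_recall(records):
    # Fraction of gold memory points that were surfaced at question time.
    gold = sum(len(r["gold"]) for r in records)
    retrieved = sum(len(r["gold"] & r["retrieved"]) for r in records)
    return retrieved / gold


def qa_accuracy(records):
    # Fraction of questions answered correctly.
    return sum(r["correct"] for r in records) / len(records)


# Toy records (illustrative only): most gold points are stored, few are retrieved.
records = [
    {"gold": {"m1", "m2"}, "stored": {"m1", "m2", "m9"}, "retrieved": {"m9"}, "correct": False},
    {"gold": {"m3"}, "stored": {"m3"}, "retrieved": {"m3"}, "correct": True},
    {"gold": {"m4"}, "stored": {"m4", "m7"}, "retrieved": {"m7"}, "correct": False},
    {"gold": {"m5"}, "stored": set(), "retrieved": set(), "correct": False},
]

print(f"memory recall:    {memory_recall(records):.2f}")
print(f"retrieval recall: {retrieval_recall(records):.2f}")
print(f"QA accuracy:      {qa_accuracy(records):.2f}")
```

In this toy case memory recall is high while retrieval recall and QA accuracy are low, the pattern the finding describes: the evidence exists in memory but is not surfaced when needed.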

02

Multimodal still text-routed

Most systems compress visual evidence into captions and lose spatial, temporal, and procedural detail.

03

Agentic worlds are brittle

Performance degrades sharply on real GUI and embodied trajectories where evidence is distributed across actions.

04

Harness ≠ free lunch

Harness-based memory adapts better in hard regimes but remains expensive and less stable across backbones.

Lifecycle failures. Failures compound across the lifecycle: early omissions reduce later evidence availability and contaminate future memory.

Domain heatmap. Domain gap: agentic tasks and complex multimodal QA create different memory pressures.

Token efficiency. Cost–accuracy tradeoff: harnesses shift the frontier, but at higher token and latency cost.
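A cost–accuracy frontier of this kind is typically read as a Pareto frontier: a system is on it if no other system is at least as cheap and at least as accurate, with one of the two strictly better. The sketch below assumes that reading; the system names and numbers are placeholders, not measurements.

```python
def pareto_frontier(systems):
    """Return the systems that no other system dominates.

    Each system is (name, token_cost, accuracy). System b dominates a if
    b costs no more, is no less accurate, and is strictly better on one axis.
    """
    frontier = []
    for name, cost, acc in systems:
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder systems (hypothetical values for illustration only).
systems = [
    ("A", 1000, 0.40),
    ("B", 4000, 0.55),
    ("C", 5000, 0.50),  # costs more than B and is less accurate
    ("D", 9000, 0.70),
]
frontier = pareto_frontier(systems)
```

Here C drops off the frontier because B is both cheaper and more accurate, while D stays on it despite its cost because nothing else matches its accuracy.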

Open everything

Dive into the benchmark.

Inspect domain composition, modality pressure, retrieval pressure, and representative traces in the interactive explorer — or jump straight to the dataset.

Open Sample Showcase · Hugging Face Dataset