Multimodal Agent Memory Benchmark · 2026

WorldMemArena: Evaluating Multimodal Agent Memory
Through Action–World Interaction

Memory must do more than recall. It must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. WorldMemArena reframes agent memory as a four-stage lifecycle — write, maintain, retrieve, use — across 400 multi-session multimodal tasks.

Chengzhi Liu¹, Yuzhe Yang¹, Sophia Xiao Pu¹, Yepeng Liu¹, Lin Long⁵, Yichen Guo¹, Nuo Chen⁶, Zhaotian Weng¹, Elena Kochkina², Simerjot Kaur², Charese Smiley², Xiaomo Liu², James Zou⁴, Sheng Liu⁴, Yuheng Bu¹, Songyou Peng³, Xin Eric Wang¹
¹UC Santa Barbara · ²J.P. Morgan Chase · ³ETH Zurich · ⁴Stanford · ⁵JHU · ⁶CMU
WorldMemArena: Action–World Interaction Loop overview
Benchmark statistics: multi-session tasks · checkpoint QA pairs · images & screenshots · interaction steps · evaluation dimensions
TL;DR

A benchmark that asks how memory is used, not just how much is remembered.

Existing benchmarks collapse memory into a single end-of-task accuracy and reduce visual observations to captions, hiding where failures actually occur. WorldMemArena reframes memory as an observable lifecycle across realistic agent interaction.

Action–World Loop

Each step couples observation, action, environment feedback, and memory — so memory is evaluated as a process, not a snapshot.

Lifecycle Diagnosis

Four stages — write, maintain, retrieve, use — annotated with gold memory points, updates, distractors, and evidence chains.

Two Regimes

Lifelong Evolution tracks evolving personal & task states. Agentic Execution grounds memory in real trajectories.

Head-to-Head

First unified comparison of long-context, RAG, external memory, and harness-based memory agents under one protocol.

Framework

Memory as an Action–World Interaction Loop.

At each step the agent observes a partially visible world, takes an action, receives feedback, and updates memory to support future decisions.

Observe → Write

Identify future-useful evidence from observations, actions, screenshots, and feedback.

Update & Consolidate

Revise stale facts, remove outdated entries, preserve a consistent world state over time.

Retrieve for Decision

Surface the right evidence — by decision relevance, not just semantic similarity.

Use & Act

Faithfully use retrieved evidence to answer queries or take environment actions.
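
The four stages above can be illustrated with a toy memory store. This is a minimal sketch under stated assumptions, not the benchmark's implementation: `MemoryEntry`, `MemoryStore`, `answer`, and the example keys are all hypothetical names, and matching entries by exact key is a stand-in for whatever decision-relevance scoring a real system would use.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class MemoryEntry:
    key: str            # what the fact is about (hypothetical naming scheme)
    value: str          # value observed at write time
    step: int           # interaction step at which it was written
    stale: bool = False  # set when a newer observation supersedes this entry


@dataclass
class MemoryStore:
    entries: List[MemoryEntry] = field(default_factory=list)

    def write(self, key: str, value: str, step: int) -> None:
        """Write + maintain: store new evidence and mark older values for the same key stale."""
        for entry in self.entries:
            if entry.key == key and not entry.stale:
                entry.stale = True
        self.entries.append(MemoryEntry(key, value, step))

    def retrieve(self, needed_keys: Set[str]) -> List[MemoryEntry]:
        """Retrieve: return only live entries that the pending decision actually needs."""
        return [e for e in self.entries if e.key in needed_keys and not e.stale]


def answer(memory: MemoryStore, key: str) -> Optional[str]:
    """Use: answer strictly from retrieved evidence, or abstain with None."""
    hits = memory.retrieve({key})
    return hits[-1].value if hits else None


memory = MemoryStore()
# Toy trajectory of (step, key, observed value); the file is moved at step 3.
trajectory = [
    (1, "report.xlsx/location", "Desktop"),
    (2, "theme/color", "blue"),
    (3, "report.xlsx/location", "Documents/Q3"),
]
for step, key, value in trajectory:
    memory.write(key, value, step)

print(answer(memory, "report.xlsx/location"))  # Documents/Q3
print(answer(memory, "printer/status"))        # None
```

In this sketch the stale `Desktop` entry is kept but excluded from retrieval, so the revision history stays inspectable while the answer reflects the current world state.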

WorldMemArena framework figure
Benchmark

400 multi-session tasks across Lifelong Evolution & Agentic Execution.

Lifelong Evolution focuses on evolving personal & project state. Agentic Execution places memory inside real GUI, embodied, and visual-agent trajectories — where evidence is distributed across actions, screenshots, and feedback.

Regime A

Lifelong Evolution

Hidden world states evolve across sessions; the system must consolidate, revise, and re-use long-term memory.

  • Professional verticals · academic, software, health, finance, education, startup
  • Holistic life course · evolving personal arcs, side arcs, interference
38 samples · 684 sessions · ~1.9k images
Regime B

Agentic Execution

Real agent trajectories where memory must be built from observations, tool feedback, and screenshots, not narration.

  • GUI Arena · excel, file mgmt, image edit, web, word docs
  • Embodied · ALFRED, navigation
  • VisualAgentBench · CSS, Minecraft, mobile, OmniGibson, WebArena-lite
362 samples · ~7.8k sessions · ~13.7k images

WorldMemArena Sample Showcase

Browse real samples by domain → subcategory → sample, with the agent's complete multi-session trajectory, gold memory points, and checkpoint QA.

Data construction pipeline.
Unified pipeline: segment sessions → extract & revise gold memory points → build staged checkpoint QA.
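
The three pipeline steps can be sketched on toy data as follows. Everything here is an illustrative assumption: the class and field names (`Step`, `GoldMemoryPoint`, `CheckpointQA`) are hypothetical, and the `key = value` string parsing merely stands in for whatever annotation or model-based extraction the actual pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class Step:
    session_id: int
    text: str                  # observation/action/feedback, flattened for this toy


@dataclass
class GoldMemoryPoint:
    key: str
    value: str
    session_id: int            # session in which this value became true
    superseded: bool = False   # set when a later session revises the value


@dataclass
class CheckpointQA:
    after_session: int
    question: str
    answer: str
    evidence_keys: list


def segment_sessions(steps):
    """Step 1: group a flat step list into sessions, preserving order."""
    sessions = {}
    for s in steps:
        sessions.setdefault(s.session_id, []).append(s)
    return sessions


def extract_and_revise(sessions):
    """Step 2: toy extractor; 'key = value' steps become gold points, later values supersede earlier ones."""
    points = []
    for sid in sorted(sessions):
        for s in sessions[sid]:
            if "=" not in s.text:
                continue
            key, value = (part.strip() for part in s.text.split("=", 1))
            for p in points:
                if p.key == key and not p.superseded:
                    p.superseded = True
            points.append(GoldMemoryPoint(key, value, sid))
    return points


def build_checkpoints(points, sessions):
    """Step 3: after each session, ask about every key using the value current at that point."""
    qa = []
    for sid in sorted(sessions):
        current = {}
        for p in points:               # points are in session order, so later ones overwrite
            if p.session_id <= sid:
                current[p.key] = p.value
        for key, value in current.items():
            qa.append(CheckpointQA(sid, f"What is the current {key}?", value, [key]))
    return qa


steps = [
    Step(1, "deadline = May 3"),
    Step(1, "opened the project tracker"),
    Step(2, "deadline = May 10"),
]
sessions = segment_sessions(steps)
points = extract_and_revise(sessions)
checkpoints = build_checkpoints(points, sessions)
for c in checkpoints:
    print(c.after_session, c.question, "->", c.answer)
```

The point of staging the checkpoints is visible even in this toy: the same question has a different gold answer after session 1 than after session 2, so a system that never revises its memory is caught at the later checkpoint.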
Key Findings

Better storage is not the same as better memory.

A unified comparison of long-context models, manually designed memory systems, and harness-based agents surfaces four consistent patterns.

01

Storage ≠ Use

High memory recall does not translate to QA correctness. Retrieval, not capacity, is the dominant bottleneck.

02

Multimodal memory is still text-routed

Most systems compress visual evidence into captions and lose spatial, temporal, and procedural detail.

03

Agentic worlds are brittle

Performance degrades sharply on real GUI and embodied trajectories where evidence is distributed across actions.

04

Harness ≠ free lunch

Harness-based memory adapts better in hard regimes but remains expensive and less stable across backbones.

Lifecycle failures.
Lifecycle errors compound. Early omissions reduce later evidence availability and contaminate future memory.
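
One way to make this compounding concrete is to attribute each checkpoint to the earliest lifecycle stage at which gold evidence was lost. The sketch below is a hypothetical diagnostic, not the benchmark's scoring protocol: the function `first_failure`, the record fields, and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def first_failure(gold, written, current, retrieved, correct):
    """Return the earliest stage that lost gold evidence for one checkpoint.

    gold      -- evidence keys the checkpoint answer depends on
    written   -- keys that were ever written to memory
    current   -- keys whose stored value still matches the latest world state
    retrieved -- keys surfaced at decision time
    correct   -- whether the final answer or action was right
    """
    gold = set(gold)
    if not gold <= set(written):
        return "write"
    if not gold <= set(current):
        return "maintain"
    if not gold <= set(retrieved):
        return "retrieve"
    if not correct:
        return "use"
    return "ok"


# Toy checkpoint records (fabricated for illustration only, not benchmark data).
records = [
    {"gold": ["a", "b"], "written": ["a"], "current": ["a"], "retrieved": ["a"], "correct": False},
    {"gold": ["c"], "written": ["c"], "current": [], "retrieved": [], "correct": False},
    {"gold": ["d"], "written": ["d", "x"], "current": ["d", "x"], "retrieved": ["x"], "correct": False},
    {"gold": ["e"], "written": ["e"], "current": ["e"], "retrieved": ["e"], "correct": False},
    {"gold": ["f"], "written": ["f"], "current": ["f"], "retrieved": ["f"], "correct": True},
]
tally = Counter(first_failure(**r) for r in records)
print(dict(tally))
```

Because a key that was never written can never be current or retrieved, attributing to the earliest failing stage avoids counting one upstream omission as several downstream failures.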
Domain heatmap.
Domain gap. Agentic tasks and complex multimodal QA create different memory pressures.
Token efficiency.
Cost–accuracy tradeoff. Harnesses shift the frontier, but at higher token & latency cost.
Open everything

Dive into the benchmark.

Inspect domain composition, modality pressure, retrieval pressure, and representative traces in the interactive explorer — or jump straight to the dataset.