Sample Showcase

WorldMemArena sample showcase.

See what the benchmark actually looks like. The showcase ships 54 real samples, two for each of the 27 subcategories, and every sample includes its full multi-session trajectory. Pick a domain, then a subcategory, then a sample card. The viewer renders the agent's complete trajectory (observations, actions, screenshots, gold memory points, and checkpoint QA), all loaded directly from the released benchmark JSON and images.

How the browser is organized

Domain — the four major regimes: Agent · GUI, Agent · Embodied, Lifelong · Project, Lifelong · Personal.
Subcategory — within each domain (e.g. excel, omnigibson, academic). 27 subcategories in total.
Sample — two real samples per subcategory, each with its full multi-session trajectory. Click a card to load it.
Session navigator — every session of the loaded sample is here. Click a session to see its observations, actions, and gold memory points.
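
The hierarchy above (domain, then subcategory, then sample, then session) can be sketched as a small nested index. This is only an illustration: the class names, field names, and sample ids are hypothetical and are not taken from the released benchmark JSON, and the pairing of excel with Agent · GUI and omnigibson with Agent · Embodied is an assumption made for the demo data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Session:
    # Hypothetical fields; the released JSON may name these differently.
    session_id: str
    steps: list = field(default_factory=list)
    memory_points: list = field(default_factory=list)


@dataclass
class Sample:
    sample_id: str
    domain: str
    subcategory: str
    sessions: list = field(default_factory=list)


def build_index(samples):
    """Group samples as domain -> subcategory -> list of samples."""
    index = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        index[sample.domain][sample.subcategory].append(sample)
    return index


def count_subcategories(index):
    """Total number of subcategories across all domains."""
    return sum(len(subcats) for subcats in index.values())


def count_samples(index):
    """Total number of samples across all domains and subcategories."""
    return sum(len(group) for subcats in index.values() for group in subcats.values())


# Toy data: two samples per subcategory, mirroring the showcase layout.
# The ids and the domain/subcategory pairing are assumptions.
demo_samples = [
    Sample("gui-excel-01", "Agent · GUI", "excel"),
    Sample("gui-excel-02", "Agent · GUI", "excel"),
    Sample("emb-omni-01", "Agent · Embodied", "omnigibson"),
    Sample("emb-omni-02", "Agent · Embodied", "omnigibson"),
]
demo_index = build_index(demo_samples)
```

Run over the full showcase, an index like this would be expected to report 27 subcategories and 54 samples, matching the counts stated above.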

How to read a session

Left column — agentic samples show Obs / Act / Action per step with the screenshot; lifelong samples show the role-tagged dialogue stream with inline images.
Right column — gold Memory Points (MPs) annotated for this session. Yellow = the MP is cited as evidence by a checkpoint QA.
Sample QA — checkpoint questions over the trajectory; the evidence chips reference the mp_* ids visible in the sidebar.
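
The link between checkpoint QA and Memory Points can be sketched as a set lookup: collect every mp_* id cited as evidence, then flag the Memory Points whose id appears in that set (the ones the viewer would show in yellow). The dictionary keys `id`, `text`, `question`, and `evidence` are hypothetical and are not taken from the released JSON; the demo records are invented for illustration.

```python
def cited_mp_ids(qa_items):
    """Collect every Memory Point id cited as evidence by any QA item."""
    cited = set()
    for qa in qa_items:
        cited.update(qa.get("evidence", []))
    return cited


def mark_cited(memory_points, qa_items):
    """Return copies of the Memory Points with a boolean 'cited' flag added."""
    cited = cited_mp_ids(qa_items)
    return [{**mp, "cited": mp["id"] in cited} for mp in memory_points]


def dangling_evidence(memory_points, qa_items):
    """Evidence ids that do not match any known Memory Point (a sanity check)."""
    known = {mp["id"] for mp in memory_points}
    return sorted(cited_mp_ids(qa_items) - known)


# Toy records; keys and values are assumptions, not benchmark data.
demo_mps = [
    {"id": "mp_001", "text": "first annotated fact"},
    {"id": "mp_002", "text": "second annotated fact"},
    {"id": "mp_003", "text": "third annotated fact"},
]
demo_qa = [
    {"question": "toy question A", "evidence": ["mp_001", "mp_003"]},
    {"question": "toy question B", "evidence": ["mp_003", "mp_009"]},
]

demo_marked = mark_cited(demo_mps, demo_qa)
demo_dangling = dangling_evidence(demo_mps, demo_qa)
```

Because checkpoint questions range over the whole trajectory, a check like `dangling_evidence` is best run against the Memory Points pooled from all sessions of a sample rather than a single session.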