MemoryData

A Unified Memory Benchmark Suite for Memory-Augmented Agents

"One pipeline. Four benchmark families. Twenty-two method presets. One consistent execution interface."

Introduction • Features • Quick Start • Layout • Methods • Benchmarks • Config • Artifacts • FAQ • Citation

_{Main results from the accompanying paper: memory-augmented agent methods compared across the LongMemEval, LoCoMo, and DB-Bench benchmarks under exact-match, ROUGE-L, and LLM-judge metric families. Bars are grouped by paradigm — Reference Baselines, Sequential Context, Structural Topological, and Multi-Paradigm Hybrid.}

✨ Introduction

Memory-augmented agents, structured memory architectures, and retrieval-based baselines are usually evaluated in isolation — each paper ships its own loader, its own runtime adapter, and its own metric harness. Results are hard to compare, and reproducing a single number across two methods often means re-implementing both.

MemoryData closes that gap. It is a research-oriented benchmark suite that unifies four benchmark families (MemoryAgentBench, LoCoMo, LongBench, MemBench), twenty-two method presets, and a shared runtime under a single main.py launcher, so that heterogeneous memory formulations can be compared under one consistent execution interface and one stable artifact layout.

📚 Features

🚀 Unified Launcher `main.py` is the single entry point for benchmark execution, artifact writing, and optional post-run evaluation hooks. Select any method and any benchmark by pointing at two YAML files.	🧩 22 Method Presets Flattened YAML presets span reference baselines, sequential context, structural topological, and multi-paradigm hybrid architectures — each wired to its vendored runtime.
📊 4 Benchmark Families MemoryAgentBench, LoCoMo, LongBench, and MemBench, each with full and category-specific or slice-specific configs ready to run.	🗂 Consistent Taxonomy Methods are grouped following the paper's RQ1 effectiveness taxonomy, so presets are discoverable by paradigm instead of by filename.
📦 Structured Artifacts Every run emits a result JSON, persisted agent state, and optional logs under a stable, override-able `results/` root for reproducible post-processing.	🖥 Cross-Platform Separate dependency manifests for Linux/macOS and Windows, with BM25 and long-context reference paths retained under `utils/`.

🕹 Quick Start

Prerequisites: Python 3.11, an OpenAI-compatible model endpoint, and the benchmark datasets placed under datasets/.

Step 1: Create the environment

conda create -n memory-bench python=3.11
conda activate memory-bench

Platform	Command
Linux / macOS	`pip install -r requirements.txt`
Windows	`pip install -r requirements-windows.txt`

Step 2: Configure model endpoints and keys

Most presets assume OpenAI-compatible serving endpoints. Update the YAML files in config/ so that model, base_url, embedding_base_url, and related provider fields match the model servers available in your environment.

Variable	Used by	Notes
`OPENAI_API_KEY`	Most presets	Default key variable for chat and embedding calls
`OPENAI_API_BASE`	MemOS example environment	Refer to `methods/MemOS/config/.env.example` when using MemOS-specific setup

Step 3: Prepare datasets

Datasets are not bundled with this repository. Place them under datasets/ according to the loader expectations.

Benchmark	Default path	Format	Notes
MemoryAgentBench	`datasets/MemoryAgentBench/eval_dataset_collection/`	HuggingFace `save_to_disk` directory	Falls back to `ai-hyz/MemoryAgentBench` if the local copy is absent
LoCoMo	`datasets/LoCoMo/rq1_4cat_600_dist/locomo_4cat_600_dist.json`	JSON file	Used by the full and category-specific LoCoMo presets
LongBench	`datasets/longBench_rep150_proportional/datasets`	HuggingFace `save_to_disk` directory	Targets the proportional subset
MemBench	`datasets/MemBench/MemData/FirstAgent/*.json`	JSON files	`simple`, `noisy`, `knowledge_update`, `highlevel`, `RecMultiSession`

Reference layout:

datasets/
├── MemoryAgentBench/
│   └── eval_dataset_collection/          # HuggingFace save_to_disk directory
├── LoCoMo/
│   └── rq1_4cat_600_dist/
│       └── locomo_4cat_600_dist.json
├── longBench_rep150_proportional/
│   └── datasets/                         # HuggingFace save_to_disk directory
└── MemBench/
    └── MemData/FirstAgent/               # simple / noisy / knowledge_update / highlevel / RecMultiSession

Step 4: Run experiments

Command template:

python main.py --agent_config <agent_yaml> --dataset_config <dataset_yaml>

Representative runs:

Scenario	Agent config	Dataset config	Extra flags
Default MemoryAgentBench run	`config/reference_long_context_agent.yaml`	`benchmark/memoryagentbench/Accurate_Retrieval/config/EventQA/Eventqa_full.yaml`	-
Small smoke run	`config/reference_long_context_agent.yaml`	`benchmark/memoryagentbench/Accurate_Retrieval/config/EventQA/Eventqa_full.yaml`	`--max_test_queries_ablation 1`
LoCoMo evaluation	`config/hybrid_simplemem.yaml`	`benchmark/locomo/config/Locomo_qa_4cat_600_dist.yaml`	-
LongBench evaluation	`config/reference_embedding_rag.yaml`	`benchmark/longbench/config/LongBench_rep150_proportional.yaml`	-
MemBench evaluation	`config/sequential_mem0.yaml`	`benchmark/membench/config/MemBench_simple.yaml`	-

Example:

python main.py \
  --agent_config config/reference_long_context_agent.yaml \
  --dataset_config benchmark/memoryagentbench/Accurate_Retrieval/config/EventQA/Eventqa_full.yaml

🗂 Repository Layout

project-root/
├── main.py                        # unified experiment entry point
├── config/                        # flattened presets: reference, sequential, topological, hybrid
├── benchmark/
│   ├── memoryagentbench/          # MemoryAgentBench loaders and benchmark configs
│   ├── locomo/                    # LoCoMo configs and JSON loader
│   ├── longbench/                 # LongBench proportional-subset support
│   └── membench/                  # MemBench slice configs and loader
├── evaluation/
│   └── longmemeval/               # retained LongMemEval sidecar evaluation helpers
├── methods/                        # method runtimes grouped by the paper taxonomy
│   ├── embedding_rag/              # reference dense-retrieval baseline
│   ├── memagent/  mem0/  memochat/ # sequential context architectures
│   ├── cognee/  graph_rag/  hipporag/  memtree/  raptor/  zep/  zep_local/ # structural topological architectures
│   └── a_mem/  everos/  letta/  lightmem/  memorag/  memoryos/  self_rag/  simplemem/  MemOS/ # multi-paradigm hybrid architectures
├── utils/                          # shared runtime utilities, including long-context and BM25 reference paths
├── requirements.txt               # dependency manifest for Linux/macOS
└── requirements-windows.txt       # dependency manifest for Windows

🧠 Method Overview

The taxonomy below follows the grouping used in the main RQ1 effectiveness table of the accompanying paper. Methods retained in the released codebase but not displayed in that specific summary table are assigned to the corresponding taxonomy group for completeness.

Group	Method	Representative preset	Runtime entry	Notes
Reference Baselines	Long Context	`reference_long_context_agent.yaml`	`utils/agent.py`	Direct long-context answering baseline without an external memory store
Reference Baselines	Embedding RAG	`reference_embedding_rag.yaml`	`methods/embedding_rag/embedding_retriever.py`	Reference dense-retrieval baseline
Reference Baselines	BM25 RAG	`reference_simple_rag_bm25.yaml`	`utils/agent.py`	Sparse lexical retrieval baseline retained for comparison and smoke runs
Sequential Context Architectures	MemAgent	`sequential_memagent.yaml`	`methods/memagent/`	Recurrent sequential-memory baseline
Sequential Context Architectures	Mem0	`sequential_mem0.yaml`	`methods/mem0/source/mem0/`	Sequential memory storage with persistent structured state
Sequential Context Architectures	MemoChat	`sequential_memochat.yaml`	`methods/memochat/memochat_adapter.py`	Sequential dialogue memory with rolling summaries
Structural Topological Architectures	Cognee	`topological_cognee.yaml`	`methods/cognee/source/cognee/`	Graph-structured memory runtime
Structural Topological Architectures	Zep Local	`topological_zep_local.yaml`	`methods/zep_local/main.py`	Local graph-memory service path
Structural Topological Architectures	MemTree	`topological_memtree.yaml`	`methods/memtree/memtree_adapter.py`	Tree-structured memory organization with provenance
Structural Topological Architectures	GraphRAG	`topological_graph_rag.yaml`	`methods/graph_rag/graph_rag.py`	Structured graph-based retrieval baseline
Structural Topological Architectures	HippoRAG	`topological_hippo_rag_v2_openai.yaml`	`methods/hipporag/`	Retrieval over graph-style document organization
Structural Topological Architectures	RAPTOR	`topological_raptor.yaml`	`methods/raptor/raptor.py`	Hierarchical cluster-and-summarize retrieval baseline
Structural Topological Architectures	Zep	`topological_zep.yaml`	`methods/zep/zep.py`	Cloud-backed graph-memory integration
Multi-Paradigm Hybrid Architectures	Letta	`hybrid_letta.yaml`	`utils/agent.py`	Integrated through vendored Letta source and local runtime management
Multi-Paradigm Hybrid Architectures	LightMem	`hybrid_lightmem.yaml`	`methods/lightmem/lightmem_adapter.py`	Layered memory construction and retrieval
Multi-Paradigm Hybrid Architectures	SimpleMem	`hybrid_simplemem.yaml`	`methods/simplemem/simplemem_adapter.py`	Hybrid semantic, keyword, and structured retrieval
Multi-Paradigm Hybrid Architectures	MemOS	`hybrid_memos.yaml`	`methods/MemOS/source/src/`	Vendored memory operating system runtime
Multi-Paradigm Hybrid Architectures	MemoryOS	`hybrid_memoryos.yaml`	`methods/memoryos/memoryos_adapter.py`	Local runtime wrapper for the preserved MemoryOS implementation
Multi-Paradigm Hybrid Architectures	A-MEM	`hybrid_a_mem.yaml`	`methods/a_mem/a_mem_adapter.py`	Hybrid memory writing and retrieval with provenance tracking
Multi-Paradigm Hybrid Architectures	EverOS	`hybrid_everos.yaml`	`methods/everos/everos_adapter.py`	Search-oriented external memory runtime
Multi-Paradigm Hybrid Architectures	Self-RAG	`hybrid_self_rag.yaml`	`methods/self_rag/self_rag.py`	Retrieval-augmented generation baseline retained in the current code release
Multi-Paradigm Hybrid Architectures	MemoRAG	`hybrid_memo_rag.yaml`	`methods/memorag/`	Cache-heavy retrieval pipeline for long contexts

📊 Benchmark Overview

Benchmark family	Config files	Task focus	Expected input format
MemoryAgentBench / Accurate Retrieval	`benchmark/memoryagentbench/Accurate_Retrieval/config/EventQA/Eventqa_full.yaml` `benchmark/memoryagentbench/Accurate_Retrieval/config/LongMemEval/Longmemeval_s.yaml`	Question answering and long-memory retrieval under curated MemoryAgentBench splits	HuggingFace `save_to_disk` copy under `datasets/MemoryAgentBench/eval_dataset_collection/`, or fallback to `ai-hyz/MemoryAgentBench`
MemoryAgentBench / Conflict Resolution	`benchmark/memoryagentbench/Conflict_Resolution/config/Factconsolidation_mh_6k.yaml`	Resolving conflicting facts across long interaction histories	Same MemoryAgentBench loading path as above
MemoryAgentBench / Test-Time Learning	`benchmark/memoryagentbench/Test_Time_Learning/config/ICL/ICL_banking77.yaml`	In-context adaptation and label-space memorization	Same MemoryAgentBench loading path as above
LoCoMo	`benchmark/locomo/config/Locomo_qa_4cat_600_dist.yaml` `benchmark/locomo/config/Locomo_qa_4cat_600_dist_cat1_multi_hop.yaml` `benchmark/locomo/config/Locomo_qa_4cat_600_dist_cat2_temporal.yaml` `benchmark/locomo/config/Locomo_qa_4cat_600_dist_cat3_open_domain.yaml` `benchmark/locomo/config/Locomo_qa_4cat_600_dist_cat4_single_hop.yaml`	Conversational QA over long dialogues, with full and category-specific subsets	JSON file, typically `datasets/LoCoMo/rq1_4cat_600_dist/locomo_4cat_600_dist.json`
LongBench	`benchmark/longbench/config/LongBench_rep150_proportional.yaml`	Long-context multiple-choice reasoning on the proportional subset used by the current preset	HuggingFace `save_to_disk` directory, typically `datasets/longBench_rep150_proportional/datasets`
MemBench	`benchmark/membench/config/MemBench_simple.yaml` `benchmark/membench/config/MemBench_noisy.yaml` `benchmark/membench/config/MemBench_knowledge_update.yaml` `benchmark/membench/config/MemBench_highlevel.yaml` `benchmark/membench/config/MemBench_RecMultiSession.yaml`	Memory stress tests covering simple recall, noise, knowledge updates, high-level reasoning, and multi-session recommendation	Slice-specific JSON files under `datasets/MemBench/MemData/FirstAgent/`

⚙️ Configuration Conventions

Field	Meaning
`provider`	Chat-model backend type, typically `openai_compatible` in the default presets
`base_url`	Endpoint for the chat model server
`embedding_provider`	Backend type for embedding generation when the method uses vector retrieval
`embedding_base_url`	Endpoint for the embedding model server
`*_api_key_env`	Environment variable name used to resolve API keys at runtime
`retrieve_num`	Retrieval depth or top-`k` used by retrieval-enabled methods

📦 Output Artifacts

Artifact type	Default location	Description
Result JSON	`results/outputs/<model>/<dataset>/<name_tag>_results.json`	Main evaluation output with metrics, query-level records, and summary fields
Agent states	`results/agents/`	Persisted agent memory, retrieval caches, and method-specific state
Artifact root override	`--artifact_root /path/to/artifacts`	Rebases the outer artifact root while keeping the internal layout unchanged

Artifact layout:

results/
├── outputs/                                     # evaluation outputs grouped by model and dataset
│   └── <model>/                                 # model or preset-specific output namespace
│       └── <dataset>/                           # benchmark-specific output namespace
│           └── <name_tag>_results.json          # primary result file with metrics and records
├── agents/                                      # persisted agent state and method-side caches
│   └── <model_or_method>/                       # runtime-specific storage namespace
└── logs/                                        # optional execution logs when enabled by the run

Example:

python main.py \
  --agent_config config/reference_long_context_agent.yaml \
  --dataset_config benchmark/memoryagentbench/Accurate_Retrieval/config/EventQA/Eventqa_full.yaml \
  --artifact_root /path/to/artifacts

When --artifact_root is specified, the pipeline preserves the same internal results/outputs, results/agents, and results/logs organization under the new root, which makes it straightforward to isolate repeated experiment batches while keeping downstream parsing and post-processing logic unchanged.

🤔 FAQ

Are the datasets bundled with the repository?

No. Datasets are not distributed here. Place them under datasets/ following the paths in the Quick Start section. MemoryAgentBench additionally falls back to the ai-hyz/MemoryAgentBench HuggingFace mirror when no local copy is present.

Do I need to rebuild or reinstall anything between runs?

No. MemoryData is a plain Python pipeline launched via python main.py. Switching methods or benchmarks is just a matter of pointing --agent_config and --dataset_config at different YAML files.

Which model providers are supported?

The default presets target OpenAI-compatible chat and embedding endpoints, so any provider that exposes that interface works. Update base_url, embedding_base_url, and the relevant *_api_key_env fields in the chosen preset to match your server.

How do I force a clean re-run?

Pass --force to delete saved results, rebuild local agent state, and reset supported external persistence before the run. Use --retry_failed_queries to retry previously failed queries instead of skipping them when resuming.

📒 Citation

If you find this benchmark suite useful in your research, please cite:

@article{zhoumemorydata2026,
    title={Are We Ready For An Agent-Native Memory System?},
    author={Wei Zhou and Xuanhe Zhou and Shaokun Han and Hongming Xu and Guoliang Li and Zhiyu Li and Feiyu Xiong and Fan Wu},
    year={2026},
    journal={arXiv preprint arXiv:2606.24775},
    url={https://arxiv.org/abs/2606.24775}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
benchmark		benchmark
config		config
evaluation/longmemeval		evaluation/longmemeval
methods		methods
utils		utils
.gitignore		.gitignore
Memory.png		Memory.png
MemoryData_overview.png		MemoryData_overview.png
README.md		README.md
main.py		main.py
requirements-windows.txt		requirements-windows.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MemoryData

✨ Introduction

📚 Features

🕹 Quick Start

Step 1: Create the environment

Step 2: Configure model endpoints and keys

Step 3: Prepare datasets

Step 4: Run experiments

🗂 Repository Layout

🧠 Method Overview

📊 Benchmark Overview

⚙️ Configuration Conventions

📦 Output Artifacts

🤔 FAQ

📒 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

MemoryData

✨ Introduction

📚 Features

🕹 Quick Start

Step 1: Create the environment

Step 2: Configure model endpoints and keys

Step 3: Prepare datasets

Step 4: Run experiments

🗂 Repository Layout

🧠 Method Overview

📊 Benchmark Overview

⚙️ Configuration Conventions

📦 Output Artifacts

🤔 FAQ

📒 Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages