Zep Is The New State of the Art In Agent Memory

Setting a new standard for agent memory with up to 100% accuracy gains and 90% lower latency.

With OpenAI's o-series models and advancements from other vendors suggesting the near-term emergence of agents capable of solving highly complex, Ph.D.-level problems, we need to rethink how these agents will access critical information. As agents become pervasive in our daily lives, they'll need access to a vast collection of continuously evolving data spanning user interactions, business operations, and world events.

This data universe will only expand. While we've seen rapid increases in LLM context window sizes and improved recall capabilities, our research shows that recall remains challenging. Even when solved, the computational demands of processing multi-million-token context windows—along with the resulting latency and power consumption—will remain impractical for many real-world use cases.

Zep is a temporal knowledge graph-based memory layer for AI agents that continuously learns from user interactions and changing business data. By providing agents with a complete, holistic view of each user, Zep enables developers to build applications that tackle complex, personalized tasks.

Redefining State of the Art Agent Memory

In research published today, we demonstrate that Zep outperforms the current state-of-the-art memory system, MemGPT (Letta AI), in the Deep Memory Retrieval (DMR) benchmark—the primary evaluation metric established by the Letta/MemGPT team. However, recent advances in LLM capabilities have revealed the DMR benchmark's limitations in assessing real-world performance.

Deep Memory Retrieval (Model: GPT-4-Turbo)

Memory                     Score
Zep                        94.8%
MemGPT†                    93.4%
Recursive Summarization†   35.3%

† Results reported in the MemGPT paper

More significantly, Zep excels in the LongMemEval benchmark, a comprehensive and challenging chat history memory evaluation that better reflects real-world enterprise use cases. In this benchmark, Zep delivers aggregate accuracy improvements of up to 18.5%, with individual evaluations showing gains exceeding 100% compared to using the full chat transcript in the context window, all while reducing response latency by 90%.

LongMemEval Accuracy

Memory         GPT-4o   GPT-4o-mini
Zep            71.2%    63.8%
Full-context   60.2%    55.4%

Notably, Zep's performance advantage over the "needle in a haystack" baseline was more pronounced when paired with gpt-4o as the agent LLM than with the smaller gpt-4o-mini. While the baseline also improved when moving from gpt-4o-mini to gpt-4o, the larger model leveraged Zep's temporal context more effectively.

Median Latency

Memory         GPT-4o   GPT-4o-mini
Zep            2.58s    3.20s
Full-context   28.9s    31.3s

The compute costs of placing full chat transcripts into the context window are very real. Evaluation with Zep utilized, on average, less than 2% of the baseline tokens while offering latency reductions of an order of magnitude.

Memory         Model         Median Latency   Latency IQR   Context Tokens
Zep            GPT-4o        2.58s            0.684s        1.6k
Zep            GPT-4o-mini   3.20s            1.31s         1.6k
Full-context   GPT-4o        28.9s            6.01s         115k
Full-context   GPT-4o-mini   31.3s            8.76s         115k
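The claims above are easy to verify from the table: a quick back-of-the-envelope check using the GPT-4o rows confirms the sub-2% token usage and the order-of-magnitude latency reduction.

```python
# Sanity-check the reported token and latency ratios (GPT-4o rows above).
zep_tokens, full_tokens = 1_600, 115_000
zep_latency, full_latency = 2.58, 28.9  # median seconds

token_fraction = zep_tokens / full_tokens  # fraction of baseline tokens Zep used
speedup = full_latency / zep_latency       # how much faster Zep responded

print(f"{token_fraction:.1%} of baseline tokens, {speedup:.1f}x faster")
```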

How Zep Works

Zep automatically constructs a temporal knowledge graph from agent-user interactions and evolving business data. This knowledge graph contains entities, relationships, and facts relevant to the user and business context. Zep takes inspiration from GraphRAG but, unlike GraphRAG, maintains a rich temporal understanding of how information evolves over time. As facts change or are superseded, the graph is updated to reflect the new state.
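The supersession behavior described above can be sketched in a few lines. This is a minimal toy illustration of temporal fact invalidation, not Zep's actual schema: a superseded fact receives an invalidation timestamp rather than being deleted, so history is preserved.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means the fact is still current

class TemporalGraph:
    """Toy store that supersedes facts instead of overwriting them."""
    def __init__(self):
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, predicate: str, obj: str, at: datetime):
        # Invalidate any currently-valid fact with the same subject/predicate.
        for f in self.facts:
            if (f.subject == subject and f.predicate == predicate
                    and f.invalid_at is None):
                f.invalid_at = at
        self.facts.append(Fact(subject, predicate, obj, valid_at=at))

    def current(self, subject: str, predicate: str) -> list[Fact]:
        return [f for f in self.facts
                if f.subject == subject and f.predicate == predicate
                and f.invalid_at is None]

g = TemporalGraph()
g.assert_fact("alice", "employer", "Acme", datetime(2023, 1, 1))
g.assert_fact("alice", "employer", "Globex", datetime(2024, 6, 1))
print([f.obj for f in g.current("alice", "employer")])  # ['Globex']
```

Note that the superseded Acme fact remains in the store with its invalidation timestamp, so the graph can still answer questions about what was true in 2023.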

Graphiti is the underlying technology behind Zep’s memory layer. It is an open source library that builds dynamic, temporally-aware knowledge graphs that represent complex, evolving relationships between entities over time. It ingests both unstructured and structured data, and the resulting graph may be queried using a fusion of time, full-text, semantic, and graph algorithm approaches.
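To give a flavor of how independent retrieval signals can be fused into a single ranking, here is a reciprocal rank fusion (RRF) sketch. This is illustrative only: the item ids and rankings are hypothetical, and Graphiti's actual fusion and reranking logic may differ.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one, RRF-style.

    Each ranking lists item ids best-first. Items near the top of
    multiple rankings accumulate the highest combined score.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda item: scores[item], reverse=True)

# Hypothetical rankings of graph edges from three retrievers:
fulltext = ["fact:3", "fact:1", "fact:7"]
semantic = ["fact:1", "fact:3", "fact:9"]
recency  = ["fact:9", "fact:1", "fact:3"]
print(reciprocal_rank_fusion([fulltext, semantic, recency])[:2])
```

A fact ranked highly by several retrievers (here `fact:1`) wins even when no single retriever ranks it first, which is why fusion tends to be more robust than any one signal alone.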


The Limits of the Deep Memory Retrieval Evaluation

Unlike the MemGPT team's approach of using recursive transcript summarizations for their baseline evaluation, we tested the Deep Memory Retrieval (DMR) benchmark by including the complete evaluation conversation in the context window.

DMR Scores with Full Conversation in Context Window

Memory              GPT-4-Turbo   GPT-4o-mini
Zep                 94.8%         98.2%
Full-conversation   94.4%         98.0%
MemGPT†             93.4%         —

† Results reported in the MemGPT paper

Both gpt-4o-mini and the legacy gpt-4-turbo model achieved exceptional scores, surpassing MemGPT's reported results. This indicated that the test transcripts' length and complexity weren't challenging enough to properly evaluate contemporary LLMs or the complex problems agents will need to solve.

LongMemEval: A high-quality, challenging conversational dataset

The LongMemEval dataset consists of 500 human-curated question-answer pairs, with answers embedded within a scalable set of user-assistant chat histories. The dataset is designed to test beyond simple fact recall with many tasks requiring complex temporal reasoning.

Question: Which three events happened in the order from first to last: the day I helped my friend prepare the nursery, the day I helped my cousin pick out stuff for her baby shower, and the day I ordered a customized phone case for my friend's birthday?

Golden answer: First, I helped my friend prepare the nursery, then I helped my cousin pick out stuff for her baby shower, and lastly, I ordered a customized phone case for my friend's birthday.

gpt-4o-mini, full context (incorrect): Based on the context provided, the order of events from first to last is as follows:
1. The day you helped your cousin pick out stuff for her baby shower.
2. The day you helped your friend prepare the nursery.
3. The day you ordered a customized phone case for your friend's birthday.

Zep + gpt-4o-mini (correct): The events happened in the following order from first to last:
1. You helped your friend prepare the nursery (2023-02-05).
2. You helped your cousin pick out items for her baby shower (2023-02-10).
3. You ordered a customized phone case for your friend's birthday (2023-02-20).

Due to gpt-4o's 128,000 token context window limitation, we chose to use the LONGMEMEVAL_S dataset—containing approximately 115,000 tokens per problem—rather than the larger LONGMEMEVAL_M dataset with 1,500,000 tokens per problem.


Sample LongMemEval Chat History Captured as a Zep Graph

Our Evaluation Methodology

We simplified the LongMemEval team's proposed test methodology: as in our work with the DMR benchmark, we baselined against the full chat history placed in the context window.

Baseline Evaluation Methodology

When working with Zep, we first ingested the chat history datasets using Zep's built-in message load capabilities. gpt-4o-mini was used internally by the Zep service to construct the knowledge graph.

To evaluate Zep's performance, we queried Zep using the test question unaltered, with Zep returning relevant memory in the form of a Context field. This was then added to the prompt alongside the question, and passed into the LLM.
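The flow just described might look like the following sketch. The prompt wording and the commented-out retrieval call are illustrative assumptions, not the exact code used in the evaluation.

```python
def build_prompt(question: str, zep_context: str) -> str:
    """Assemble the evaluation prompt: Zep's returned Context field is
    placed alongside the unaltered test question. The wording here is
    illustrative, not the exact prompt from the paper."""
    return (
        "You are answering a question about a prior conversation.\n\n"
        f"CONTEXT:\n{zep_context}\n\n"
        f"QUESTION: {question}\n"
        "Answer concisely, using only the context above."
    )

# context_field = zep.memory.get(session_id).context  # hypothetical retrieval call
context_field = "FACT (2023-02-05): user helped a friend prepare the nursery."
prompt = build_prompt("When did I help prepare the nursery?", context_field)
```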

Zep Retrieval and Evaluation Methodology

An LLM-as-judge was then used to evaluate each result against the golden answer. Further detail may be found in the paper.
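An LLM-as-judge step of this kind can be sketched as a prompt template. The rubric below is a hypothetical illustration; the paper describes the actual judging setup.

```python
JUDGE_TEMPLATE = """You are grading an answer against a gold reference.
Question: {question}
Gold answer: {golden}
Model answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_prompt(question: str, golden: str, answer: str) -> str:
    # Illustrative LLM-as-judge rubric; the paper's exact prompt may differ.
    return JUDGE_TEMPLATE.format(question=question, golden=golden, answer=answer)

# verdict = llm(judge_prompt(q, gold, model_answer))  # hypothetical LLM call
```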

Excelling at Enterprise-Critical Tasks

Zep's ability to maintain multiple temporal versions of facts, trace the lineage of information changes, and automatically construct a historical narrative of how knowledge has evolved makes it a distinctly different approach from both the current state of the art in agent memory, as well as advanced approaches to RAG.

As evidenced below, relevant results from Zep's temporal knowledge graph enable agents to reason about causality, track the evolution of ideas and facts, and understand the context of changes—capabilities that go far beyond simple fact retrieval.

Question Type               Model         Full-context   Zep     Delta
single-session-preference   gpt-4o-mini   30.0%          53.3%   77.7%↑
single-session-assistant    gpt-4o-mini   81.8%          75.0%   9.06%↓
temporal-reasoning          gpt-4o-mini   36.5%          54.1%   48.2%↑
multi-session               gpt-4o-mini   40.6%          47.4%   16.7%↑
knowledge-update            gpt-4o-mini   76.9%          74.4%   3.36%↓
single-session-user         gpt-4o-mini   81.4%          92.9%   14.1%↑

single-session-preference   gpt-4o        20.0%          56.7%   184%↑
single-session-assistant    gpt-4o        94.6%          80.4%   17.7%↓
temporal-reasoning          gpt-4o        45.1%          62.4%   38.4%↑
multi-session               gpt-4o        44.3%          57.9%   30.7%↑
knowledge-update            gpt-4o        78.2%          83.3%   6.52%↑
single-session-user         gpt-4o        81.4%          92.9%   14.1%↑

These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.

The decrease in performance on single-session-assistant questions (17.7% for gpt-4o and 9.06% for gpt-4o-mini) is the notable exception to Zep's otherwise consistent improvements. Closing this gap is an area of active research and engineering work.

Zep's Performance Scales with Model Capability

Zep's performance improvement over the baseline "needle in a haystack" result increased with model capability. When Zep was paired with gpt-4o, an aggregate 18.5% improvement over the baseline was seen, versus gpt-4o-mini's 15.2% improvement.

We believe this is due to the larger model's ability to better utilize the density and temporal complexity of Zep's memory results. As frontier models evolve, we expect to further increase the information density of Zep's memory results, taking advantage of these models' enhanced capabilities.

Additionally, we expect advancements in model capabilities to significantly enhance our graph construction accuracy, allowing for more sophisticated and nuanced relationship mapping.

Next Steps