Zep Is The New State of the Art In Agent Memory
Setting a new standard for agent memory with up to 100% accuracy gains and 90% lower latency.
With OpenAI's o-series models and advancements from other vendors suggesting the near-term emergence of agents capable of solving highly complex, Ph.D.-level problems, we need to rethink how these agents will access critical information. As agents become pervasive in our daily lives, they'll need access to a vast collection of continuously evolving data spanning user interactions, business operations, and world events.
This data universe will only expand. While we've seen rapid increases in LLM context window sizes and improved recall capabilities, our research shows that recall remains challenging. Even when solved, the computational demands of processing multi-million-token context windows—along with the resulting latency and power consumption—will remain impractical for many real-world use cases.
Zep is a temporal knowledge graph-based memory layer for AI agents that continuously learns from user interactions and changing business data. By providing agents with a complete, holistic view of each user, Zep enables developers to build applications that tackle complex, personalized tasks.
Redefining State of the Art Agent Memory
In research published today, we demonstrate that Zep outperforms the current state-of-the-art memory system, MemGPT (Letta AI), in the Deep Memory Retrieval (DMR) benchmark—the primary evaluation metric established by the Letta/MemGPT team. However, recent advances in LLM capabilities have revealed the DMR benchmark's limitations in assessing real-world performance.
More significantly, Zep excels in the LongMemEval benchmark, a comprehensive and challenging chat history memory evaluation that better reflects real-world enterprise use cases. In this benchmark, Zep delivers aggregate accuracy improvements of up to 18.5%, with individual evaluations showing gains exceeding 100% compared to using the full chat transcript in the context window, all while reducing response latency by 90%.
Notably, Zep's performance advantage over the "needle in a haystack" baseline was more pronounced when paired with gpt-4o as the agent LLM than with the smaller gpt-4o-mini model. While the baseline improved when moving from gpt-4o-mini to gpt-4o, the larger model leveraged Zep's temporal context more effectively.
The compute costs of placing full chat transcripts into the context window are very real. Evaluation with Zep utilized, on average, less than 2% of the baseline tokens while offering latency reductions of an order of magnitude.
| Memory | Model | Median Latency | Latency IQR | Context Tokens |
|---|---|---|---|---|
| Zep | GPT-4o | 2.58 s | 0.684 s | 1.6k |
| Zep | GPT-4o-mini | 3.20 s | 1.31 s | 1.6k |
| Full-context | GPT-4o | 28.9 s | 6.01 s | 115k |
| Full-context | GPT-4o-mini | 31.3 s | 8.76 s | 115k |
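The savings claimed above can be sanity-checked with a few lines of arithmetic using the figures from the table (this is an illustration, not the benchmark code):

```python
# Approximate figures from the table above (GPT-4o rows).
zep_tokens, full_tokens = 1_600, 115_000
zep_latency, full_latency = 2.58, 28.9  # median seconds

token_fraction = zep_tokens / full_tokens        # share of baseline tokens used
latency_reduction = 1 - zep_latency / full_latency

print(f"{token_fraction:.1%} of baseline tokens")    # ~1.4%, i.e. under 2%
print(f"{latency_reduction:.0%} latency reduction")  # ~91%, an order of magnitude
```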
How Zep Works
Zep automatically constructs a temporal knowledge graph from agent user interactions and evolving business data. This knowledge graph contains entities, relationships, and facts related to the user and business context. Zep takes inspiration from GraphRAG, but unlike GraphRAG, Zep maintains a rich temporal understanding of how information evolves over time. As facts change or are superseded, the graph is updated to reflect the new state.
Graphiti is the underlying technology behind Zep’s memory layer. It is an open source library that builds dynamic, temporally-aware knowledge graphs that represent complex, evolving relationships between entities over time. It ingests both unstructured and structured data, and the resulting graph may be queried using a fusion of time, full-text, semantic, and graph algorithm approaches.
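The temporal bookkeeping described above can be illustrated with a minimal sketch. This is not Graphiti's actual API (the `TemporalEdge` class and `facts_as_of` helper are hypothetical names for this example); it shows only the core idea that facts are stored as edges with validity intervals, and superseded facts are invalidated rather than deleted, so the graph can be queried as of any point in time:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TemporalEdge:
    """A fact linking two entities, with its validity interval."""
    source: str
    target: str
    fact: str
    valid_at: datetime                     # when the fact became true
    invalid_at: Optional[datetime] = None  # set when the fact is superseded

def facts_as_of(edges: list[TemporalEdge], when: datetime) -> list[str]:
    """Return the facts that were valid at a given point in time."""
    return [
        e.fact for e in edges
        if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)
    ]

edges = [
    TemporalEdge("user", "Acme", "works at Acme", datetime(2022, 1, 1),
                 invalid_at=datetime(2024, 6, 1)),  # superseded, not deleted
    TemporalEdge("user", "Globex", "works at Globex", datetime(2024, 6, 1)),
]

print(facts_as_of(edges, datetime(2023, 1, 1)))  # ['works at Acme']
print(facts_as_of(edges, datetime(2025, 1, 1)))  # ['works at Globex']
```

Because the old edge is invalidated rather than removed, the graph retains the lineage of the change, which is what enables the temporal reasoning results shown later.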
The Limits of the Deep Memory Retrieval Evaluation
Unlike the MemGPT team's approach of using recursive transcript summarizations for their baseline evaluation, we tested the Deep Memory Retrieval (DMR) benchmark by including the complete evaluation conversation in the context window.
Both gpt-4o-mini and the legacy gpt-4-turbo model achieved exceptional scores, surpassing MemGPT's reported results. This indicated that the test transcripts' length and complexity weren't challenging enough to properly evaluate contemporary LLMs or the complex problems agents will need to solve.
LongMemEval: A high-quality, challenging conversational dataset
The LongMemEval dataset consists of 500 human-curated question-answer pairs, with answers embedded within a scalable set of user-assistant chat histories. The dataset is designed to test beyond simple fact recall with many tasks requiring complex temporal reasoning.
| | |
|---|---|
| **Question** | Which three events happened in the order from first to last: the day I helped my friend prepare the nursery, the day I helped my cousin pick out stuff for her baby shower, and the day I ordered a customized phone case for my friend's birthday? |
| **Golden Answer** | First, I helped my friend prepare the nursery, then I helped my cousin pick out stuff for her baby shower, and lastly, I ordered a customized phone case for my friend's birthday. |
| **gpt-4o-mini full context (incorrect)** | Based on the context provided, the order of events from first to last is as follows: 1. The day you helped your cousin pick out stuff for her baby shower. 2. The day you helped your friend prepare the nursery. 3. The day you ordered a customized phone case for your friend's birthday. |
| **Zep + gpt-4o-mini (correct)** | The events happened in the following order from first to last: 1. You helped your friend prepare the nursery (2023-02-05). 2. You helped your cousin pick out items for her baby shower (2023-02-10). 3. You ordered a customized phone case for your friend's birthday (2023-02-20). |
Due to gpt-4o's 128,000-token context window limit, we chose the LONGMEMEVAL_S dataset—containing approximately 115,000 tokens per problem—rather than the larger LONGMEMEVAL_M dataset, which contains 1,500,000 tokens per problem.
Our Evaluation Methodology
We simplified the LongMemEval team's proposed test methodology: as with the DMR benchmark, we baselined against the full chat history placed into the context window.
When working with Zep, we first ingested the chat history datasets using Zep's built-in message load capabilities. gpt-4o-mini was used internally by the Zep service to construct the knowledge graph.
To evaluate Zep's performance, we queried Zep using the test question unaltered, with Zep returning relevant memory in the form of a Context field. This was then added to the prompt alongside the question, and passed into the LLM.
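The retrieve-then-prompt step can be sketched as follows. The `build_prompt` helper is illustrative, not the exact prompt used in the evaluation, and the `context` string stands in for the Context field returned by Zep's memory API:

```python
def build_prompt(context: str, question: str) -> str:
    """Combine retrieved memory context with the unaltered test question."""
    return (
        "Use the following memory context to answer the question.\n\n"
        f"<CONTEXT>\n{context}\n</CONTEXT>\n\n"
        f"Question: {question}"
    )

# Hypothetical usage: `context` would come from querying Zep with the
# test question; the resulting prompt is passed to the response LLM.
prompt = build_prompt(
    "User helped a friend prepare the nursery on 2023-02-05.",
    "When did I help prepare the nursery?",
)
print(prompt)
```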
An LLM-as-judge was then used to evaluate the result against the golden answer. Further detail may be found in the paper.
Excelling at Enterprise-Critical Tasks
Zep's ability to maintain multiple temporal versions of facts, trace the lineage of information changes, and automatically construct a historical narrative of how knowledge has evolved makes it a distinctly different approach from both the current state of the art in agent memory and advanced RAG techniques.
As evidenced below, relevant results from Zep's temporal knowledge graph enable agents to reason about causality, track the evolution of ideas and facts, and understand the context of changes—capabilities that go far beyond simple fact retrieval.
| Question Type | Model | Full-context | Zep | Delta |
|---|---|---|---|---|
| single-session-preference | gpt-4o-mini | 30.0% | 53.3% | 77.7%↑ |
| single-session-assistant | gpt-4o-mini | 81.8% | 75.0% | 9.06%↓ |
| temporal-reasoning | gpt-4o-mini | 36.5% | 54.1% | 48.2%↑ |
| multi-session | gpt-4o-mini | 40.6% | 47.4% | 16.7%↑ |
| knowledge-update | gpt-4o-mini | 76.9% | 74.4% | 3.36%↓ |
| single-session-user | gpt-4o-mini | 81.4% | 92.9% | 14.1%↑ |
| single-session-preference | gpt-4o | 20.0% | 56.7% | 184%↑ |
| single-session-assistant | gpt-4o | 94.6% | 80.4% | 17.7%↓ |
| temporal-reasoning | gpt-4o | 45.1% | 62.4% | 38.4%↑ |
| multi-session | gpt-4o | 44.3% | 57.9% | 30.7%↑ |
| knowledge-update | gpt-4o | 78.2% | 83.3% | 6.52%↑ |
| single-session-user | gpt-4o | 81.4% | 92.9% | 14.1%↑ |
These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.
The decrease in performance for single-session-assistant questions—17.7% for gpt-4o and 9.06% for gpt-4o-mini—represents a notable exception to Zep's otherwise consistent improvements. These are areas of current research and engineering work.
Zep's Performance Scales with Model Capability
Zep's performance improvement over the baseline "needle in a haystack" result increased with model capability: paired with gpt-4o, Zep achieved an aggregate 18.5% improvement over the baseline, versus a 15.2% improvement with gpt-4o-mini.
We believe this is due to the larger model's ability to better utilize the density and temporal complexity of Zep's memory results. As frontier models evolve, we expect to further increase the information density of Zep's memory results, taking advantage of these models' enhanced capabilities.
Additionally, we expect advancements in model capabilities to significantly enhance our graph construction accuracy, allowing for more sophisticated and nuanced relationship mapping.
Next Steps
- Read the paper Zep: A Temporal Knowledge Graph Architecture for Agent Memory
- Visit the GitHub repo
- Sign up for a Zep account