Using Zep's open source embedding feature has the following benefits:
- 🚀 Fast! The model we're using is fast on a CPU and far faster than a call to the OpenAI embedding API. This is particularly important for vector search over message histories, where we've seen latency drop by up to 70%.
- 💸 Free! You're running Zep already, why not use it for embeddings?
- 🥑 Lower storage requirements! Our selected embedding model produces 768-dimensional vectors versus OpenAI's 1,536-dimensional vectors, translating to a ~50% savings on vector storage.
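The storage savings follow directly from the vector widths. A quick back-of-the-envelope calculation, assuming each element is stored as a 4-byte float:

```python
# Raw storage per vector, assuming 4-byte float32 elements.
OPENAI_DIMS = 1536
LOCAL_DIMS = 768
BYTES_PER_ELEMENT = 4

openai_bytes = OPENAI_DIMS * BYTES_PER_ELEMENT  # 6144 bytes per vector
local_bytes = LOCAL_DIMS * BYTES_PER_ELEMENT    # 3072 bytes per vector

savings = 1 - local_bytes / openai_bytes
print(f"{savings:.0%} less raw vector storage")  # 50% less raw vector storage
```

Index structures add overhead on top of the raw vectors, but they also scale with dimensionality, so the savings carry through.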
Why is embedding latency important?
While documents in a vector store are usually embedded before prompt creation, search term embeddings must be created on the fly when the search is executed. This puts the embedding speed on the critical path to generating an LLM result.
Zep stores an LLM app's chat history and offers vector search over this memory, enabling long-term context to be added to a prompt. Using a local embedding model dramatically reduces round-trip time to search for historical memories.
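The critical-path point is easiest to see in a stripped-down retrieval flow. This is an illustrative sketch only, not Zep's actual code; `embed`, `vector_search`, and the toy store are hypothetical stand-ins:

```python
def embed(text: str) -> list[float]:
    # Stand-in for the embedding model; in Zep this runs in the NLP server.
    return [float(len(text))]

def vector_search(query_vec: list[float], store) -> str:
    # Stand-in nearest-neighbour lookup over pre-embedded documents.
    return min(store, key=lambda item: abs(item[0][0] - query_vec[0]))[1]

def answer(question: str, store) -> str:
    # Documents were embedded at write time, but the *query* must be
    # embedded here, at request time -- its latency adds directly to the
    # time-to-first-token of the LLM response.
    query_vec = embed(question)
    context = vector_search(query_vec, store)
    return f"prompt with context: {context}"

store = [
    (embed("short doc"), "short doc"),
    (embed("a much longer document"), "a much longer document"),
]
print(answer("short ask", store))  # prompt with context: short doc
```

Because the query embedding cannot be precomputed, shaving its latency shortens every search-augmented request.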
How to enable Local Embeddings
Local embeddings are not available if you're upgrading an existing installation. This functionality is currently only available for new installations.
If you're running Zep locally with `docker compose`, edit the `docker-compose.yml` file and set the `ENABLE_EMBEDDINGS` environment variable to `true` in the `zep-nlp` container's environment:

```yaml
nlp:
  image: ghcr.io/getzep/zep-nlp-server:latest
  container_name: zep-nlp
  environment:
    - ENABLE_EMBEDDINGS=true
```
Before running Zep for the first time, edit your `config.yaml` file: uncomment the `local` model type and `768` dimensions lines, and comment out the lines referring to the OpenAI `AdaEmbeddingV2` model. The `embeddings` section in the file should look as follows:

```yaml
embeddings:
  enabled: true
  # dimensions: 1536
  # model: "AdaEmbeddingV2"
  dimensions: 768
  model: "local"
```
Alternatively, set the corresponding environment variables in the Zep container's environment or your `.env` file.
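For reference, a sketch of what those variables might look like, assuming Zep's usual `ZEP_`-prefixed mapping of config keys to environment variables; verify the exact names against your Zep version's configuration documentation:

```sh
# Assumed names based on Zep's config-key-to-env-var convention;
# check your version's docs before relying on them.
ZEP_EMBEDDINGS_ENABLED=true
ZEP_EMBEDDINGS_DIMENSIONS=768
ZEP_EMBEDDINGS_MODEL=local
```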
Why are Local Embeddings "experimental"?
While relatively stable, this feature is not yet fully baked:
- We've not developed a migration path for existing users. Existing messages are not embedded with the new model, and the existing index remains in use.
- We've not yet found a way to optionally deploy Zep with embeddings enabled to Render or other platforms with our default configurations. This is still possible, but DIY. See the environment variable requirements above. You will also need to ensure you have 1GB of memory available for the NLP server container.
- We'd like to understand our users' experience with the model we've selected versus OpenAI's embedding API before we promote this feature to stable.
Which model is shipped with Zep?
We use the SentenceTransformers `multi-qa-mpnet-base-dot-v1` model for embedding messages. This model was trained specifically for semantic search on a large and diverse set of (question, answer) pairs. We selected it for its accuracy, performance, and relatively low memory footprint.
Other important changes in this release
- We're now using the dot (inner) product to calculate search distance. This may affect distance values (or scores, if you're using Langchain). If you filter by distance value, please experiment with the new distances before updating.
- When embeddings are enabled, the Zep NLP server uses ~420MiB more memory and requires a model download on first run. Please ensure that your deployment environment has approx. 1GB of memory for this container and allows outbound internet access.
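The practical effect of the distance change can be seen with a small example. Unlike cosine similarity, dot-product scores are unbounded and scale with vector magnitude, so hard-coded distance thresholds may need retuning (plain-Python sketch):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]  # magnitude 5
b = [6.0, 8.0]  # same direction, magnitude 10

print(cosine(a, b))  # 1.0  (identical direction, bounded in [-1, 1])
print(dot(a, b))     # 50.0 (scales with vector magnitudes)
```

Note that `multi-qa-mpnet-base-dot-v1` is itself trained for dot-product scoring, which is why the distance function changed alongside the model.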