Pierre Kasparian · AI & Data freelancer
AI agents · RAG production · vector database · Elasticsearch · LLM pipeline

Persistent AI Agent Memory with Elasticsearch

June 19, 2026 · 7 min read · Guides

Pierre Kasparian

AI Engineer — UTT 4th year · LLM, RAG & GDPR compliance specialist · 15+ client projects

AI agents have a memory problem. Not RAM, but continuity: the moment a session closes, everything a user said, tried, or resolved disappears. The context window is a working buffer, not a memory system.

Direct answer: persistent AI agent memory is an external store split into several memory types (episodic, semantic, procedural), queryable by content and by time, with strict per-user data isolation. The Elasticsearch Labs team published a complete implementation of this design, reporting a recall@10 (R@10) of 0.89 on 168 questions with zero cross-tenant leaks.

Why the context window is not enough

Injecting the full history of past exchanges into the context seems like a simple solution. It breaks quickly in production for three reasons.

Cost and latency. Each call grows as sessions accumulate. For a production assistant with hundreds of active users, inference cost becomes a real problem.

The "lost in the middle" effect. Models tend to ignore information placed far from the edges of the context window. A critical fact mentioned three sessions ago can simply vanish from the model's attention.

Non-persistence. A crash, a disconnection, an expired session: everything is lost. For an assistant that accompanies a client over several weeks, this is a dealbreaker.

Three indices: episodic, semantic, procedural

The architecture relies on three separate Elasticsearch indices, inspired by the CoALA framework (Cognitive Architectures for Language Agents), which borrows its memory taxonomy from cognitive psychology.

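| Type       | Content                                              | Lifetime                            |
|------------|------------------------------------------------------|-------------------------------------|
| Episodic   | Each user message, timestamped                       | Short-term, decays over time        |
| Semantic   | Stable facts about the user ("Sarah has a Hub v2")   | Long-term, updated via supersession |
| Procedural | Multi-step playbooks with success/failure counters   | Long-term, refined by feedback      |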

This separation gives each memory type its own lifecycle: episodic memory accumulates fast and decays, semantic memory is curated and updated, procedural memory improves with user feedback.
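To make the three record types concrete, here is a minimal sketch of what a document in each index might carry. Only user_id, supporting_episode_ids, superseded_by, superseded_at, success_count and failure_count are named in this article; every other field name (text, title, steps, timestamp), and the choice to put user_id on playbooks, is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EpisodicMemory:
    """One user message, timestamped; short-lived and subject to decay."""
    user_id: str
    text: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class SemanticMemory:
    """A stable fact about the user, replaced via supersession rather than deletion."""
    user_id: str
    text: str
    supporting_episode_ids: list[str] = field(default_factory=list)
    superseded_by: Optional[str] = None      # set when a newer fact replaces this one
    superseded_at: Optional[datetime] = None


@dataclass
class ProceduralMemory:
    """A multi-step playbook whose counters are refined by user feedback."""
    user_id: str  # assumption: playbooks are scoped per user like the other memories
    title: str
    steps: list[str] = field(default_factory=list)
    success_count: int = 0
    failure_count: int = 0


fact = SemanticMemory(user_id="sarah", text="Sarah has a Hub v2")
playbook = ProceduralMemory(user_id="sarah", title="Hub v2 offline", steps=["Power-cycle the hub"])
```

Keeping the three shapes separate mirrors the separate lifecycles: only semantic records need supersession fields, and only procedural records need counters.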

Alongside these three indices, a fourth one is queried read-only: the product catalog or company knowledge base. The agent queries it through the same pipeline as its memories, without additional friction.

The recall pipeline: hybrid retrieval + cross-encoder

Each recall query goes through two stages.

Stage 1: hybrid retrieval with RRF.

The document is indexed twice from a single write: the raw text feeds the BM25 inverted index, and the same text is routed via copy_to to a semantic_text field that automatically generates Jina v5 dense vectors. Both legs are fused by Reciprocal Rank Fusion (RRF) with rank_constant=30, giving more weight to top-ranked positions than Elasticsearch's default of 60.
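The single-write, double-index setup described above can be sketched as an index mapping. This is an illustration under assumptions, not the published implementation: the field names text, text_semantic, user_id and timestamp, the index name in the comment, and the inference endpoint id are all hypothetical; only copy_to and the semantic_text field type come from the description.

```python
# Hypothetical mapping for the episodic index.
EPISODIC_MAPPING = {
    "properties": {
        "user_id": {"type": "keyword"},
        "timestamp": {"type": "date"},
        # Raw text: analyzed for BM25 and copied into the semantic field on write.
        "text": {"type": "text", "copy_to": "text_semantic"},
        # semantic_text: embeddings are generated at index time by the
        # configured inference endpoint (the id below is an assumption).
        "text_semantic": {
            "type": "semantic_text",
            "inference_id": "my-jina-embeddings",
        },
    }
}

# With an elasticsearch-py 8.x client, the index could then be created with:
# es.indices.create(index="agent_memory_episodic", mappings=EPISODIC_MAPPING)
```

The application only ever writes the text field; the dense leg is a side effect of the mapping, so the two representations should not drift apart.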

BM25 anchors literal matches (error codes, version numbers, proper nouns). Dense vectors capture the semantic shape of a question whose answer uses different words. Neither alone covers all cases: together they do.
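To see what the rank constant does, here is a small pure-Python sketch of Reciprocal Rank Fusion, where each document scores the sum of 1 / (rank_constant + rank) over the legs that returned it. The document ids and the function name are made up for the example; Elasticsearch performs this fusion server-side.

```python
def rrf_fuse(rankings, rank_constant=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document receives 1 / (rank_constant + rank) from every list that
    contains it, with ranks starting at 1. Returns ids sorted by fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ordered]


def top_rank_advantage(rank_constant, low_rank=10):
    """How many times more a rank-1 hit contributes than a hit at low_rank."""
    return (rank_constant + low_rank) / (rank_constant + 1)


bm25_leg = ["a", "b", "c"]    # literal matches
dense_leg = ["c", "a", "d"]   # semantic matches
fused = rrf_fuse([bm25_leg, dense_leg], rank_constant=30)
```

With rank_constant=30, a rank-1 hit contributes 40/31 (about 1.29) times what a rank-10 hit does; with the default of 60 the ratio drops to 70/61 (about 1.15). That is the sense in which the lower constant favors top-ranked positions.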

Stage 2: Jina v2 cross-encoder reranker.

The hybrid retriever over-fetches 80 candidates per leg. These candidates are then reranked by a Jina v2 cross-encoder, which scores each (query, document) pair jointly rather than comparing independent embeddings. This is the same principle as a cross-encoder reranker in a RAG pipeline: cheap broad retrieval, then precise reranking.
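The two stages can be expressed as one nested retriever definition: an rrf retriever with a lexical and a semantic leg, wrapped in a text_similarity_reranker. The sketch below only builds the request body. The field names text and text_semantic and the reranker endpoint id are assumptions, and the exact shape used by the published implementation may differ.

```python
def build_recall_retriever(question, rank_constant=30, window=80):
    """Build a hybrid (BM25 + semantic) retriever fused by RRF, then reranked."""
    hybrid = {
        "rrf": {
            "retrievers": [
                # Lexical leg: BM25 over the raw text field.
                {"standard": {"query": {"match": {"text": question}}}},
                # Dense leg: semantic query against the semantic_text field.
                {"standard": {"query": {"semantic": {
                    "field": "text_semantic", "query": question}}}},
            ],
            "rank_constant": rank_constant,
            "rank_window_size": window,  # candidates fetched per leg
        }
    }
    return {
        "text_similarity_reranker": {
            "retriever": hybrid,
            "field": "text",                      # text shown to the cross-encoder
            "inference_id": "my-jina-reranker",   # hypothetical rerank endpoint id
            "inference_text": question,
            "rank_window_size": window,
        }
    }


retriever = build_recall_retriever("Why does my hub keep going offline?")

# With a recent elasticsearch-py client this could be sent as:
# es.search(index="agent_memory_semantic", retriever=retriever, size=10)
```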

This pipeline achieves R@10 of 0.89 on a 168-question evaluation set.
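For readers who want to reproduce this kind of measurement on their own data, here is one common way to compute recall@k. The article does not spell out the exact definition behind the 0.89 figure, so the per-question averaging below is an assumption, and the ids are made up.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("each question needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, k=10):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["d1", "d2", "d3"], {"d2"}),        # the single relevant doc is found
    (["d4", "d5"], {"d4", "d9"}),        # one of two relevant docs is found
]
score = mean_recall_at_k(runs, k=10)
```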

Multi-tenant isolation with DLS

This is the point I find most important for enterprise use: data isolation between users.

Elasticsearch Document Level Security (DLS) lets you attach restrictions directly to API keys. Each user has a key that can only read their own documents. The restriction is enforced by Elasticsearch itself, document by document, not in application code. Result: zero cross-tenant leaks across all tests, with no isolation logic to write or maintain in the application.

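from elasticsearch import Elasticsearch

# Client authenticated with the end user's own API key (URL and key are placeholders)
es = Elasticsearch("https://localhost:9200", api_key="<user-scoped-api-key>")

# Each query is automatically filtered by user_id via DLS
# No manual application-level filters needed
recall_results = es.search(
    index="agent_memory_semantic",
    query=hybrid_query,  # lexical query clause, built elsewhere
    knn=dense_query,     # dense-vector clause, built elsewhere
    # DLS behaves as if this were added: filter: { term: { user_id: current_user } }
)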

For GDPR contexts, storage-level isolation is far more robust than application-level isolation: even a bug in the application code cannot expose one user's data to another.
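A per-user key of this kind could be described with role descriptors that carry a DLS query. The sketch below only builds the descriptor; the role name, the episodic and procedural index names, and the user_id field are assumptions for illustration. The key gets only the read privilege here, since DLS governs read access.

```python
def user_role_descriptors(user_id, indices):
    """Role descriptors restricting reads to documents owned by user_id."""
    return {
        f"memory-reader-{user_id}": {
            "indices": [
                {
                    "names": list(indices),
                    "privileges": ["read"],
                    # Document Level Security: only matching docs are visible.
                    "query": {"term": {"user_id": user_id}},
                }
            ]
        }
    }


MEMORY_INDICES = [
    "agent_memory_episodic",    # hypothetical name
    "agent_memory_semantic",
    "agent_memory_procedural",  # hypothetical name
]
descriptors = user_role_descriptors("sarah", MEMORY_INDICES)

# With a client allowed to manage API keys, the key could be created with:
# es.security.create_api_key(name="memory-sarah", role_descriptors=descriptors)
```

Every search made with the resulting key sees only that user's documents, whatever query the application sends.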

Writing and consolidation

Episodic writing. Each user message is written to the episodic index before the LLM responds. Agent responses are not stored: they are carried by the conversation history and would dilute the signal-to-noise ratio.
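The ordering matters: the write happens before the model is called, and the reply is not stored. A minimal sketch of that turn loop, with the storage and LLM calls injected as plain functions so the ordering is easy to test. The function names and document fields are hypothetical.

```python
from datetime import datetime, timezone


def build_episode(user_id, message, now=None):
    """Build the episodic document for one user message."""
    now = now or datetime.now(timezone.utc)
    return {"user_id": user_id, "text": message, "timestamp": now.isoformat()}


def handle_turn(user_id, message, write_episode, generate_reply):
    """Persist the user message first, then ask the LLM for a reply."""
    write_episode(build_episode(user_id, message))  # survives a crash during generation
    return generate_reply(message)                  # the reply itself is not stored


# Stand-ins that record the order of calls.
log = []
reply = handle_turn(
    "sarah",
    "My hub is offline again",
    write_episode=lambda doc: log.append(("write", doc["text"])),
    generate_reply=lambda msg: (log.append(("llm", msg)) or "Let's check the hub."),
)
```

In a real deployment, write_episode could wrap something like es.index(index="agent_memory_episodic", document=doc), where the index name is again an assumption.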

Consolidation. Periodically (or at every turn in demo mode), a consolidation LLM reviews recent episodes and extracts:

  • New semantic facts with their supporting_episode_ids
  • New procedural playbooks if a multi-step resolution has no existing match
  • Counter updates (success_count++ / failure_count++) if the user confirms or rejects a fix
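The three outputs of a consolidation pass can be modeled as a small result object that is then turned into write operations. This is a sketch under assumptions: the class, the function names, the episodic and procedural index names, and the use of a scripted update for the counters are illustrative, not taken from the published implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConsolidationResult:
    """What the consolidation LLM extracted from recent episodes."""
    new_facts: list[dict] = field(default_factory=list)      # each carries supporting_episode_ids
    new_playbooks: list[dict] = field(default_factory=list)
    feedback: list[tuple[str, bool]] = field(default_factory=list)  # (playbook_id, confirmed)


def counter_script(confirmed):
    """Scripted update that increments the matching counter by one."""
    name = "success_count" if confirmed else "failure_count"
    return {"source": f"ctx._source.{name} += params.n", "params": {"n": 1}}


def plan_writes(result):
    """Translate a consolidation result into (action, index, payload) tuples."""
    ops = []
    for fact in result.new_facts:
        ops.append(("index", "agent_memory_semantic", fact))
    for playbook in result.new_playbooks:
        ops.append(("index", "agent_memory_procedural",
                    {"success_count": 0, "failure_count": 0, **playbook}))
    for playbook_id, confirmed in result.feedback:
        ops.append(("update", "agent_memory_procedural",
                    {"id": playbook_id, "script": counter_script(confirmed)}))
    return ops


result = ConsolidationResult(
    new_facts=[{"text": "Sarah has a Hub v2", "supporting_episode_ids": ["ep-1"]}],
    feedback=[("pb-7", True)],
)
ops = plan_writes(result)
```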

Supersession. When a new message conflicts with a stored fact ("I moved to Edinburgh"), the agent does not delete the previous fact. It marks it with superseded_by=new_id and superseded_at=now. Standard recall automatically excludes superseded facts (a must_not clause with an exists check on superseded_by). The audit trail remains intact in the index for queries like "where has she lived?".
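Supersession comes down to one partial update on the old fact and one filter on recall. A minimal sketch, assuming the field names given above and hypothetical function names:

```python
from datetime import datetime, timezone


def supersede_doc(new_id, now=None):
    """Partial document marking an old fact as replaced by new_id."""
    now = now or datetime.now(timezone.utc)
    return {"superseded_by": new_id, "superseded_at": now.isoformat()}


def current_facts_query(text_query):
    """Wrap a query so that superseded facts are excluded from recall."""
    return {
        "bool": {
            "must": [text_query],
            "must_not": [{"exists": {"field": "superseded_by"}}],
        }
    }


def history_query(text_query):
    """Audit-style query: superseded facts stay visible."""
    return {"bool": {"must": [text_query]}}


patch = supersede_doc("fact-42", now=datetime(2026, 6, 19, tzinfo=timezone.utc))
recall = current_facts_query({"match": {"text": "where does she live"}})

# The old fact could then be patched with:
# es.update(index="agent_memory_semantic", id=old_id, doc=patch)
```

Because the exists check only matches documents where the field has a value, live facts must leave superseded_by absent or null.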

What this looks like in production

This architecture makes agents much more useful over long interactions: the agent knows what was tried, what worked, and the user's stable preferences. It does not start from scratch each session.

The complete implementation is available open source on GitHub (linked in the Elasticsearch Labs article). It exposes its tools via the MCP protocol, making it compatible with any agent runtime.

Key production considerations:

  • Per-turn consolidation doubles LLM calls per message. In production, a batch job that runs every 24 hours, or once N unconsolidated episodes have accumulated, is more economical.
  • Episode decay prevents the episodic index from becoming a haystack. Set decay_factor according to your desired lifetime.
  • The success_count / failure_count counters on procedural playbooks are not yet wired into retrieval ranking in the initial implementation. In production, connecting them to the retrieval score automatically surfaces playbooks that have worked.
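The three considerations above can each be reduced to a small function. All of this is a sketch under assumptions: the threshold values, the reading of decay_factor as a per-day multiplier, and the smoothed success rate are illustrative choices, not the behavior of the published implementation.

```python
from datetime import datetime, timedelta, timezone


def should_consolidate(pending_episodes, last_run, now,
                       threshold=50, max_interval=timedelta(hours=24)):
    """Run consolidation when enough episodes piled up or enough time passed."""
    return pending_episodes >= threshold or (now - last_run) >= max_interval


def decayed_score(score, age_days, decay_factor=0.9):
    """Assumed decay model: the score is multiplied by decay_factor per day of age."""
    return score * decay_factor ** age_days


def playbook_boost(success_count, failure_count):
    """Smoothed success rate in (0, 1); 0.5 for a playbook with no feedback yet."""
    return (success_count + 1) / (success_count + failure_count + 2)


now = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)
due_by_time = should_consolidate(10, now - timedelta(hours=25), now)
due_by_volume = should_consolidate(50, now - timedelta(hours=1), now)
not_due = should_consolidate(10, now - timedelta(hours=1), now)
```

A retrieval score multiplied by playbook_boost would favor playbooks that users have confirmed, while the smoothing keeps a brand-new playbook from being ranked as either perfect or useless.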

TL;DR

Robust agent memory relies on three separate indices (episodic, semantic, procedural), hybrid retrieval (BM25 plus dense vectors, followed by a cross-encoder reranker), and DLS tenant isolation. It can be built on Elasticsearch starting from an existing open-source implementation. The result: R@10 = 0.89, zero cross-user data leaks.

If you are integrating AI agents into your company and multi-session memory or data isolation are constraints, let's talk.

About the author

Pierre Kasparian

4th-year engineering student at UTT (University of Technology of Troyes) and AI integration freelancer. He deploys LLMs, RAG pipelines, and AI agents for French and European companies, with strong expertise in GDPR compliance and European hosting. 15+ client projects, including Pretto and LiveSession.