AI agent memory is shifting from vector retrieval to a layered systems design

AI Agents April 8, 2026

Tags: AI Agent, Memory, Long-Term Memory, Multi-Agent, RAG, Workflow

Briefing summary

What this briefing helps you get to quickly

AI Agents Sources: 34

Trace the shift from single-shot retrieval to layered memory built on session state, persistent stores, shared memory, and write policies.

Key Point

AI agent memory is shifting from simple vector retrieval to a layered systems design

Current public documentation as of April 2026 from OpenAI, Anthropic, Google ADK, LangGraph, Letta, Mem0, Microsoft Azure, and AWS Bedrock / AgentCore, read alongside papers such as MemGPT, CoALA, LoCoMo, LongMemEval, A-MEM, MemoryOS, Reflective Memory Management, and MemInsight, shows a clear pattern. Agent memory is no longer just about keeping a longer chat history or attaching a vector database. The practical center of implementation has moved toward a layered design that separates short-term working state, long-term persistent memory, shared memory, and background consolidation or reflection. The most widely adopted pattern today is to separate per-session state from persistent stores and re-inject only what matters. The frontier is adding graph memory, file-backed memory, and asynchronous memory updates on top of that base.

8 stacks

Converging implementation surfaces

OpenAI, Anthropic, Google, LangChain, Letta, Mem0, Microsoft, and AWS now document memory as an explicit systems layer.

34 sources

Public evidence base

Official docs and papers alone are now enough to compare short-term, long-term, shared, and consolidation designs.

4 layers

Emerging default shape

Working memory, persistent memory, shared memory, and consolidation are increasingly separated.

1 conclusion

Layered memory is the mainstream design

The common pattern is not a pure vector database. It is a system that separates state from memory.

What Changed

By spring 2026, memory is being treated as a systems problem rather than a model feature

The clearest signal is that major frameworks no longer explain memory as one box. Google ADK separates sessions, state, memory, context caching, and compaction. LangGraph documents persistence and memory as distinct implementation surfaces, while Deep Agents adds agent-scoped, user-scoped, and organization-level memory plus background consolidation. Letta separates memory blocks, shared memory, and archival memory, and Letta Code adds MemFS as an editable file-backed memory layer. AWS Bedrock summarizes ended sessions into a retained memory keyed by memoryId, and Azure AI Foundry exposes memory stores as part of the agent object. Anthropic Claude Code combines CLAUDE.md with auto memory, while OpenAI's ChatGPT Memory separates saved memories from chat history. The pattern is clear: memory is no longer just a longer context window. It is a system layer that decides where to write, what to keep, and when to retrieve.

Short-term memory is moving into session state

Instead of replaying the full transcript every turn, more systems keep thread, session, checkpoint, and tool state outside the prompt and restore only what is needed.

Long-term memory is moving into persistent stores

User preferences, recurring constraints, learned procedures, and prior session summaries are increasingly stored in retrievable external memory rather than raw chat history.

Shared memory is becoming a separate scope

Project, team, and organization knowledge is increasingly separated from personal memory by namespace, scope, or memory block so retrieval stays stable.

Compression and updates are moving off the hot path

Post-session summarization, context compaction, and reflection agents are increasingly handled asynchronously so token cost and latency do not grow linearly with history.
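To make the off-hot-path idea concrete, here is a minimal sketch of threshold-based compaction. It is not any vendor's API: `CompactingHistory`, `naive_summarize`, and the `threshold` and `overlap` parameters are hypothetical names chosen for illustration, and the summarizer is a placeholder for what would normally be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def naive_summarize(events: List[str]) -> str:
    # Placeholder (assumption): a real system would call a model here.
    return f"[summary of {len(events)} events: {events[0]} ... {events[-1]}]"


@dataclass
class CompactingHistory:
    threshold: int = 6  # compact once this many raw events have accumulated
    overlap: int = 2    # newest raw events kept verbatim after compaction
    summarize: Callable[[List[str]], str] = naive_summarize
    summaries: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0 <= self.overlap < self.threshold:
            raise ValueError("overlap must be >= 0 and smaller than threshold")

    def append(self, event: str) -> None:
        self.events.append(event)

    def compact_if_needed(self) -> bool:
        """Intended to run between turns or in a background job."""
        if len(self.events) < self.threshold:
            return False
        split = len(self.events) - self.overlap
        older, recent = self.events[:split], self.events[split:]
        self.summaries.append(self.summarize(older))
        self.events = recent
        return True

    def prompt_context(self) -> List[str]:
        # What would be re-injected: summaries first, then recent raw events.
        return self.summaries + self.events


history = CompactingHistory(threshold=4, overlap=1)
for i in range(5):
    history.append(f"event-{i}")
    history.compact_if_needed()
print(history.prompt_context())
```

Because older events collapse into a summary each time the threshold is reached, the number of raw events carried into the prompt stays bounded instead of growing with the length of the history.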

Implementation Map

Current memory systems are easiest to understand as at least four layers

1. Working memory: active execution state

This layer holds thread state, checkpoints, tool outputs, and the most recent observations needed for the current task.
It is the fastest and cheapest layer, but also the shortest-lived.
Google ADK session state, LangGraph persistence, and Azure thread or run state all fit here.

2. Persistent memory: long-term individual memory

This layer stores user preferences, recurring constraints, reusable procedures, and important facts from prior work.
Vector retrieval, profile stores, key-value records, and document collections are the most common implementations.
Mem0 user / agent / session memory, Letta archival memory, and AWS session summary retention all fit this layer.

3. Shared memory: project or organizational memory

This layer stores rules, glossaries, and ongoing project state shared across agents or users.
It needs to stay separate from personal memory or retrieval becomes noisy and inconsistent.
Deep Agents organization-level memory, Letta shared memory, and Mem0 org scope are examples.

4. Consolidation layer: compression, reflection, forgetting

This layer decides what to summarize, what to delete, what to promote, and what to rewrite.
It matters for both cost and accuracy, which is why it is becoming a major differentiator in frontier systems.
AWS memory summarization, Google ADK context compaction, and Deep Agents background consolidation are representative examples.
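The four layers above can be sketched as one small in-memory structure. This is an illustrative toy under stated assumptions, not how any of the named frameworks implement memory: `LayeredMemory` and its method names are hypothetical, and plain dictionaries stand in for real session stores and databases.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LayeredMemory:
    # 1. Working memory: per-session execution state, dropped when the session ends.
    working: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # 2. Persistent memory: long-lived items keyed by user.
    persistent: Dict[str, List[str]] = field(default_factory=dict)
    # 3. Shared memory: project or organization knowledge keyed by scope name.
    shared: Dict[str, List[str]] = field(default_factory=dict)

    def set_state(self, session_id: str, key: str, value: str) -> None:
        self.working.setdefault(session_id, {})[key] = value

    def end_session(self, session_id: str, user_id: str,
                    promote_keys: List[str]) -> List[str]:
        """4. Consolidation: promote selected state, forget the rest."""
        state = self.working.pop(session_id, {})
        kept = [f"{k}: {state[k]}" for k in promote_keys if k in state]
        self.persistent.setdefault(user_id, []).extend(kept)
        return kept

    def build_context(self, session_id: str, user_id: str,
                      scope: str) -> Dict[str, object]:
        # Each layer is read separately, so scopes never mix in one lookup.
        return {
            "state": dict(self.working.get(session_id, {})),
            "user_memory": list(self.persistent.get(user_id, [])),
            "shared_memory": list(self.shared.get(scope, [])),
        }


memory = LayeredMemory()
memory.shared["org"] = ["Use ISO dates"]
memory.set_state("s1", "language", "Spanish")
memory.set_state("s1", "scratch", "tmp")
memory.end_session("s1", "ana", promote_keys=["language"])
print(memory.build_context("s2", "ana", "org"))
```

The point of the separation is that ending a session is an explicit decision about what to promote, rather than a side effect of storing everything.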

Why It Matters

This shift matters because memory is becoming a decision substrate, not a storage feature

Accuracy now diverges at the memory layer, not only at the model layer

Two agents using the same model can still behave very differently depending on whether they remember who the user is, what was approved last week, and which constraints are still active.

Sending everything every turn does not scale

ADK context compaction and AWS session summarization both make the same point in public docs: practical systems are already moving away from replaying full histories on every invocation.

Memory write rules are becoming governance rules

What gets saved automatically, what stays read-only, and who can update shared memory are no longer minor implementation details. They define audit and failure boundaries.

The impressive part is not recall alone, but controlled recall

The frontier value is not having more stored text. It is being able to control scope, priority, update timing, and compression rules. That is the real break from older retrieval-only designs.

Read Path

Strong memory systems are defined more by how they retrieve than by where they store

Step 1

Scope first

The system first narrows the search to user, agent, project, or organization scope. Deep Agents and Mem0 emphasize scope because retrieval quality drops quickly when scopes are mixed.

Step 2

Then filter by memory type

Preferences, facts, prior procedures, and shared policies should not be fetched the same way. CoALA's split between episodic, semantic, and procedural memory is useful here.

Step 3

Inject only what is needed

LangGraph and Deep Agents both point toward selective loading instead of eagerly placing all memory into the prompt. This reduces context pollution and token waste.

Step 4

Join retrieved memory with live state

Persistent memory alone is not enough. Good systems combine it with the current task state, unfinished tool calls, and approval flags before acting. Retrieval needs a state join.

The practical implication is that teams often get better results by designing scope filtering and memory type selection before they optimize embeddings. Many failures come less from too little memory than from fetching the wrong memory.
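The four read-path steps can be sketched in a few lines. Everything here is an assumption made for illustration: `MemoryRecord`, `retrieve`, and `build_prompt_inputs` are hypothetical names, the scope strings are invented, and word overlap stands in for embedding similarity so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class MemoryRecord:
    scope: str  # e.g. "user:ana", "project:billing", "org"
    kind: str   # "episodic", "semantic", or "procedural" (CoALA-style labels)
    text: str


def score(query: str, text: str) -> int:
    # Stand-in for embedding similarity (assumption): count shared words.
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(records: List[MemoryRecord], query: str,
             scopes: Set[str], kinds: Set[str], k: int = 2) -> List[MemoryRecord]:
    candidates = [r for r in records if r.scope in scopes]   # Step 1: scope first
    candidates = [r for r in candidates if r.kind in kinds]  # Step 2: filter by type
    ranked = sorted(candidates, key=lambda r: score(query, r.text), reverse=True)
    return [r for r in ranked[:k] if score(query, r.text) > 0]  # Step 3: inject little


def build_prompt_inputs(memories: List[MemoryRecord],
                        live_state: Dict[str, str]) -> Dict[str, object]:
    # Step 4: join retrieved memory with the current task state.
    return {"memories": [m.text for m in memories], "state": dict(live_state)}


records = [
    MemoryRecord("user:ana", "semantic", "ana prefers refunds issued as store credit"),
    MemoryRecord("user:bo", "semantic", "bo prefers refunds by bank transfer"),
    MemoryRecord("org", "procedural", "refunds above 500 need manager approval"),
    MemoryRecord("user:ana", "episodic", "ana asked about shipping delays last week"),
]
hits = retrieve(records, "how should refunds be issued for ana",
                scopes={"user:ana", "org"}, kinds={"semantic", "procedural"})
print(build_prompt_inputs(hits, {"pending_approval": "refund-812"}))
```

Note that the other user's refund preference never reaches the ranking step. The scope filter removes it before similarity is computed, which is the ordering the steps above argue for.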

Write Path

Frontier systems separate not only what to store, but when to store it

Write during the interaction

Confirmed user preferences and durable project rules are often worth updating immediately. This is where writable memory blocks and file-backed memory are useful.

Summarize after the interaction

AWS Bedrock summarizes sessions asynchronously after they end. That is a practical way to retain long interactions without slowing down the active path.

Compact after a threshold

ADK context compaction uses a sliding window and overlap size to summarize older workflow events. This is especially useful in long multi-step tasks where context growth otherwise becomes expensive.

Delegate restructuring to a separate agent

Deep Agents background consolidation and Letta-style sleeptime patterns show a newer approach: let another agent reorganize memory between active runs. This is where memory starts to look like agent policy.

The key is not to make every memory write immediate. Immediate writes are responsive but noisy. Delayed writes are cleaner but can miss constraints needed for the next step. Strong systems explicitly separate immediate writes, post-session summaries, and disposable observations.
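One way to picture that separation is a small routing function that decides, per observation, whether to write now, defer to the post-session summary, or discard. The labels (`preference`, `project_rule`, `event`, `tool_output`), the `confirmed` flag, and the class names are hypothetical, and the rules are deliberately simplistic. A real policy would likely be richer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class WriteMode(Enum):
    IMMEDIATE = "immediate"        # write now: may be needed for the next step
    POST_SESSION = "post_session"  # hold for the end-of-session summary
    DISCARD = "discard"            # disposable observation


@dataclass
class Observation:
    kind: str  # hypothetical labels: "preference", "project_rule", "event", "tool_output"
    text: str
    confirmed: bool = False


def route(obs: Observation) -> WriteMode:
    durable = obs.kind in {"preference", "project_rule"}
    if durable and obs.confirmed:
        return WriteMode.IMMEDIATE
    if durable or obs.kind == "event":
        return WriteMode.POST_SESSION
    return WriteMode.DISCARD


@dataclass
class MemoryWriter:
    store: List[str] = field(default_factory=list)    # persistent memory
    pending: List[str] = field(default_factory=list)  # waits for session end

    def observe(self, obs: Observation) -> WriteMode:
        mode = route(obs)
        if mode is WriteMode.IMMEDIATE:
            self.store.append(obs.text)
        elif mode is WriteMode.POST_SESSION:
            self.pending.append(obs.text)
        return mode

    def end_session(self) -> None:
        # Stand-in for asynchronous summarization: join pending items into one entry.
        if self.pending:
            self.store.append("session summary: " + "; ".join(self.pending))
            self.pending.clear()


writer = MemoryWriter()
writer.observe(Observation("preference", "reply in Spanish", confirmed=True))
writer.observe(Observation("event", "asked about invoice 12"))
writer.observe(Observation("tool_output", "HTTP 200"))
writer.end_session()
print(writer.store)
```

Keeping `route` as a pure function makes the write policy testable on its own, separately from whatever store sits behind it.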

Graph Memory

Knowledge-graph style memory is attracting attention because it handles relations and change better

Graph memory matters because semantic similarity is only part of the problem. Vector retrieval is good at finding related text, but weaker at representing who reports to whom, which approvals apply to which workflow, and which fact supersedes an older one. That is why Mem0 highlights graph memory and why newer work such as A-MEM, MemoryOS, and MemInsight puts more emphasis on restructuring and relation-aware recall.

Where it helps

It is strongest when people, projects, deadlines, dependencies, and approval paths all matter at once. Support, sales, procurement, and research workflows are common examples.

Why it can be better

It can retrieve the right connection rather than the closest paragraph. It also makes partial updates easier when one fact changes but the broader structure remains valid.

Why it is harder

It needs entity extraction, normalization, deduplication, and some form of schema discipline. The setup cost is higher than a flat vector store.

How to adopt it safely

Do not graph everything first. Start with areas where relations are explicit, such as approval chains, customer-account links, or project dependency state.
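As a rough illustration of why explicit relations help with change, the sketch below stores facts as timestamped edges and marks an older edge as superseded when a single-valued relation is updated. It is not Mem0's graph memory or any paper's method. `GraphMemory`, `Edge`, and the `reports_to` relation are hypothetical, and entity extraction and deduplication are assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: int                      # logical timestamp
    superseded_at: Optional[int] = None  # set when a newer fact replaces this one


class GraphMemory:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def assert_fact(self, subject: str, relation: str, obj: str, at: int) -> None:
        # Treat the relation as single-valued: a new fact supersedes the active one.
        for e in self.edges:
            if e.subject == subject and e.relation == relation and e.superseded_at is None:
                e.superseded_at = at
        self.edges.append(Edge(subject, relation, obj, at))

    def current(self, subject: str, relation: str) -> List[str]:
        return [e.obj for e in self.edges
                if e.subject == subject and e.relation == relation
                and e.superseded_at is None]

    def history(self, subject: str, relation: str) -> List[Tuple[str, int, Optional[int]]]:
        return [(e.obj, e.valid_from, e.superseded_at) for e in self.edges
                if e.subject == subject and e.relation == relation]

    def chain(self, start: str, relation: str) -> List[str]:
        """Follow a relation hop by hop, e.g. an approval or reporting chain."""
        path, node, seen = [], start, {start}
        while True:
            nxt = self.current(node, relation)
            if not nxt or nxt[0] in seen:  # stop at the end or on a cycle
                return path
            node = nxt[0]
            path.append(node)
            seen.add(node)


g = GraphMemory()
g.assert_fact("ana", "reports_to", "bo", at=1)
g.assert_fact("bo", "reports_to", "cy", at=1)
g.assert_fact("ana", "reports_to", "dee", at=2)  # reorganization: supersedes "bo"
g.assert_fact("dee", "reports_to", "cy", at=2)
print(g.chain("ana", "reports_to"))
print(g.history("ana", "reports_to"))
```

Only one edge changed when the reporting line moved, and the old edge is still available for audit. A flat similarity search over text has no equivalent notion of which statement is current.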

Blueprints

The best near-term implementations combine multiple memory roles instead of one universal store

Assistant

Personal assistant: profile memory plus session summaries

Keep user preferences, recurring habits, and standing constraints in a profile store. Keep live state in the session. After the interaction, promote only what will matter next time. This is close to the public split visible in OpenAI's saved memories versus chat history and AWS session summary retention.

Coding

Coding agent: repo memory, file memory, and checkpoints

Keep coding rules and review norms in repo-local memory files, hold active task state in checkpoints, and promote only durable learnings into file-backed memory. This makes diffs and audits easier while keeping the live prompt lean.

Support

Support workflow: customer memory, policy memory, and approval graph

Customer history belongs in user memory, refund or legal rules belong in shared policy memory, and exception approval paths fit naturally into graph memory. That keeps personalization and compliance from colliding in one retrieval result.

Research

Research agent: source cache, evidence memory, and contradiction log

Raw documents belong in a cache, validated claims belong in evidence memory, and disagreements belong in a contradiction log. For long investigations, that often works better than one flat retrieval layer.

Operational Risks

Memory systems usually break first on update conflicts and staleness, not on retrieval recall alone

Multiple agents rewrite the same shared memory

As Letta's docs imply, append and targeted replace are relatively safe, but simultaneous full rewrites push systems toward last-writer-wins behavior.

Old memory overrides newer facts

The more long-term memory grows, the more important timestamps, source tracking, and expiry become. Without them, durable memory can become confidently wrong memory.

Shared and personal memory are mixed together

Once user-specific preferences and organizational rules live in one namespace, correctness suffers even if retrieval looks convenient. Scope separation is a correctness requirement, not just a performance trick.

No one evaluates the write path

Many teams test retrieval quality but not what the system actually writes into memory. In practice, some of the most serious failures originate in bad writes rather than bad reads.
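The update-conflict risk can be made concrete with a versioned shared block. This is a single-process sketch under stated assumptions, not Letta's implementation. `SharedBlock`, `StaleWriteError`, and the method names are hypothetical, and a production system would need the version check enforced atomically by the backing store.

```python
from dataclasses import dataclass


class StaleWriteError(Exception):
    """Raised when a write is based on an outdated view of the block."""


@dataclass
class SharedBlock:
    text: str = ""
    version: int = 0

    def append(self, line: str) -> int:
        # Appends do not depend on what the writer last read, so they always apply.
        self.text = f"{self.text}\n{line}".strip()
        self.version += 1
        return self.version

    def replace(self, old: str, new: str) -> int:
        # Targeted replace only applies if the text it expects is still present.
        if old not in self.text:
            raise StaleWriteError(f"text to replace not found: {old!r}")
        self.text = self.text.replace(old, new, 1)
        self.version += 1
        return self.version

    def rewrite(self, new_text: str, expected_version: int) -> int:
        # Full rewrites must prove they saw the latest version (compare-and-set),
        # otherwise the last writer would silently erase other agents' updates.
        if expected_version != self.version:
            raise StaleWriteError(
                f"block is at version {self.version}, writer saw {expected_version}")
        self.text = new_text
        self.version += 1
        return self.version


block = SharedBlock("deploy freeze: friday")
seen_by_agent_a = block.version   # agent A reads the block
block.append("owner: ops")        # agent B appends in the meantime
try:
    block.rewrite("no freeze this week", expected_version=seen_by_agent_a)
except StaleWriteError as err:
    print("rejected:", err)
print(block.text)
```

Rejecting the stale rewrite forces agent A to re-read before overwriting, which turns silent last-writer-wins loss into a visible, retryable error.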

Unresolved Problems

The core weakness today is not failure to remember, but immature policies for what to keep and what to forget

The most important pattern in the research is that memory has moved past the binary question of whether it exists. The harder question is now how it is updated. LoCoMo and LongMemEval both suggest that long-term interaction, updated facts, and multi-session reasoning remain unstable even for strong assistants and long-context models. That means memory quality cannot be solved by retrieval volume alone. Systems need explicit policies for what gets written, promoted, expired, and rewritten.

The write path is less mature than the read path

Many systems can retrieve memory, but fewer can decide reliably what deserves to be written. That leads either to noisy memory or to brittle under-memory.

Old memory and new facts still collide

Without timestamps, source tracking, and expiry, long-term memory often returns answers that sound right but are no longer current. This is a structural weakness of flat retrieval.

Shared memory creates update conflicts

When multiple agents rewrite the same shared block or store, duplication, loss, and last-writer-wins behavior become likely. Letta's warnings around heavy rewrites point directly at this issue.

Mixed scopes damage correctness

Once personal preferences, organizational policies, and project state live in one namespace, the system may retrieve something relevant but still use the wrong premise.

Memory evaluation is still thin

Most teams test retrieval quality more than memory write quality. In practice, some of the largest failures come from bad writes that persist and spread.
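A write-path check does not have to be elaborate to be useful. The sketch below scores what a system wrote against a hand-labeled set of items that should have been written. The function name `write_quality` and the exact-string matching are assumptions for illustration. Real memories are usually paraphrased, so a practical harness would need fuzzier matching.

```python
from typing import Dict, Iterable, Set


def write_quality(written: Iterable[str], should_write: Set[str]) -> Dict[str, float]:
    """Precision: how much of what was written deserved it.
    Recall: how much of what deserved writing was actually written."""
    written_set = set(written)
    true_positives = len(written_set & should_write)
    precision = true_positives / len(written_set) if written_set else 1.0
    recall = true_positives / len(should_write) if should_write else 1.0
    return {"precision": precision, "recall": recall}


# Hypothetical labeled session: one noisy write and one missed constraint.
written = ["prefers email", "asked about weather"]
expected = {"prefers email", "budget cap 500"}
print(write_quality(written, expected))
```

Low precision corresponds to the noisy-memory failure described above, and low recall to under-memory, so tracking both separates the two problems instead of folding them into retrieval metrics.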

Vendor Responses

What differs across vendors is not whether memory exists, but which part of the problem each stack tries to solve

OpenAI and Anthropic separate memory types

OpenAI separates saved memories from chat history, while Anthropic separates CLAUDE.md from auto memory. Both moves reduce the pressure to treat raw conversation as the only memory substrate.

Google ADK and AWS turn compression into a system feature

ADK exposes context compaction with thresholds and overlap, while AWS Bedrock runs asynchronous summarization after sessions end. Both are direct responses to cost and latency pressure.

LangGraph and Deep Agents emphasize scope and permissions

Agent-scoped, user-scoped, and organization-level memory, plus read-only versus writable memory and background consolidation, are practical responses to shared-state and write-quality problems.

Letta controls shared-memory contention at the block level

Its split between insert, replace, and rethink operations, together with the idea of a memory owner, is a concrete answer to concurrent edits in multi-agent systems.

Mem0 and newer papers lean toward graph and reflective updates

Mem0's graph memory and papers such as A-MEM, MemoryOS, Reflective Memory Management, and MemInsight all treat memory as something to restructure and govern, not just to retrieve.

The common thread is clear even if the product shapes differ. Type separation, scope separation, compression, asynchronous updates, and control over shared writes are all becoming first-class design concerns.

Use Cases

Memory becomes essential when work depends on continuity, handoff, or compliance rather than on one-shot generation

Personal assistants need continuity

The value is not generic recall. It is not having to renegotiate preferences, standing constraints, or recurring habits every time.

Coding agents need lower rediscovery cost

Repo rules, prior fixes, danger zones, and unfinished work all benefit from durable memory. This is why file-backed memory is so attractive in coding workflows.

Support and ops need continuity plus compliance

Customer context, policy memory, and approval routes all have to coexist without being confused with each other. That is where layered memory has clear economic value.

Multi-agent systems need reliable handoff

Subagents and specialists need to know what has been done, what remains uncertain, and which constraints are active. Here memory matters more for coordination than for personalization.

Research agents need multi-day evidence continuity

Primary sources, validated claims, contradictions, and open questions all need to persist in different forms. In that setting, memory behaves more like an evidence system than a chat feature.

Direction

The next competition is shifting from whether agents can remember to how well teams can control memory

The likely direction is further layering rather than convergence on one universal memory store. Short-term state, long-term profiles, shared memory, graph relations, file-backed memory, and background consolidation are more likely to be combined by use case than collapsed into one mechanism. The center of differentiation will probably move from retrieval quality alone toward write policy, forgetting policy, and memory auditability.

Vector retrieval will remain, but not as the whole story

It will likely stay as a major long-term memory substrate, but increasingly alongside structured profiles, graph memory, and file-based memory.

Forgetting and consolidation will become core features

In durable systems, the ability to drop or compress the right memories matters as much as the ability to keep them.

Memory will expand from personalization into control-plane infrastructure

Memory is likely to support workflow state, approvals, handoffs, and policy enforcement, not just user convenience.

Evaluation will expand from read quality to write quality

Future benchmarks and production monitoring will likely focus more on what gets written, whether it ages well, and whether it should have been retained at all.

Method Comparison

The most common approach and the most attention-grabbing approach are not exactly the same

Context-only and prompt-pinned memory

CLAUDE.md, system prompts, and pinned memory blocks are still widely used. When curated well they can be highly accurate, but capacity is scarce and updates are expensive.

Vector retrieval is the most widespread long-term memory base

Across LangGraph, Mem0, Letta, Azure, and ADK-style setups, retrieving only what matters remains the most common pattern. It is practical and token-efficient, but weaker on chronology, conflict resolution, and relations.

Structured profiles and scoped state are operationally strong

Keeping preferences and project settings as explicit records or state objects makes updates easier to govern and often reduces false retrieval. This is one reason it shows up so often in production-oriented docs.

Graph memory is the rising accuracy bet

Graph memory makes entities and relations explicit, which helps with who-knows-what, dependency chains, and changing constraints. Mem0 and several recent papers push in this direction, though the implementation cost is higher.

File-backed memory is strong for coding agents

Letta Code's MemFS and Claude Code's project memory show why editable file-based memory matters. In coding workflows, auditability and diffability can matter more than embedding similarity.

Shared memory is powerful but fragile

Once multiple agents write to the same memory, drift, duplication, and stale knowledge become more likely. Read-only versus writable separation and audit traces become important quickly.

Research Signal

The paper trail suggests flat RAG alone is not enough for durable memory quality

The research path is fairly consistent. Generative Agents and MemoryBank showed early that long-term memory works better when events are summarized and reflected on before being stored. MemGPT formalized the idea of moving memory outside the context window, while CoALA framed working, episodic, semantic, and procedural memory as an architecture. LoCoMo and LongMemEval then made the weakness of long-horizon memory visible: even commercial assistants and long-context models degrade noticeably when questions depend on time, updated facts, or multiple sessions. Newer work such as A-MEM, MemoryOS, Reflective Memory Management, and MemInsight increasingly treats memory writing, restructuring, and scheduling as agent policies rather than as passive retrieval. That is the important shift. Memory quality is becoming a systems behavior, not just a retrieval feature. Cross-paper numeric comparisons still need caution, because the benchmarks, base models, and write policies differ.

2023

Summarization and reflection become core memory ideas

Generative Agents, MemoryBank, and MemGPT all point away from replaying raw history and toward compressing and managing memory outside the main prompt.

2024

Benchmarks expose long-term memory weakness

LoCoMo and LongMemEval show that time, multi-session interaction, and updated facts remain hard even for strong models and assistants.

2025-2026

Memory update strategy becomes a first-class research target

A-MEM, MemoryOS, Reflective Memory Management, and MemInsight all move toward explicit policies for what gets written, how it gets reorganized, and when it should be recalled.

Adoption Guidance

The most efficient design today is layered memory with clear scope boundaries

Personal assistants

Separate session state from user profile memory, then promote only important items after the interaction ends. Keeping everything forever increases both token cost and false recall.

Coding agents

Keep project memory in files or repo-local rules, use checkpoints for active state, and keep long-term learnings in searchable logs. Human-readable diffs matter here.

Multi-agent workflows

Keep each agent's working memory separate and treat shared memory as mostly read-only unless there is strong coordination logic. Free-form shared writes are hard to audit and hard to trust.

High-accuracy workflows

Do not rely on vector retrieval alone. Combine structured profiles, graph relations, and deterministic checks before approval or external action, especially when facts can change over time.

Taking the public evidence together, the most adopted implementation pattern is not a standalone vector database. It is a combination of session state + persistent store + selective retrieval + asynchronous consolidation. If shared memory is added, scope and write permissions need to be designed first. Graph memory and file-backed memory are best introduced where accuracy or auditability requirements are already high.

Takeaway

Agent memory is becoming a competition in state and knowledge management, not just in context length

There is no single universal winner yet, but the common design direction is already visible. Short-term execution state is managed in sessions and checkpoints, long-term knowledge is moved into external stores, shared memory is separated by scope, and compression plus updates are handled asynchronously. That layered memory design currently offers the best balance of cost, speed, accuracy, and governance. The frontier question is no longer whether to add memory. It is how to make graph memory, file-backed memory, and reflective memory updates operationally reliable on top of the layered base.

Published evidence

Public pages list only evidence that can be verified as official documentation or papers.

official

OpenAI: Memory and new controls for ChatGPT

https://openai.com/index/memory-and-new-controls-for-chatgpt/

official

Anthropic docs: Claude Code memory

https://docs.anthropic.com/en/docs/claude-code/memory

official

Anthropic docs: Context windows

https://docs.anthropic.com/en/docs/build-with-claude/context-windows

official

Anthropic docs: Prompt caching

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

official

Google ADK: Sessions

https://adk.dev/sessions/

official

Google ADK: State

https://adk.dev/sessions/state/

official

Google ADK: Memory

https://adk.dev/sessions/memory/

official

Google ADK: Context caching

https://adk.dev/context/caching/

official

Google ADK: Context compaction

https://adk.dev/context/compaction/

official

LangGraph: Add and manage memory

https://docs.langchain.com/oss/python/langgraph/add-memory

official

LangGraph: Persistence

https://docs.langchain.com/oss/python/langgraph/persistence

official

Deep Agents: Memory

https://docs.langchain.com/oss/python/deepagents/memory

official

Letta docs: Stateful agents

https://docs.letta.com/guides/core-concepts/stateful-agents/

official

Letta docs: Agent memory

https://docs.letta.com/guides/agents/memory

official

Letta docs: Memory blocks

https://docs.letta.com/guides/core-concepts/memory/memory-blocks/

official

Letta docs: Shared memory

https://docs.letta.com/guides/core-concepts/memory/shared-memory/

official

Letta Code: Memory

https://docs.letta.com/letta-code/memory/

official

Mem0 Platform overview

https://docs.mem0.ai/platform/overview

official

Mem0: Memory types

https://docs.mem0.ai/core-concepts/memory-types

official

Mem0: Graph memory

https://docs.mem0.ai/platform/features/graph-memory

official

Microsoft Azure AI Foundry: How to use memory

https://learn.microsoft.com/en-us/azure/foundry/agents/how-to/memory-usage

official

Amazon Bedrock: Retain conversational context across multiple sessions using memory

https://docs.aws.amazon.com/bedrock/latest/userguide/agents-memory.html

official

Amazon Bedrock: Enable agent memory

https://docs.aws.amazon.com/bedrock/latest/userguide/agents-enable-memory.html

official

Amazon Bedrock AgentCore: Memory

https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/memory.html

paper

MemGPT: Towards LLMs as Operating Systems

https://arxiv.org/abs/2310.08560

paper

Generative Agents: Interactive Simulacra of Human Behavior

https://arxiv.org/abs/2304.03442

paper

MemoryBank: Enhancing Large Language Models with Long-Term Memory

https://arxiv.org/abs/2305.10250

paper

CoALA: Cognitive Architectures for Language Agents

https://arxiv.org/abs/2309.02427

paper

LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents

https://aclanthology.org/2024.acl-long.747.pdf

paper

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

https://arxiv.org/abs/2410.10813

paper

A-MEM: Agentic Memory for LLM Agents

https://arxiv.org/abs/2502.12110

paper

Memory OS of AI Agent

https://aclanthology.org/2025.acl-long.491.pdf

paper

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

https://aclanthology.org/2025.acl-long.413.pdf

paper

MemInsight: Autonomous Memory Augmentation for LLM Agents

https://aclanthology.org/2025.emnlp-main.1683.pdf
