Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

The AI Architect Track

Module 3 · Agents· Lesson 12/22

Agent memory: short and long term

How agents remember within a session and across sessions — and what it costs.

5 min read

Listen — Fernando's cloned voice

0:008:59

Speed

Download

An agent without memory is like a colleague who forgets everything between meetings: you repeat context every time, they make the same mistakes, and they never improve. Memory is what turns a stateless chatbot into an agent that genuinely learns and acts with context — and understanding its costs matters just as much as understanding its benefits.

Short-term memory: the context window is the working memory

Every time the agent calls the model, it sends a list of messages — the conversation history. That is short-term memory: whatever is inside the context window at that moment. There is no magic state in the model; the model is stateless by nature. What looks like "remembering" is simply the history you put in the prompt.

This has a direct consequence: every token in the history costs money and latency. If you accumulate all messages from a long session without criteria, the context grows, cost rises, and — worse — the model starts losing focus on what matters. Researchers call this lost in the middle: information in the middle of a long context tends to be ignored by the model.

The practical solution is to manage history actively. The most common strategies are: keep only the last N messages (sliding window), summarize old history into a compact block, or combine both. The agent does not need everything — it needs enough to act coherently.

Long-term memory: persistence across sessions

Long-term memory is what survives when the session ends. It does not live in the context — it lives in external storage that the agent queries when needed. And here the connection to previous modules becomes explicit: the most effective way to retrieve long-term memory is via semantic search in a vector store, exactly like RAG (Lesson 06).

There are three main flavors you will encounter in practice:

Episodic: facts about past interactions. "Last week the user said they prefer PDF reports." These are events with temporal context.
Semantic: general knowledge the agent has accumulated. "The company runs Kubernetes in us-east-1." Facts without a strong temporal anchor.
Profile (or user memory): stable attributes of the user or entity. Name, preferences, role, decision history. Usually stored in a structured form, not just as vectors.

When the agent starts a new session, it queries the long-term store — using the current context as the query — and injects relevant facts into the prompt. This is memory as retrieval, not memory as global state.

Agent memory layers

Flow showing how the agent accesses and persists memory during and across sessions.

🟦 Sessão atual — Short-term

Janela de Contexto · Context Window
Histórico da Conversa · Conversation History
Sumarizador · Summarizer

🟧 Persistência — Long-term

Vector Store · Episódica + Semântica
Perfil do Usuário · User Profile Store

🤖 Agente — Agent Core

Loop ReAct · Agent Loop
Memory Retriever · Busca semântica

In practice: too much memory is as dangerous as too little

Senior Solutions Architect

In practice, the most common mistake I see in agent implementations is not forgetting to implement memory — it is implementing it without a selection criterion. Storing everything and injecting everything into the context creates two serious problems: token cost grows linearly with history, and the model starts contradicting itself trying to reconcile old facts with new ones. My recommendation: treat memory like a cache — expire what is no longer relevant, prioritize what has high similarity to the current intent, and never inject more than what is necessary for the task at hand.

Trade-offs: the real cost of remembering

Memory is not free. Every design decision has an explicit cost you need to know before going to production.

Tokens and latency: every fact you inject into the context increases the number of input tokens. In models billed per token, this directly impacts cost per session. In models with a limited window, you can simply run out of space.

Noise and hallucination: irrelevant memory in the context is not neutral — it confuses the model. If you inject a fact from six months ago that contradicts the current state, the model may blend both and generate an incorrect answer with high confidence. This is especially critical in episodic memory.

Staleness: profiles and semantic facts age. The user's role changed, the architecture was migrated, the policy was updated. Without an expiration or update strategy, long-term memory becomes misinformation.

Privacy and compliance: user memory is personal data. Before persisting anything across sessions, you need to know where it is stored, who has access, and for how long. This is not an implementation detail — it is an architecture requirement.

The right balance is: actively managed short-term memory (sliding window + summarization), long-term memory retrieved by relevance (not by completeness), and explicit expiration for everything that has a shelf life.

Short-term vs. Long-term memory

	Dimension	Short-term (context)	Long-term (external store)
Where it lives	LLM context window	Vector store / structured DB	—
Duration	Session only	Persists across sessions	—
Access cost	Input tokens (direct)	Search latency + injected tokens	—
Main risk	Long context = lost in the middle	Stale data = staleness hallucination	—
Control strategy	Sliding window + summarization	Relevance retrieval + expiration	—

Flashcards

Memory types

Tap a card to flip it.

What you need to take from this lesson

Short-term memory = history in the context. The model is stateless; state is your problem.

Long-term memory = retrieval from an external store. It works like RAG applied to agent facts.

There are three types: episodic (past events), semantic (general knowledge), and profile (user attributes).

Too much memory costs tokens, generates noise, and can cause hallucination. Select by relevance, not by volume.

Memory data ages and is personal data — expiration and compliance are architecture requirements.

Frequently asked questions

Do I need a dedicated vector store for long-term memory?

Not necessarily. For simple profile memory, a relational database or DynamoDB works fine. A vector store makes sense when you need semantic search — for example, retrieving past episodes based on the current conversation context. Start simple and add complexity when the use case justifies it.

How do I prevent the agent from injecting irrelevant memory into the context?

Use a similarity threshold in the vector search — only inject facts above a minimum score. Also limit the number of retrieved items (small top-k) and consider a re-ranking step to prioritize the most relevant ones for the current intent.

Is agent memory the same as RAG?

The mechanism is the same — embeddings + semantic search + context injection. The difference is the data source: RAG retrieves from external documents (knowledge base), long-term memory retrieves from facts generated by the agent and user during previous interactions. They are complementary layers, not substitutes.

Coming up: Bedrock AgentCore Memory

In Module 4 (Lesson 17), you will see how Amazon Bedrock AgentCore implements these memory layers in a managed way — without having to manually orchestrate the vector store, expiration, and context injection. The concepts in this lesson are the map; AgentCore is the territory on AWS.

References

Lost in the Middle: How Language Models Use Long Contexts (Stanford / UC Berkeley)Amazon Bedrock AgentCore Memory — AWS Docs Building effective agents — Anthropic LangGraph Memory — LangChain Docs

Previous Next lesson