Agent memory: short and long term
How agents remember within a session and across sessions — and what it costs.
5 min read
An agent without memory is like a colleague who forgets everything between meetings: you repeat context every time, they make the same mistakes, and they never improve. Memory is what turns a stateless chatbot into an agent that genuinely learns and acts with context — and understanding its costs matters just as much as understanding its benefits.
Short-term memory: the context window is the working memory
Every time the agent calls the model, it sends a list of messages — the conversation history. That is short-term memory: whatever is inside the context window at that moment. There is no magic state in the model; the model is stateless by nature. What looks like "remembering" is simply the history you put in the prompt.
This has a direct consequence: every token in the history costs money and latency. If you accumulate all messages from a long session without criteria, the context grows, cost rises, and — worse — the model starts losing focus on what matters. Researchers call this lost in the middle: information in the middle of a long context tends to be ignored by the model.
The practical solution is to manage history actively. The most common strategies are: keep only the last N messages (sliding window), summarize old history into a compact block, or combine both. The agent does not need everything — it needs enough to act coherently.
Long-term memory: persistence across sessions
Long-term memory is what survives when the session ends. It does not live in the context — it lives in external storage that the agent queries when needed. And here the connection to previous modules becomes explicit: the most effective way to retrieve long-term memory is via semantic search in a vector store, exactly like RAG (Lesson 06).
There are three main flavors you will encounter in practice:
- Episodic: facts about past interactions. "Last week the user said they prefer PDF reports." These are events with temporal context.
- Semantic: general knowledge the agent has accumulated. "The company runs Kubernetes in us-east-1." Facts without a strong temporal anchor.
- Profile (or user memory): stable attributes of the user or entity. Name, preferences, role, decision history. Usually stored in a structured form, not just as vectors.
When the agent starts a new session, it queries the long-term store — using the current context as the query — and injects relevant facts into the prompt. This is memory as retrieval, not memory as global state.
Agent memory layers
Flow showing how the agent accesses and persists memory during and across sessions.
- Janela de Contexto · Context Window
- Histórico da Conversa · Conversation History
- Sumarizador · Summarizer
- Vector Store · Episódica + Semântica
- Perfil do Usuário · User Profile Store
- Loop ReAct · Agent Loop
- Memory Retriever · Busca semântica
In practice, the most common mistake I see in agent implementations is not forgetting to implement memory — it is implementing it without a selection criterion. Storing everything and injecting everything into the context creates two serious problems: token cost grows linearly with history, and the model starts contradicting itself trying to reconcile old facts with new ones. My recommendation: treat memory like a cache — expire what is no longer relevant, prioritize what has high similarity to the current intent, and never inject more than what is necessary for the task at hand.
Trade-offs: the real cost of remembering
Memory is not free. Every design decision has an explicit cost you need to know before going to production.
Tokens and latency: every fact you inject into the context increases the number of input tokens. In models billed per token, this directly impacts cost per session. In models with a limited window, you can simply run out of space.
Noise and hallucination: irrelevant memory in the context is not neutral — it confuses the model. If you inject a fact from six months ago that contradicts the current state, the model may blend both and generate an incorrect answer with high confidence. This is especially critical in episodic memory.
Staleness: profiles and semantic facts age. The user's role changed, the architecture was migrated, the policy was updated. Without an expiration or update strategy, long-term memory becomes misinformation.
Privacy and compliance: user memory is personal data. Before persisting anything across sessions, you need to know where it is stored, who has access, and for how long. This is not an implementation detail — it is an architecture requirement.
The right balance is: actively managed short-term memory (sliding window + summarization), long-term memory retrieved by relevance (not by completeness), and explicit expiration for everything that has a shelf life.
Short-term vs. Long-term memory
| Dimension | Short-term (context) | Long-term (external store) | |
|---|---|---|---|
| Where it lives | LLM context window | Vector store / structured DB | — |
| Duration | Session only | Persists across sessions | — |
| Access cost | Input tokens (direct) | Search latency + injected tokens | — |
| Main risk | Long context = lost in the middle | Stale data = staleness hallucination | — |
| Control strategy | Sliding window + summarization | Relevance retrieval + expiration | — |
Memory types
Tap a card to flip it.
What you need to take from this lesson
Frequently asked questions
Do I need a dedicated vector store for long-term memory?
Not necessarily. For simple profile memory, a relational database or DynamoDB works fine. A vector store makes sense when you need semantic search — for example, retrieving past episodes based on the current conversation context. Start simple and add complexity when the use case justifies it.
How do I prevent the agent from injecting irrelevant memory into the context?
Use a similarity threshold in the vector search — only inject facts above a minimum score. Also limit the number of retrieved items (small top-k) and consider a re-ranking step to prioritize the most relevant ones for the current intent.
Is agent memory the same as RAG?
The mechanism is the same — embeddings + semantic search + context injection. The difference is the data source: RAG retrieves from external documents (knowledge base), long-term memory retrieves from facts generated by the agent and user during previous interactions. They are complementary layers, not substitutes.
Coming up: Bedrock AgentCore Memory
In Module 4 (Lesson 17), you will see how Amazon Bedrock AgentCore implements these memory layers in a managed way — without having to manually orchestrate the vector store, expiration, and context injection. The concepts in this lesson are the map; AgentCore is the territory on AWS.