The question that gets answered wrong
Open any agent-building tutorial and you'll see the same diagram: agent → vector database → retrieval → context. "Persistent memory" usually means "RAG with embeddings." It's a real layer, but it's the wrong default for most of what users call memory.
The actual question is: when the conversation ends and the user comes back tomorrow, what should the agent remember?
That has at least three answers and they're not interchangeable.
We run VibeKit — an AI agent platform that gives each user a persistent coding agent with its own workspace, file history, environment, and conversational state. We had to figure this out for production. Here's the model that worked, the model we tried first and replaced, and how to know which layer your use case actually wants.
Layer 1 — Workspace memory (the agent's filesystem)
This is the layer that gets skipped most often, and it's almost always the one the user means.
When a user says "the agent should remember my project," they mean the agent should pick up tomorrow with the same files, the same package.json, the same database schema, the same commits-in-progress that they left. They don't mean "the agent should recall the gist of yesterday's chat" — they mean the literal artifacts of the work should still be there.
On VibeKit each app has a workspace dir on EFS (AWS's network filesystem) mounted into the agent's Fargate container at /home/agent/workspace. The workspace persists across container restarts, agent reprovisioning, model swaps, even cross-Fargate-task moves. Bigger than memory in the colloquial sense: it's the agent's entire working environment, frozen between sessions.
What this layer is good for:
- Code projects (the obvious one)
- Anything where the work has a deliverable file — a doc the agent is editing, a CSV being analyzed, a generated image
- Persistent install state (
node_modules, virtual envs, downloaded models) - The agent's own scratchpad (a
notes/dir the agent writes to and re-reads next session)
What it's not good for:
- Conversational continuity ("you said yesterday that…")
- Cross-user knowledge ("most users in your industry do X")
- Generic facts that don't have a file representation
If you're building an agent and you don't have this layer, you don't have persistence — you have a chatbot. Adding it is not the same as adding a vector DB. It's mounting a real filesystem (or S3, or a Postgres jsonb column, or whatever — the storage medium matters less than the fact that the artifacts persist).
A practical detail people miss: the workspace also persists the state of the environment. Installed dependencies, configured env vars, partially-applied migrations. When the agent comes back, it shouldn't need to redo any of this. On VibeKit we use EFS Access Points with per-app POSIX UIDs so the ownership stays correct across container restarts — solved a real bug where ownership drift was forcing reinstalls on every wake. The fix isn't sexy. It's just unglamorous filesystem plumbing that keeps the persistence promise.
Layer 2 — Session memory (the chat transcript)
This is the layer most tutorials show. It's the one where the agent recalls "yesterday we were discussing X."
The naive implementation is: dump every message into a SQL table keyed by user+session, replay them on next turn. This works for short conversations. It breaks the moment the agent has been talking to a user for two weeks and you're trying to stuff 400 messages into a 200K-token context window.
The actual model on VibeKit is closer to: truncate by recency, summarize by significance.
- The last N turns go in verbatim (we use N=15 in practice — enough for the immediate conversational thread).
- Earlier turns get periodically summarized into a "session digest" that the agent reads as system context.
- Significant decisions get pulled out and pinned (
AGENTS.mdis one such pinned doc — it's the agent's per-app instruction file, edited deliberately, read on every turn).
The hardest part isn't the storage. It's the cadence of compaction. If you summarize too eagerly, you lose nuance. If you wait too long, you blow context budgets and the agent loses its grip. We compact when context utilization crosses ~70% of the available window, with a deliberate hand-off message ("I'm summarizing our earlier discussion for memory; continuing…"). The user sees the seam, which we've found increases trust more than hiding it.
A small but important point: session memory is per-conversation, workspace memory is per-app. Same user, same VibeKit app, different conversation threads = same workspace, different session transcripts. This split matters because it lets users open a fresh chat ("I'm starting over") without losing their work files. Conflating these layers is a common bug. Don't.
When session memory is the right layer:
- Multi-turn task threads (debugging an issue across days)
- Personality continuity ("Brian prefers concise responses")
- Cross-reference of recent decisions ("you said yesterday that…")
When it's not enough:
- Anything that should survive the user starting a new conversation
- Anything that should be searchable rather than recent-biased
Layer 3 — Tool-tracked memory (the agent's structured state)
This is the layer almost nobody builds explicitly, and it's the one that quietly does the most work.
When the agent calls a tool — a Write to a file, an Edit, a Bash command, a Read of a doc — that action is itself memory. The fact that the agent ran npm install fastify yesterday is recorded in the workspace's package.json. The fact that it created an admin route is recorded in the source file. The fact that it ran a migration is recorded in the database. None of this needs a vector store. The system-of-record IS the memory.
The reason this matters: it makes recovery deterministic. When the agent wakes up tomorrow, it doesn't need to recall "did I install fastify yesterday?" — it can read package.json and know. The state of the world IS the memory; the agent re-derives context from it on demand.
We push this hard on VibeKit. The agent's instruction file (AGENTS.md) says: "always read the current state before deciding the next action." That sounds obvious, but most agent loops we've seen take "what I remember" as the source of truth and only fall back to "what's on disk" when memory contradicts itself. We invert it. The disk is the source of truth. The agent's working memory is a cache of the disk. When they disagree, the disk wins.
This has a great practical consequence: model failures are recoverable. If the agent crashes mid-session, the next agent (same or different model) can read the workspace and pick up. There's no "agent personality" lost in the crash, because there was no personality stored separately from the artifacts.
When tool-tracked memory is the right layer:
- Anything with a natural system-of-record (a database, a file, an API endpoint that returns canonical state)
- Idempotent workflows where re-running a tool is free if state is already correct
- Multi-agent setups where any agent can pick up any task
When it's not:
- Anything that's purely in conversation (no artifact, no record — pure dialogue)
- Anecdotal recall ("you mentioned your boss is named Pat") — there's no on-disk place for that
When the vector DB is the right answer
We haven't talked about RAG yet because it usually isn't what people need first. But it has its place. The vector-DB layer is right when:
- The corpus is genuinely larger than context. If you have 10,000 docs the agent might need to query, you need retrieval. If you have 50, just stuff them in system context with structured prefixing — cheaper, more reliable, no embedding-drift problems.
- Search is semantic, not lookup. "Find me docs related to authentication strategies" needs RAG. "Read
AGENTS.md" doesn't. - The user explicitly wants knowledge-base behavior. The agent recalling from a pinned reference library, citing sources, etc.
Most production agent systems we've looked at have RAG bolted on as the first persistence layer and no workspace layer at all. Then they wonder why users can't pick up yesterday's work. The fix isn't more retrieval. It's adding the filesystem they should have had on day one.
A practical decision flow
Working through which layer your agent needs:
- Does the user create artifacts that should persist? (code, docs, generated images, datasets) → You need Layer 1 — workspace. Not negotiable. This is the floor.
- Will conversations span more than 10–20 turns? → You need Layer 2 — session memory with compaction. A SQL table isn't enough; you need a summarization strategy.
- Does the agent take actions that change canonical state? (database writes, API calls, file edits) → You need Layer 3 — tool-tracked. Make it explicit. Tell the agent (in its instruction file) to re-derive context from the state of the world, not from its own memory.
- Does the agent need to recall from a corpus larger than context? → Now consider RAG. And only now.
Most apps that say "we need persistent memory" need (1), (2), and (3) in that order, and don't need (4) at all. The default conversation in the agent community goes (4) first because it's the most discussed and the most vendor-marketed. It's also usually the least urgent.
Why we wrote this
We were getting search traffic for "persistent memory for ai agents" — and our existing post on the topic was at position 37, deep enough that nobody clicked. The query was right; the answer we'd given was too narrow.
Persistent memory isn't one thing. It's at least three. Getting that taxonomy right is the difference between an agent that seems to remember and an agent that actually picks up where the user left off — same workspace, same conversation thread, same understanding of the world it's been changing.
If you're building something agent-shaped right now, audit which layers you have. There's a very good chance you're missing the most important one and have invested heavily in the least important one. We did. The fix wasn't bigger memory. It was the right kind.
VibeKit
Enter App