The category is on fire and it's mostly a vocabulary problem
In the past 18 months, "AI agent" has gone from a research-paper term to a marketing word that means whatever the speaker wants it to mean. It's now applied to:
- A ChatGPT Custom GPT that takes a user prompt and returns text
- A scripted workflow that chains 4 LLM calls
- A LangChain-powered demo that hits two APIs and a vector store
- Cursor's "agent mode," which actually does autonomous tool use
- Claude Code, which does the same in a terminal
- Devin, which approximates a junior engineer
- A Zapier integration that calls an LLM in step 3 of 7
- A "shopping assistant" that's a glorified product-search wrapper
These are wildly different things. Calling them all "agents" is like calling a thermostat, a self-driving car, and a chess engine all "AI" — technically defensible at the broadest level, useless for actually distinguishing what's in front of you.
This post is an attempt to fix the vocabulary. Not because I think semantics matter for their own sake, but because the conflation lets bad products hide behind good products' marketing. I'm going to define "agent" in a tight, falsifiable way, then score popular tools against it. I'll be honest about where VibeKit lands.
A three-property definition
An AI agent, to be worth calling an agent, has to do three things:
- Take goal-directed actions on the world — not just text out, but actual side effects: file edits, API calls, deploys, transactions.
- Loop autonomously — when its first action doesn't complete the goal, it observes the result and decides what to do next, without being re-prompted by a human for each step.
- Carry state across turns within a task — what it learns mid-task informs subsequent decisions in that same task. (Not "remember me across sessions" — that's a separate, harder problem. Just: doesn't reset every time it acts.)
That's it. Three properties: action, autonomy, state. I'll call them A/A/S.
A/A/S is not a high bar. It's a floor. Most "AI agent" products fail at one of the three. The ones that pass aren't necessarily good products — they're just minimally entitled to the name.
A few things this definition deliberately doesn't require:
- Doesn't require multi-agent coordination. A single agent can be a real agent.
- Doesn't require tools beyond a small set. Even one tool (e.g. file editing) is enough if used autonomously.
- Doesn't require persistence across sessions. Cross-session memory is a useful add-on, but a conversation that runs for 20 minutes and gets a non-trivial coding task done end-to-end is still agent-like, even if state evaporates after.
- Doesn't require "intelligence." An agent can be a dumb agent. The category is structural, not capability-based.
What it does require: actually closing a loop. That's the load-bearing part. Without autonomy, you have a chatbot with extra steps. Without state, you have a stateless function. Without action, you have an LLM.
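The three properties fit in a few lines of Python. This is a hedged sketch, not any product's actual loop: `decide`, `act`, and `observe` are placeholder names for whatever a real agent wires to model calls, file edits, and shell commands.

```python
# A minimal A/A/S loop. All names are illustrative placeholders,
# not a real product's API.

def run_agent(goal, decide, act, observe, max_steps=10):
    state = {"goal": goal, "history": []}          # State: survives across steps
    for _ in range(max_steps):
        action = decide(state)                     # Autonomy: the system, not the
        if action is None:                         # user, picks the next step
            break                                  # (None signals the goal is met)
        result = observe(act(action))              # Action: a real side effect,
        state["history"].append((action, result))  # whose result feeds the next decision
    return state                                   # bounded budget: stop, don't spin forever
```

Strip any one property and the category collapses: drop the loop and `decide` runs once (a chatbot with tools); drop `state` and every step is a stateless function; drop `act` and it's just text out.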
Scoring the popular tools
Let me run this against everything I see called an "agent" in 2026.
Cursor agent mode
- Action: ✅ Edits files, runs terminal commands, uses tools.
- Autonomy: ✅ Loops on its own — runs a command, reads output, decides next step.
- State: ✅ Within a single agent session, tracks what it's tried.
- Verdict: Real agent. (VibeKit hosts Cursor-style workflows if you want the loop running on our infra instead of your laptop.)
Claude Code
- Action: ✅ Same as above — file edits, bash, fetch.
- Autonomy: ✅ Real loop with self-correction.
- State: ✅ Session-scoped state held in conversation.
- Verdict: Real agent. (We also offer Claude Code hosting — same loop, runs in the cloud against your repos.)
Devin
- Action: ✅ Files, browser, terminal.
- Autonomy: ✅ Famously runs for hours unattended on tasks.
- State: ✅ Carries state across long-running tasks.
- Verdict: Real agent. (And one of the more ambitious ones.)
Aider
- Action: ✅ File edits via git.
- Autonomy: ✅ Edits → tests → re-edits in a loop.
- State: ✅ Per-conversation.
- Verdict: Real agent.
ChatGPT (default)
- Action: ❌ No — text out only. ("Code Interpreter" is a separate execution sandbox; it doesn't generalize to file edits or external APIs in your environment.)
- Autonomy: ⚠️ Can call tools when given them, but the consumer product mostly doesn't loop.
- State: ❌ Nothing beyond the conversation transcript; no action results feed back into a task.
- Verdict: Chatbot with tool access. Not an agent under A/A/S, despite OpenAI's marketing.
ChatGPT Custom GPTs
- Action: ⚠️ Limited — function calls into your API, but no environment access.
- Autonomy: ❌ Single-turn input/output. No loop.
- State: ❌ No task-scoped state; each invocation starts fresh.
- Verdict: Glorified prompt template. The "agent" framing is misleading.
LangChain "agents" (the term in their docs)
- Action: ✅ Tool use available.
- Autonomy: ⚠️ The framework supports loops; many implementations are single-pass.
- State: ⚠️ Optional — depends on whether the developer wired in memory.
- Verdict: Framework, not an agent. A LangChain "agent" is an agent only insofar as the developer made it one. Calling the framework itself an agent is a category error.
Zapier / Make / n8n with LLM steps
- Action: ✅ Yes, in the workflow.
- Autonomy: ❌ The LLM doesn't loop — it's one step in a deterministic pipeline.
- State: ❌ The LLM is invoked stateless.
- Verdict: Automation pipeline with an LLM in it. Not an agent.
"Shopping assistant," "research assistant," "writing assistant" (LLM wrappers)
- Action: ❌ Usually just retrieval + summarization.
- Autonomy: ❌ Single-turn or guided.
- State: ❌ Per-prompt.
- Verdict: Chatbot with a costume. The "assistant" framing is more honest than "agent."
Lovable / Bolt / v0 (vibe-coding tools)
- Action: ✅ Generates code; some deploy.
- Autonomy: ⚠️ Mostly one-shot; the user re-prompts for each iteration.
- State: ⚠️ Carried across the project, yes; but no state-driven loop within a single generation.
- Verdict: Code generators with iterative interfaces. Borderline. The agent loop is the user re-prompting, not the system. I'd call them "AI builders" rather than "AI agents" — and most of their own marketing now uses "builder" or "platform" anyway.
VibeKit
This is the part I'm obligated to be honest about.
- Action: ✅ File edits, deploys, runs commands inside a real container.
- Autonomy: ✅ Loops within a turn (reads files, edits, deploys, verifies, retries on failure).
- State: ✅ Both within a session and across sessions (via the AGENTS.md and MEMORY.md pattern).
- Verdict: Real agent. The autonomy is real and the state is persistent.
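The cross-session half of that state claim can be sketched as nothing more exotic than an append-only notes file the agent reads at startup. This is an assumption-laden toy: the file name follows the MEMORY.md pattern mentioned above, but the helpers and format here are invented for illustration, not VibeKit's actual implementation.

```python
from pathlib import Path

def recall(path=Path("MEMORY.md")):
    """Load prior-session notes, if any, at the start of a task."""
    return path.read_text() if path.exists() else ""

def remember(note, path=Path("MEMORY.md")):
    """Append a durable note so the next session starts smarter."""
    with path.open("a") as f:
        f.write(f"- {note}\n")
```

The point of the pattern is that the state outlives the conversation: a note written in one session ("deploys go through the release script") is in the context the next session starts from.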
But also being honest about where VibeKit's "agent" claim is weaker than it could be:
- The agent doesn't initiate work without prompting. It runs when you message it. It doesn't notice your CI is failing and proactively investigate. (We have heartbeats, which approximate this for some scenarios, but it's not the default.)
- The "agentic" surface is mostly single-app scoped. The agent for app A doesn't talk to the agent for app B. Cross-app coordination would be more agent-like.
- The autonomy budget is bounded — the agent will give up rather than spend hours in a loop, which is correct UX but not the unbounded autonomy of, say, Devin.
I'd call VibeKit a "real but constrained" agent. If Devin is at the upper end of the agent spectrum and Custom GPTs are below it entirely, VibeKit sits comfortably above the threshold but well short of "AI software engineer."
Why this taxonomy matters
If you stop reading here, fine. The details of which tool scores how don't matter much. What matters is having a way to read marketing copy with calibrated skepticism.
When a launch post says "we're building agents" — what they actually mean might be:
- Real autonomous agent: rare; usually obvious from the demo (the thing makes mistakes, sees them, recovers)
- Workflow with an LLM: extremely common; "agent" is the marketing word for what used to be called "automation"
- Chatbot with tools: common; the chatbot is the loop, but the user is doing all the agency
- Wishful thinking: the team is hoping their thing acts agent-like; the demo is curated; production is messier
The harm in this conflation isn't pedantic. The underlying engineering for each category is wildly different: a real agent has to handle infinite loops, runaway costs, partial failures mid-task, unbounded state growth, and a dozen other failure modes that a chatbot-with-tools never sees. Buying a "chatbot with tools" when you thought you were getting a "real agent," or vice versa, produces predictable disappointment.
How to test in 30 seconds
If you're evaluating a product that calls itself an agent, do this:
- Give it a task with a step that will fail on the first attempt. ("Find the broken link in this README," phrased so the link can only be found by actually checking the file, not by pattern-matching the prompt.)
- Watch what it does after the first action returns nothing useful. Does it try a different approach? Or does it return "I don't know"?
That single test separates real agents from chatbots-with-tools. Real agents observe the failure, form a new hypothesis, try again. Chatbots return their best guess and stop.
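The distinction is mechanical enough to show in toy code. Hypothetical names throughout; the "approaches" here stand in for whatever search, grep, or file-read strategies the product actually has.

```python
def chatbot_with_tools(task, tool):
    return tool(task)                 # one pass: the first guess is the answer

def agent(task, approaches, budget=3):
    for tool in approaches[:budget]:
        result = tool(task)           # act
        if result is not None:        # observe: did that approach pan out?
            return result             # recovered from the earlier failure
    return None                       # blocked, and says so
```

A real product's loop is far messier, but the shape is the same: the agent's final answer depends on observed failures along the way; the chatbot's doesn't.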
I run this test on every "agent" product I evaluate. Roughly half pass. The other half — including some very expensive products from companies with very loud marketing — return their first guess and call it done.
The takeaway
"Agent" became a marketing word in 2024–2025 the same way "AI" became one in 2017. It will keep being used loosely; you can't unring that bell. What you can do is build a private vocabulary that distinguishes:
- Agent: takes actions, loops autonomously, holds state. Failure mode is unpredictability.
- Chatbot-with-tools: responds to user actions, doesn't loop, low or no state. Failure mode is being unhelpful.
- Workflow: deterministic pipeline with LLM steps. Failure mode is brittleness when inputs deviate.
- Code generator: produces code in response to prompts; iteration is human-driven. Failure mode is needing the user to keep prompting.
Use the right word for the right product. When a vendor blurs the categories, that's information. Sometimes they're confused; sometimes they're hiding limitations behind a stronger word. Both worth noticing.
If you want to see VibeKit's agent loop in action, start a free app with BYOK and give it the failure-test from above. The agent runs in a real container, hits real failures, and either recovers or tells you what's blocking it. That's the floor of what an agent should do.
Read the architecture details: How AGENTS.md works covers the persistent-memory layer that keeps state across sessions. Persistent AI agent is the product overview.
For new posts, follow @609.sol on X.