AI Agent Autonomy Levels: Is Your Agent L1 or L5?
Why We Need Tiers
“AI Agent” is everywhere now, but two products carrying the same label can range from “just responds to messages” to “completes work fully autonomously”: a night-and-day difference.
Without tiers:
- can’t evaluate competitors’ real capabilities
- can’t position your own Agent
- can’t know what can be automated vs what needs human oversight
This article borrows from autonomous driving’s tier framework to assess AI Agent capabilities.
Tier Framework
L0: Tool Call
Capability: LLM generates text, tools execute operations.
```python
# L0 Agent
def agent(user_input):
    response = llm.chat(user_input)  # pure chat
    return response

# Characteristics: the LLM only generates text; tools are deterministic execution
# Examples: Copilot Chat, simple chatbots
```

L1: Single-step Tool Orchestration
Capability: LLM decides which tool to call based on user input.
```python
# L1 Agent
def agent(user_input):
    intent = llm.classify_intent(user_input)  # intent classification
    if intent == "github_pr":
        return github_api.create_pr(...)
    elif intent == "code_review":
        return code_review_tool.analyze(...)

# tools are preset; the LLM only routes
```

L2: Multi-step Tool Chain Orchestration
Capability: LLM autonomously orchestrates multi-step tool chains.
```python
# L2 Agent
def agent(task):
    plan = llm.plan(task)  # LLM generates a plan
    for step in plan:
        result = execute_tool(step)  # execute in sequence
        if needs_feedback(result):
            plan = llm.adjust_plan(plan, result)  # dynamically adjust
    return result
```

Examples: Claude Code, Cursor Agent.
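The pseudocode above can be made concrete. Below is a minimal runnable sketch of the L2 plan-execute-adjust loop; the planner, tools, and feedback check are stubs standing in for real LLM and tool calls, and every name here (`stub_plan`, `TOOLS`, `stub_adjust`) is illustrative, not a real API:

```python
# Runnable sketch of an L2 plan-execute-adjust loop with stubbed components.

state = {"fixed": False}  # toy world state the tools act on

TOOLS = {
    "read_file": lambda arg: f"contents of {arg}",
    "run_tests": lambda arg: "all passed" if state["fixed"] else "1 failed",
    "fix_code": lambda arg: state.update(fixed=True) or "patched",
}

def stub_plan(task):
    # A real L2 agent would ask the LLM for this plan.
    return [("read_file", task), ("run_tests", task)]

def needs_feedback(result):
    return "failed" in result

def stub_adjust(remaining, result):
    # A real agent would feed the failure back to the LLM for a new plan.
    return [("fix_code", result), ("run_tests", result)]

def agent(task, max_steps=10):
    plan = stub_plan(task)
    result, steps = None, 0
    while plan and steps < max_steps:  # step cap guards against runaway loops
        tool, arg = plan.pop(0)
        result = TOOLS[tool](arg)
        steps += 1
        if needs_feedback(result):
            plan = stub_adjust(plan, result)
    return result
```

Running `agent("main.py")` triggers a failed test run, after which the plan is dynamically replaced with a fix-then-retest chain; the step cap is the kind of guardrail real L2 agents need.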
L3: Stateful Autonomy
Capability: Agent has memory, maintains state across conversations.
```python
# L3 Agent
class Agent:
    def __init__(self):
        self.memory = Memory()  # persistent memory
        self.tools = [...]

    def run(self, task):
        context = self.memory.get_relevant(task)
        plan = llm.plan(task, context=context)
        result = self.execute(plan)
        self.memory.add(task, result)  # remember
        return result
```

L4: Self-evaluating
Capability: Agent evaluates its own output quality and retries if unsatisfied.
```python
# L4 Agent
def agent(task):
    plan = llm.plan(task)
    result = execute(plan)
    # self-evaluation
    quality = evaluator.score(result, task)
    if quality < threshold:
        result = retry(task)  # redo (with bounded attempts in practice)
    return result
```

L5: Fully Autonomous
Capability: Agent can complete complex multi-day tasks without human supervision.
```python
# L5 Agent (doesn't exist yet)
# Characteristics:
# - self-learning
# - cross-system coordination
# - long-term planning
# - proactively discovering and fixing issues
```

Representative Products by Tier
| Tier | Products | Autonomy |
|---|---|---|
| L0 | Copilot Chat | text generation only |
| L1 | IFTTT AI, simple bots | rule-based routing |
| L2 | Claude Code, Cursor Agent | multi-step orchestration |
| L3 | OpenClaw | stateful, multi-channel |
| L4 | Devin | self-evaluation, retry |
| L5 | Doesn’t exist | fully autonomous |
How to Evaluate Your Agent
Ask these questions:
1. Can the Agent orchestrate multi-step tool chains on its own?
→ No: L0-L1
→ Yes: L2+
2. Can the Agent remember context across conversations?
→ No: L0-L2
→ Yes: L3+
3. Can the Agent evaluate output quality and retry?
→ No: L2-L3
→ Yes: L4+
4. Can the Agent autonomously plan tasks over 10 steps without supervision?
→ No: L4 at most
→ Yes: approaching L5

Engineering Challenges by Tier
L0-L1: Simple
Main challenges are tool definition and intent classification.
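To make the intent-classification problem concrete, here is a toy baseline that routes by keyword overlap. This is an illustrative sketch only (a production L1 agent would use an LLM or a trained classifier), and `INTENT_KEYWORDS` is a made-up name:

```python
# Toy intent classifier for an L1 router: score each intent by keyword
# overlap with the user input and route to the highest-scoring one.
INTENT_KEYWORDS = {
    "github_pr": {"pr", "pull", "merge", "branch"},
    "code_review": {"review", "lint", "quality", "bug"},
}

def classify_intent(user_input, default="chat"):
    words = set(user_input.lower().split())
    scores = {
        intent: len(words & keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

Even this crude version exposes the core L1 design questions: how intents are defined, how ties are broken, and what the fallback behavior is when nothing matches.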
L2: Moderate
Challenges:
- tool execution failure handling
- tool chain observability
- execution order optimization

L3: Complex
Challenges:
- memory retrieval relevance
- state consistency
- cross-channel state sync

L4: Very Hard
Challenges:
- how to define evaluation standards
- retry strategy (avoiding infinite loops)
- boundaries of self-repair

Conclusion
Most “AI Agent” products today are actually L2-L3.
True L4 is rare, and L5 doesn’t exist yet. Devin claims L4, but in practice it still needs human supervision.
When building Agent products, first clarify what tier you’re targeting:
- L2 already solves many problems
- L3 needs an additional memory system
- L4 needs a self-evaluation framework
Don’t aim for L5 from the start; it’s unrealistic.
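The tier ladder described above can be condensed into a quick self-assessment helper. This is a sketch: the boolean inputs correspond to the framework's capability questions, and the function name is made up:

```python
def estimate_tier(remembers_context, orchestrates_tools,
                  self_evaluates, plans_long_horizon):
    """Map capability answers to a rough autonomy tier label."""
    if not orchestrates_tools:   # no multi-step tool chains
        return "L0-L1"
    if not remembers_context:    # no persistent memory across conversations
        return "L2"
    if not self_evaluates:       # no quality scoring / retry
        return "L3"
    if not plans_long_horizon:   # no unsupervised long-horizon planning
        return "L4"
    return "L5"
```

For example, an agent that orchestrates tool chains and keeps memory but never scores its own output lands at L3, which matches where most current products sit.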