Contents

AI Agent Autonomy Levels: Is Your Agent L1 or L5?

Why We Need Tiers

“AI Agent” is everywhere now. But two products wearing the same label can range from “just responds to messages” to “does the work autonomously”, a night-and-day difference.

Without tiers:

  • you can’t assess competitors’ real capabilities
  • you can’t position your own Agent
  • you can’t tell what can be automated from what still needs human oversight

This article borrows from autonomous driving’s tier framework to assess AI Agent capabilities.

Tier Framework

L0: Text Generation Only

Capability: the LLM only generates text; any tool execution is deterministic and user-triggered.

# L0 Agent
def agent(user_input):
    response = llm.chat(user_input)  # pure chat
    return response

# Characteristics: LLM only generates text, tools are deterministic execution
# Examples: Copilot Chat, simple chatbots

L1: Single-step Tool Orchestration

Capability: LLM decides which tool to call based on user input.

# L1 Agent
def agent(user_input):
    intent = llm.classify_intent(user_input)  # intent classification
    if intent == "github_pr":
        return github_api.create_pr(...)
    elif intent == "code_review":
        return code_review_tool.analyze(...)
    # tools preset, LLM only routes

L2: Multi-step Tool Chain Orchestration

Capability: LLM autonomously orchestrates multi-step tool chains.

# L2 Agent
def agent(task):
    plan = llm.plan(task)  # LLM generates a plan
    result = None
    while plan:
        step, *plan = plan
        result = execute_tool(step)  # execute steps in sequence
        if needs_feedback(result):
            plan = llm.adjust_plan(plan, result)  # re-plan the remaining steps
    return result  # result of the final step

Examples: Claude Code, Cursor Agent.

L3: Stateful Autonomy

Capability: Agent has memory, maintains state across conversations.

# L3 Agent
class Agent:
    def __init__(self):
        self.memory = Memory()  # persistent memory
        self.tools = [...]
    
    def run(self, task):
        context = self.memory.get_relevant(task)
        plan = llm.plan(task, context=context)
        result = self.execute(plan)
        self.memory.add(task, result)  # remember
        return result

L4: Self-evaluating

Capability: Agent evaluates its own output quality and retries if unsatisfied.

# L4 Agent
def agent(task, max_retries=2):
    plan = llm.plan(task)
    result = execute(plan)

    # self-evaluation
    quality = evaluator.score(result, task)
    if quality < threshold and max_retries > 0:
        result = agent(task, max_retries - 1)  # redo, capped to avoid loops

    return result

L5: Fully Autonomous

Capability: Agent can complete complex multi-day tasks without human supervision.

# L5 Agent (doesn't exist yet)
# Characteristics:
# - self-learning
# - cross-system coordination
# - long-term planning
# - proactively discovering and fixing issues

Representative Products by Tier

Tier  Products                   Autonomy
L0    Copilot Chat               text generation only
L1    IFTTT AI, simple bots      rule-based routing
L2    Claude Code, Cursor Agent  multi-step orchestration
L3    OpenClaw                   stateful, multi-channel
L4    Devin                      self-evaluation, retry
L5    (doesn’t exist)            fully autonomous

How to Evaluate Your Agent

Ask these questions:

1. Can the Agent autonomously orchestrate multi-step tool chains?
   → No: L0-L1
   → Yes: L2+

2. Can the Agent remember context across conversations?
   → No: L2 at most
   → Yes: L3+

3. Can the Agent evaluate its own output quality and retry?
   → No: L2-L3
   → Yes: L4+

4. Can the Agent complete multi-day tasks without human supervision?
   → No: L4 at most
   → Yes: L5
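The checklist above can be sketched as a small helper that walks the capabilities from most to least demanding. The function name and parameters are illustrative, not from any library:

```python
def classify_tier(multi_step: bool, has_memory: bool,
                  self_evaluates: bool, unsupervised_long_tasks: bool) -> str:
    """Map the four yes/no answers to a tier, most demanding first."""
    if unsupervised_long_tasks:
        return "L5"
    if self_evaluates:
        return "L4"
    if has_memory:
        return "L3"
    if multi_step:
        return "L2"
    return "L0-L1"
```

Checking from the top down matters: an agent that self-evaluates but has no memory still counts as L4, since each question subsumes the ones below it.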

Engineering Challenges by Tier

L0-L1: Simple

Main challenges are tool definition and intent classification.

L2: Moderate

# Challenges:
# - tool execution failure handling
# - tool chain observability
# - execution order optimization
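The first two challenges, failure handling and observability, can be addressed together with a plain retry wrapper around each step. This is a minimal sketch; the retry count, logging format, and backoff policy are illustrative choices:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def execute_with_retries(tool, args, max_attempts=3, backoff_s=0.0):
    """Run one tool-chain step with failure handling and step-level logging.

    `tool` is any callable; a real agent would dispatch by tool name.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**args)
            log.info("step ok: %s (attempt %d)", tool.__name__, attempt)
            return result
        except Exception as exc:
            log.warning("step failed: %s (attempt %d): %s",
                        tool.__name__, attempt, exc)
            time.sleep(backoff_s)  # placeholder for real backoff
    raise RuntimeError(f"{tool.__name__} failed after {max_attempts} attempts")
```

Logging every attempt per step is the cheapest form of tool-chain observability: when a run goes wrong, the log already tells you which step failed and how many times.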

L3: Complex

# Challenges:
# - memory retrieval relevance
# - state consistency
# - cross-channel state sync
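To make the memory-retrieval challenge concrete, here is a deliberately naive relevance ranker. Real systems typically use embedding similarity; the keyword-overlap score below is a stand-in, and both function names are hypothetical:

```python
def relevance(query: str, memory_entry: str) -> float:
    """Fraction of query words that also appear in the memory entry."""
    q = set(query.lower().split())
    m = set(memory_entry.lower().split())
    return len(q & m) / len(q) if q else 0.0

def get_relevant(query: str, memories: list[str], k: int = 3) -> list[str]:
    """Return the k entries most relevant to the query."""
    return sorted(memories, key=lambda m: relevance(query, m), reverse=True)[:k]
```

Even this toy version exposes the core problem: the score decides what context the LLM sees, so a bad relevance function silently degrades every downstream plan.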

L4: Very Hard

# Challenges:
# - how to define evaluation standards
# - retry strategy (avoid infinite loops)
# - boundaries of self-repair

Conclusion

Most “AI Agent” products today are actually L2-L3.

True L4 is rare, and L5 doesn’t exist. Devin is marketed as L4, but in practice it still needs human supervision.

When building Agent products, first clarify what tier you’re targeting:

  • L2 already solves many problems
  • L3 needs additional memory system
  • L4 needs self-evaluation framework

Don’t aim for L5 from the start; it isn’t realistic yet.