When AI Agents Go Off the Rails: From Model Worship to Environment Engineering

A real scenario.

You put an AI Agent in charge of code review. Day one goes well. Day two you ask it to continue, and it has forgotten yesterday’s review criteria and starts applying different logic. Day three it deletes code from a PR because it “looked unused.” Day four you find it has quietly modified config.yaml, though nobody asked it to. A week later you have inconsistent code styles, a broken test suite, and an Agent that went off the rails for reasons nobody can explain.

You start suspecting the model isn’t strong enough, so you switch to a stronger model. The same problems repeat.

The problem isn’t the model. It’s the runtime environment.

Models Are Strong, Yet Agents Keep Failing

The industry’s focus for the past few years has been “model capability”: GPT-4o is stronger than GPT-4, Claude Sonnet 4.x is stronger than its predecessors. Fair point.

But a stronger model doesn’t mean a stronger Agent.

A model trained only for single-turn conversations, when tasked with a week-long code project, won’t naturally understand “task state needs persistence,” “don’t modify modules I didn’t ask you to modify,” or “run tests before each commit.” These aren’t model flaws; the runtime environment simply provides no constraints.

What became clear in 2026 is that Agent reliability is determined by the Harness, not the model version. An Agent Harness is the engineering scaffold around the model: it manages context, enforces constraints, persists memory, and closes feedback loops.

An industry consensus is emerging: Agent = Model (brain) + Harness (body/environment). The model handles the reasoning; the Harness ensures the reasoning executes reliably.

Harness Engineering’s core question: how do you build a reliable, controllable, traceable runtime environment for AI Agents?

Four Pillars of Harness

1. Context Architecture: Don’t Stuff Garbage Into the Model’s Brain

“Context keeps growing, output quality keeps declining”—this is “context rot.”

The common approach is to stuff everything into context: system prompt, dozens of few-shot examples, all the project docs, recent conversation history. The context grows, the model wanders among irrelevant information and starts exhibiting “hallucinated instructions”: executing actions that were never in the prompt.

Even with Claude’s 200K token context window, the “lost in the middle” problem appears: the model’s attention drops significantly for information in the middle of the context, causing critical constraints to be quietly ignored. More tokens ≠ better output.

Context Architecture’s core is tiering, progressive disclosure, and RAG-powered dynamic retrieval:

System tier: fixed rules (coding style, architectural constraints, invariants)
   ↓ injected on demand
Task tier: current goal background, acceptance criteria
   ↓ injected on demand
Execution tier: specific step input/output, passed via Tool calls
   ↑ RAG retrieval
Vector knowledge base: project docs, past decisions, dependencies (pulled on demand, not pre-stuffed)

At each step, the model sees only what it actually needs. RAG (Retrieval-Augmented Generation) solves the “too many documents, can’t stuff them all” problem: the model queries the vector store for what it needs, rather than filtering through 300 pages itself.
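
As a concrete illustration, here is a minimal sketch of tiered assembly with on-demand retrieval, in Python. Every name in it (Task, VectorStore, build_context) is a hypothetical illustration rather than any specific framework’s API, and a real vector store would rank by embedding similarity instead of keyword overlap:

# Sketch: assemble per-step context from tiers instead of pre-stuffing everything.
# All names here are hypothetical illustrations, not a specific framework's API.
from dataclasses import dataclass

@dataclass
class Task:
    goal: str
    acceptance_criteria: list[str]

class VectorStore:
    """Toy stand-in for a real vector DB, which would rank by embedding similarity."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, top_k: int = 3) -> list[str]:
        words = query.lower().split()
        # Toy relevance score: keyword overlap between query and document.
        scored = sorted(self.docs, key=lambda d: -sum(w in d.lower() for w in words))
        return scored[:top_k]

SYSTEM_RULES = "Coding style, architectural constraints, invariants."  # system tier

def build_context(task: Task, step_input: str, store: VectorStore) -> str:
    retrieved = store.search(step_input)       # pulled on demand, not pre-stuffed
    return "\n\n".join([
        SYSTEM_RULES,                          # system tier: fixed rules
        f"Goal: {task.goal}",                  # task tier: goal and acceptance criteria
        "Acceptance: " + "; ".join(task.acceptance_criteria),
        *retrieved,                            # retrieved knowledge for this step
        step_input,                            # execution tier: the current step
    ])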

A more advanced approach is hierarchical memory: short-term context window + episodic external memory (vector DB) + dynamic summarization. The Agent loads specific memory blocks only when needed, preventing unbounded context growth.

2. Mechanical Constraints: What Code Can Block, Don’t Leave to Words

The prompt says “don’t modify the tests directory”; the model, under pressure, may ignore it.

That’s not the model’s fault. A prompt fundamentally “requests” behavior; it doesn’t “enforce” it. The model may weigh whether following the instruction helps the current task, then decide to skip it.

Mechanical Constraints use tools to enforce:

  • Linter blocks violations: CI runs static checks, Agent writes non-compliant code → gate fails
  • Architecture tests: ArchUnit defines component boundaries in code, Agent attempts cross-layer call → test fails
  • Sandbox isolation: Agent runs only in a controlled environment, unable to touch restricted system resources
  • Enforcement gates: no human override option, Agent cannot bypass verification

Prompt = suggestion, mechanical constraint = rule. Suggestions can be ignored. Rules cannot.
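
A minimal sketch of what such a gate can look like, assuming a Python check wired into CI; the protected paths and the git invocation are illustrative assumptions:

# CI gate sketch: fail the build if the diff touches protected paths.
import subprocess
import sys

PROTECTED = ("tests/", "config.yaml", ".github/")   # illustrative path rules

def changed_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> None:
    violations = [f for f in changed_files() if f.startswith(PROTECTED)]
    if violations:
        print("Gate failed, protected paths modified:", *violations, sep="\n  ")
        sys.exit(1)  # non-zero exit blocks the merge; there is no override flag

if __name__ == "__main__":
    main()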

3. Persistent Memory: Solving AI Amnesia

Models have no persistent state. Every conversation starts from scratch.

What was changed last round, how far we got, what comes next: the model doesn’t know. Sessions are completely isolated. Starting day two feels like onboarding a new hire.

The traditional solution is to “pass the context along,” stuffing history into new requests. But this doesn’t solve the root problem: the context fills up, you are passing messages rather than state, and cross-session sharing is still impossible.

The Harness solution is filesystem-first:

/workspace/
  AGENTS.md          # Project spec, machine-readable
  progress.json      # Current progress state
  memory/
    20260414.md      # Daily work log
    MEMORY.md        # Long-term memory (preferences, decisions, skills)

On startup, the Agent reads AGENTS.md and progress.json first and forms a complete picture of the project state. After execution, it writes the results back to the files. The next session reads the same files and continues seamlessly.
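
A minimal sketch of that read-and-write-back cycle, assuming the layout above; the fields inside progress.json are hypothetical:

# Sketch: persist task state to the workspace so the next session can resume.
# The progress.json fields ("done", "current_step") are hypothetical.
import json
from pathlib import Path

WORKSPACE = Path("/workspace")

def load_state() -> dict:
    spec = (WORKSPACE / "AGENTS.md").read_text()        # behavioral boundaries
    progress_path = WORKSPACE / "progress.json"
    if progress_path.exists():
        progress = json.loads(progress_path.read_text())
    else:
        progress = {"done": [], "current_step": None}   # fresh project
    return {"spec": spec, "progress": progress}

def save_progress(progress: dict, finished_step: str) -> None:
    progress["done"].append(finished_step)              # record what this session finished
    progress["current_step"] = None
    (WORKSPACE / "progress.json").write_text(json.dumps(progress, indent=2))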

An empirical study shows teams using AGENTS.md / CLAUDE.md config files achieve dramatically higher Agent reliability—not because the model is smarter, but because the environment gives the model clear behavioral boundaries.

4. Self-Verification Feedback Loops: Make the Agent Re-Understand Requirements, Not Just Re-Run

Traditional development: write code → run tests → fail → read error → fix code → re-run

Agent development adds a step: write code → run tests → fail → re-understand requirements → fix code → verify

That last step is the core difference. Agent failures often aren’t “wrote wrong”—they’re “understood wrong.” It wrote logically consistent code that doesn’t match requirements, based on wrong assumptions.
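
A sketch of that loop in Python; the agent callable is a hypothetical stand-in for whatever model invocation you use, and pytest stands in for your verification command:

# Self-verification loop sketch: on failure, hand back the requirements,
# not just the error, so the agent revisits its assumptions.
import subprocess
from typing import Callable

def run_tests() -> tuple[bool, str]:
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def repair_loop(agent: Callable[[str], None], requirements: str, max_rounds: int = 3) -> bool:
    for _ in range(max_rounds):
        ok, log = run_tests()
        if ok:
            return True                    # verification passed, loop closes
        agent(
            "Tests failed.\n"
            f"Requirements:\n{requirements}\n"
            f"Failure log:\n{log}\n"
            "Re-check your understanding of the requirements, then fix the code."
        )
    return False                           # escalate to a human after max_rounds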

Claude Code 2.0 introduced Checkpoints (restore points before significant edits, enabling rollback) and Subagents (delegating subtasks to multiple agents for parallel execution)—architectural reinforcement of the feedback loop: failures can roll back to checkpoints, subtask failures don’t cascade to the main task, and verification results can automatically trigger the next repair cycle.

Industry Practices: Not Just Concepts

Stripe Minions: 1,300+ AI-Written PRs Per Week

Stripe’s Minions system is one of the largest production Agent systems publicly disclosed. More than 1,300 PRs per week are written entirely by AI—not a single line by humans—yet every PR is human-reviewed before merge.

Activation is dead simple: react to a Slack message with an emoji, and a Minion spins up and autonomously completes the entire task with no further human prompts. Engineers can launch multiple Minions simultaneously for parallel execution.

The technical core is Blueprints—orchestration flows that split tasks into deterministic nodes (fixed logic) and flexible agent tasks (open-ended AI decisions). Stripe uses this to decide which steps need strict rule enforcement and which steps get handed to the AI to figure out.
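
Stripe hasn’t published Blueprints’ code; purely as a hypothetical illustration, the deterministic-node / agent-task split could look like this:

# Purely hypothetical sketch of a blueprint-style flow; Stripe's actual
# Blueprints implementation is not public.
from typing import Callable

def call_model(prompt: str) -> str:
    return "patch: ..."                     # stand-in for a real model invocation

def fetch_ticket(ctx: dict) -> dict:        # deterministic node: fixed logic
    ctx["ticket"] = f"details for ticket {ctx['ticket_id']}"
    return ctx

def agent_write_fix(ctx: dict) -> dict:     # agent task: open-ended model decision
    ctx["patch"] = call_model(f"Write a patch for: {ctx['ticket']}")
    return ctx

def run_blueprint(nodes: list[Callable[[dict], dict]], ctx: dict) -> dict:
    for node in nodes:                      # each node transforms shared task context
        ctx = node(ctx)
    return ctx

result = run_blueprint([fetch_ticket, agent_write_fix], {"ticket_id": 42})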

The underlying Harness is a customized version of Goose (Block’s open-source agent harness), tailored for Stripe’s scale. Each Minion runs in an isolated AWS EC2 instance with full shell access, with risk contained by the sandbox boundary.

The tool layer is called Toolshed: roughly 500 internal tools and APIs, with the system dynamically surfacing only the tools needed for each specific task rather than dumping all 500 on the model. Every PR must pass rigorous automated tests, static analysis, and Stripe’s full code review bar before merge.

This is mechanical constraint enforcement in full production: Slack trigger → Minion autonomous execution → full CI check → human review → merge. No humans in the execution loop; humans only review the final output.

OpenAI Codex: 30 Million Weekly Active Users, 90% of Fortune 100

The OpenAI Codex CLI launched in April 2025. By April 2026, it had reached 30 million weekly active users and 74,000 GitHub Stars, adopted by 90% of Fortune 100 companies.

Each task runs in an isolated cloud sandbox preloaded with the user’s code, supporting code execution, linting, and testing—with traceable command logs. Tasks complete in 1–30 minutes, with output including detailed command logs, diffs, and test results for human review.

GPT-5.3-Codex scored 77.3% on Terminal-Bench and 56.8% on SWE-bench Pro—among the strongest publicly available coding agent benchmarks. As of 2026 it supports parallel task execution and multimodal inputs.

What are humans doing? Designing Harnesses, designing feedback loops, defining verification standards, then letting the Agent run within that framework. This is the ideal state of Harness Engineering: humans do architecture, Agent does implementation, environment does verification.

Anthropic Claude Code: Scala to Java in Minutes, Not Weeks

Claude Code is currently deployed to 100,000+ Cognizant associates for coding, documentation, and DevOps automation.

A real case: a Scala → Java code migration, originally estimated at weeks of work for multiple engineers, completed autonomously by Claude Code in minutes. It can read logs and stack traces, diagnose production incidents, propose fix commands, and execute them.

Claude Code 2.0’s core capabilities: Checkpoints (restore points before significant edits, allowing rollback), Subagents (distributing subtasks to multiple agents for parallel execution), and self-verification loops. This is how Anthropic teams use it internally: CI integration, incident response, and daily code review—all as part of Agent workflows.

The key config is CLAUDE.md: a project-level config file that defines agent behavioral boundaries, project norms, and tool-use constraints. An empirical study shows projects with CLAUDE.md achieve dramatically higher Agent reliability—not because the model is smarter, but because the environment is clearer.

This isn’t the agent getting smarter—it’s the environment helping the Agent remember context, stay in bounds, and roll back when needed.

GitHub Copilot Agent Mode: The Most Widely Deployed Harness in Production

GitHub Copilot surpassed 20 million cumulative users by July 2025 and is likewise adopted by 90% of Fortune 100 companies.

The numbers speak: developers using Copilot complete tasks 51–55% faster; PR cycle time dropped from 9.6 days to 2.4 days, a 75% reduction. On average 46% of code written by users is Copilot-generated, with developers retaining 88% of suggestions.

Agent Mode is Copilot’s Harness layer—not just autocomplete, but a complete Agent environment capable of planning tasks, calling tools, modifying files, running tests, and self-validating. By mid-2025, 16–23% of active GitHub projects had adopted Agent Mode.

Copilot’s Harness chains together IDE, CI, code search, and test execution into a complete feedback loop. Model does reasoning, toolchain does constraints, environment does verification.

Why Prompt Engineering Isn’t Enough Anymore

                Prompt Engineering                   Harness Engineering
Focus           Single interaction quality           Full system lifecycle reliability
Scope           Model prompt                         Environment, tools, constraints, state
Failure mode    Prompt ignored                       Gate fails
Persistence     Stateless, starts fresh each time    Stateful, continuous across sessions
Scalability     Single-task optimization             Multi-agent, multi-task coordination

This isn’t to say Prompt Engineering is useless. The two simply operate at different layers: Prompt Engineering is tactical optimization, Harness Engineering is strategic architecture.

You can optimize the Prompt in a system without a Harness, to limited effect. You can use a naive Prompt in a system with a complete Harness, and the system still runs reliably.

Stripe, OpenAI, and Anthropic are all saying the same thing: no matter how well-crafted your Prompt is, it can’t substitute for a rigorous runtime environment.

Current Toolchain

  • AGENTS.md / CLAUDE.md: Industry standard for storing project specs and machine-readable instruction sets. An empirical study confirms significant reliability gains

  • MCP (Model Context Protocol): modelcontextprotocol.io, the standardized protocol connecting Agents to external tools. 97 million monthly SDK downloads, nearly 20,000 servers indexed, governance transferred to the Linux Foundation’s Agentic AI Foundation in December 2025. Called “USB-C for AI.” The 2026 roadmap includes stateless transport, session migration, and /.well-known server discovery (a minimal server sketch follows this list)

  • Goose by Block: Open-source agent harness supporting middleware and tool constraints; Stripe Minions runs on a customized version of Goose

  • GitHub Copilot Agent Mode: The most widely deployed Harness in production, adopted by 90% of Fortune 100
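
For reference, a minimal MCP server using the official Python SDK (the mcp package on PyPI); the search_docs tool itself is a toy example:

# Minimal MCP server using the official Python SDK ("mcp" on PyPI).
# The search_docs tool is a toy; a real server would query actual project docs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("project-docs")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search project documentation (toy implementation)."""
    return f"No results for {query!r} (stub)"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; clients connect via the MCP protocol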

Closing

Harness Engineering’s essence is turning “how to make AI work reliably” from mysticism into engineering.

When an Agent fails, you shouldn’t ask whether the model is strong enough; you should ask whether the Harness is strict enough: did the context rot, were constraints bypassed, was state lost, is the feedback loop closed?

Stripe’s 1,300 AI PRs per week, OpenAI Codex’s 30 million weekly active users, Copilot’s 75% PR cycle time reduction—behind these numbers isn’t a stronger model, it’s a more rigorous runtime environment.

The model is the brain; the Harness is the body. A model without a Harness is an eloquent empty shell: it looks like it can move, but falls apart under real pressure.