When AI Agents Go Off the Rails: From Model Worship to Environment Engineering

A real scenario.

You put an AI Agent in charge of code review. Day one goes well. Day two you ask it to continue, and it has forgotten yesterday’s review criteria and starts applying different logic. Day three it deletes code from a PR because it “looked unused.” A week later you have inconsistent code styles, a broken test suite, and an Agent that went off the rails for reasons nobody can explain.

You start suspecting the model isn’t strong enough. Switch to a stronger model. Same problems repeat.

The problem isn’t the model. It’s the runtime environment.

Models Are Strong, Yet Agents Keep Failing

The industry’s focus for the past few years has been “model capability.” GPT-4o is stronger than GPT-4, Claude 3.7 stronger than 3.5. Fair point.

But a stronger model doesn’t mean a stronger Agent.

A model trained only on single-turn conversations, handed a week-long coding project, won’t naturally understand that task state needs persistence, that it shouldn’t modify modules you didn’t ask about, or that tests must run before each commit. These aren’t model flaws; the runtime environment simply provides no constraints.

Harness Engineering’s core question: how do you build a reliable, controllable, traceable runtime environment for AI Agents.

Industry consensus emerging: Agent = Model (brain) + Harness (body/environment). Model handles reasoning, Harness ensures reasoning results execute reliably.

Four Pillars of Harness

1. Context Architecture: Don’t Stuff Garbage Into the Model’s Brain

“Context keeps growing, output quality keeps declining”—this is “context rot.”

Common approach: stuff everything into context. System prompt, dozens of few-shot examples, all project docs, recent conversation history. Context grows, model wanders in irrelevant information, starts exhibiting “hallucinated instructions”—executing actions that were never in the prompt.

Context Architecture’s core is tiering and progressive disclosure:

System tier: fixed rules (coding style, architectural constraints, invariants)
   ↓ injected on demand
Task tier: current goal background, acceptance criteria
   ↓ injected on demand
Execution tier: specific step input/output, passed via Tool calls

Model at each step sees only what it actually needs—no filtering through a pile of accumulated documents.
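The tiering above can be sketched as a context assembler. This is a minimal illustration with hypothetical content strings, not a real framework: the point is that each tier is injected per step, instead of one ever-growing prompt.

```python
# Tiered context assembly, a minimal sketch (all content strings are made up).
# System tier is always present; task and execution tiers are injected on demand.

SYSTEM_TIER = "Coding style: PEP 8. Never modify files under tests/."

def assemble_context(task_goal: str, step_input: str, *, include_task: bool = True) -> list[dict]:
    """Build the message list for one step, injecting only the tiers it needs."""
    messages = [{"role": "system", "content": SYSTEM_TIER}]  # fixed rules
    if include_task:  # task background lives only while this task is active
        messages.append({"role": "user", "content": f"Current goal: {task_goal}"})
    # execution tier: only this step's input, not the accumulated history
    messages.append({"role": "user", "content": step_input})
    return messages

ctx = assemble_context("Refactor the payment module", "Step 3: extract the retry logic")
# The model sees exactly three messages, not a pile of documents.
```

Dropping a tier is as simple as not injecting it, which is what keeps context rot from setting in.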

2. Mechanical Constraints: What Code Can Block, Don’t Leave to Words

A prompt says “don’t modify the tests directory”; a model under pressure may ignore it.

That’s not the model’s fault. Prompts fundamentally “request” behavior, they don’t “enforce” it. The model may weigh: would touching those tests help finish the task faster? And decide that it would.

Mechanical Constraints use tools to enforce, not model self-restraint:

  • Linter blocks violations: CI runs static checks, Agent writes non-compliant code → gate fails
  • Architecture tests: ArchUnit defines component boundaries in code, Agent attempts cross-layer call → test fails
  • Enforcement gates: no human override option, Agent cannot bypass verification
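An architecture test of the ArchUnit flavor can be expressed in a few lines. This is a hedged sketch with made-up layer names (`ui`, `db`): a CI step scans each layer’s source for forbidden cross-layer imports, and any match fails the gate with no override.

```python
import re

# Mechanical constraint as code: the ui layer must never import the db layer.
# Layer names here are hypothetical; adapt the map to your architecture.
FORBIDDEN = {"ui": ["db"]}  # layer -> layers it must not import

def check_boundaries(layer: str, source: str) -> list[str]:
    """Return every forbidden import found in `source`; empty means the gate passes."""
    violations = []
    for banned in FORBIDDEN.get(layer, []):
        # matches "import db..." or "from db... import ..." at line start
        if re.search(rf"^\s*(from|import)\s+{banned}\b", source, re.MULTILINE):
            violations.append(f"{layer} imports {banned}")
    return violations
```

Run it over every file in CI; the Agent can argue with a prompt, but not with a failing exit code.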

Stripe’s Minions system runs exactly like this. Agent writes code → CI runs tests → fails → Slack notification → Agent reads error, fixes code, re-runs tests. No human in the loop.

Prompt = suggestion, mechanical constraint = rule. Suggestions can be ignored. Rules cannot.

3. Persistent Memory: Solving AI Amnesia

Models have no persistent state. Every conversation starts from scratch.

What was changed last round, how far we got, what comes next: the model doesn’t know. Sessions are completely isolated. Starting day two feels like onboarding a new hire.

Traditional solution: “pass the context along,” stuffing history into new requests. But this doesn’t fix the root problem: the context window still fills up, you’re passing messages rather than state, and cross-session sharing remains impossible.

Harness solution: filesystem-first:

/workspace/
  AGENTS.md          # Project spec, machine-readable
  progress.json      # Current progress state
  memory/
    20260414.md      # Daily work log
    MEMORY.md        # Long-term memory (preferences, decisions, skills)

On startup, the Agent reads AGENTS.md and progress.json first and forms a complete picture of project state. After executing, it writes results back to the files. The next session reads the same files and continues seamlessly.
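The read-on-startup, write-back-on-finish cycle can be sketched in a few lines. The workspace path and state keys below are illustrative, not a standard:

```python
import json
from pathlib import Path

# Filesystem-first memory, a minimal sketch: read progress.json on startup,
# write it back after each step, so a new session resumes where the last stopped.
WORKSPACE = Path("/tmp/workspace")  # stand-in for the /workspace layout above

def load_progress() -> dict:
    path = WORKSPACE / "progress.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"completed_steps": [], "next_step": None}  # fresh project

def save_progress(state: dict) -> None:
    WORKSPACE.mkdir(parents=True, exist_ok=True)
    (WORKSPACE / "progress.json").write_text(json.dumps(state, indent=2))

# Session 1 finishes a step and records it; session 2 just calls load_progress().
state = load_progress()
state["completed_steps"].append("refactor payment module")
state["next_step"] = "add retry tests"
save_progress(state)
```

The state lives in the environment, not in the conversation, which is the whole point.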

OpenClaw’s tiered memory is a practical implementation: Daily Notes are raw logs, MEMORY.md is long-term memory, Agent reads these files on each startup, forming continuity.

4. Self-Verification Feedback Loops: Make Agent Re-understand Requirements, Not Just Re-run

Traditional development: write code → run tests → fail → read error → fix code → re-run

Agent development adds a step: write code → run tests → fail → re-understand requirements → fix code → verify

That last step is the core difference. Agent failures often aren’t “wrote wrong”—they’re “understood wrong.” It wrote logically consistent code that doesn’t match requirements, based on wrong assumptions.

Ralph Loop is a typical implementation: Agent submits code, test fails, exit signal intercepted, Agent must re-evaluate its code against original requirements rather than superficially fixing based on error messages.
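The loop can be sketched as follows. `run_tests` and `ask_agent` are hypothetical stand-ins for your test runner and model call; the key line is that on failure the Agent is handed the original requirements back, not just the error message:

```python
# Self-verification loop, a sketch under stated assumptions:
# run_tests(code) -> (passed: bool, error: str)
# ask_agent(requirements, feedback) -> new code string

def self_verify_loop(requirements: str, run_tests, ask_agent, max_rounds: int = 5) -> str:
    code = ask_agent(requirements, feedback=None)
    for _ in range(max_rounds):
        passed, error = run_tests(code)
        if passed:
            return code
        # re-understand: pair the failure with the ORIGINAL requirements,
        # forcing a re-read instead of a superficial patch of the error
        code = ask_agent(requirements, feedback=error)
    raise RuntimeError("verification never passed; escalate to a human")
```

Without the `requirements` argument on every retry, the Agent optimizes against the error message; with it, it re-checks its assumptions.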

Industry Practices: Not Just Concepts

OpenAI Codex: 5 Months, 1 Million Lines, No Human-Written Code

The OpenAI Codex team produced 1 million lines of code in 5 months; human engineers never wrote a single line throughout.

What were humans doing? Designing Harnesses, designing feedback loops, defining verification standards, then letting Agent run within that framework.

This is the ideal state of Harness Engineering: humans do architecture, Agent does implementation, environment does verification.

Anthropic: Cross-Session Progress Persistence

Through claude-progress.txt and Puppeteer browser automation, Anthropic built Agents capable of multi-day complex tasks.

The Agent didn’t get smarter; the environment helped it remember context. It can pause mid-execution, resume, and not lose state.

Stripe Minions: Unattended Development Loop

Agent autonomously completes full workflow from writing code, to passing CI, to filing PRs on Slack.

CI is the feedback loop, Slack is the notification mechanism. Combined, they form a complete Agent work cycle.

Why Prompt Engineering Isn’t Enough Anymore

              Prompt Engineering          Harness Engineering
Focus         Single-interaction quality  Full-system lifecycle reliability
Scope         The model prompt            Environment, tools, constraints, state
Failure mode  Prompt ignored              Gate fails

This isn’t to say Prompt Engineering is useless. They’re not at the same layer. Prompt Engineering is tactical optimization, Harness Engineering is strategic architecture.

You can optimize the Prompt in a system without a Harness; the effect is limited. You can use a naive Prompt in a system with a complete Harness; the system still runs reliably.

Current Toolchain

  • AGENTS.md / CLAUDE.md: Industry standard, storing project specs and machine-readable instruction sets
  • MCP (Model Context Protocol): Standardized protocol connecting Agents to external tools
  • DeepAgents: LangChain’s open-source Harness framework, supporting middleware and self-verification logic

Closing

Harness Engineering’s essence is turning “how to make AI work reliably” from mysticism into engineering.

When an Agent fails, you shouldn’t check if the model is strong enough—you should check if the Harness is strict enough: did context leak, were constraints bypassed, was state lost, is the feedback loop closed.

Model is brain, Harness is body. A model without Harness is an eloquent empty shell—it looks like it can move, but falls apart under real pressure.