Contents

AI Agent Architecture Patterns: Model + Harness + Memory

Rethinking “Agent”

The word “Agent” is overused. Everyone claims to have built one, but look at the implementation and 90% of the time it is just:

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_input}]
)
return response.choices[0].message.content

This isn’t an Agent. It’s an API wrapper.

Real Agents have these characteristics:

  1. Tool Use: Can call tools, not just generate text
  2. Long-term Memory: Retains memory across sessions
  3. Planning/Reasoning: Can decompose tasks, plan steps
  4. Autonomous Action: Makes decisions within constraints

Only when these capabilities are combined does it qualify as an Agent.

Three Core Components

1. Model — Reasoning Engine

The Model is the Agent’s “brain”: it understands input, plans, and generates output.

Selection considerations:

  • Capability: Reasoning ability, context length
  • Cost: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro
  • Tool calling: Some models have native tool calling support (GPT-4o, Claude 3.5 Sonnet)

2. Harness — Execution Environment

Harness bridges Model and the real world. It handles:

  • Managing tool definitions and execution
  • Processing tool returns, injecting into context
  • Controlling token budget (avoiding context overflow)
  • Error handling and retry logic

Harness quality directly determines Agent stability.
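One concrete harness responsibility worth sketching is token-budget control. The sketch below keeps a message list within a budget by preserving the first (system) message and dropping the oldest turns; the 4-characters-per-token ratio is a rough heuristic of mine, not an exact tokenizer.

```python
# Minimal sketch of context trimming in a harness. The token estimate is a
# crude approximation (~4 chars/token); a real harness would use the model's
# tokenizer.

def estimate_tokens(message: dict) -> int:
    # +4 accounts roughly for role/formatting overhead per message
    return len(message.get("content") or "") // 4 + 4

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the first (system) message plus as many recent messages as fit."""
    if not messages:
        return []
    head, tail = messages[:1], messages[1:]
    used = estimate_tokens(head[0])
    kept = []
    for msg in reversed(tail):  # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return head + list(reversed(kept))
```

The key design choice is dropping from the middle, not the ends: the system prompt and the most recent turns carry the most signal.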

3. Memory — Knowledge and State

Memory has several layers:

Type       | Duration       | Purpose
-----------|----------------|------------------------------------
Context    | Single session | Current conversation window
Short-term | 24-48h         | Recent interaction history
Long-term  | Persistent     | Cross-session facts and preferences

Common Architecture Patterns

Pattern 1: Tool Calling Agent (Simplest)

Model has native tool calling support. Harness manages tool registration and execution.

User → Harness: "check the weather"
       ↓
    Model: needs to call weather_tool
       ↓
    Harness: execute tool, return result
       ↓
    Model: generate final response
       ↓
    Harness → User

Best for: Single-step tool calls, no complex planning needed.

Code example:

class ToolCallingAgent:
    def __init__(self, model, tools):
        self.model = model
        self.tools = {t.name: t for t in tools}
        self.tool_schemas = [t.schema for t in tools]  # JSON schemas sent to the API
    
    def run(self, user_input, max_turns=10):
        messages = [{"role": "user", "content": user_input}]
        
        for _ in range(max_turns):
            response = self.model.chat(messages, tools=self.tool_schemas)
            
            if response.finish_reason == "tool_calls":
                # Keep the assistant's tool-call message in the transcript
                messages.append(response.message)
                for tool_call in response.tool_calls:
                    result = self.tools[tool_call.name].execute(**tool_call.args)
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": result
                    })
            else:
                return response.content
        
        raise RuntimeError("max_turns exceeded without a final response")

Pattern 2: ReAct Agent (Plan + Execute)

ReAct = Reasoning + Acting. The Model outputs a thought (reasoning), then an action, observes the result, and thinks again.

Thought: User wants weather info, need to call weather API
Action: weather_tool(location=Beijing)
Observation: Sunny, 25°C
Thought: User asked for clothing advice, combine with weather
Action: null (direct response)

Best for: Multi-step reasoning, needs environmental feedback.
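The loop above can be sketched in a few lines. The `model` here is a hypothetical stand-in (any callable that emits `Thought:`/`Action:` text); a real implementation would call an LLM and parse its output more robustly.

```python
# Minimal ReAct loop sketch. `model` is any callable taking the transcript
# and returning "Thought: ...\nAction: tool(arg)" text; `tools` maps names
# to callables. Both are illustrative assumptions.
import re

def react_loop(model, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = model(transcript)
        transcript += output + "\n"
        match = re.search(r"Action:\s*(\w+)\((.*)\)", output)
        if not match:
            # No action requested -> treat the last thought as the answer
            return output.split("Thought:")[-1].strip()
        name, arg = match.group(1), match.group(2)
        observation = tools[name](arg) if name in tools else f"Unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return "Gave up after max_steps"
```

The essential point is that the Observation is fed back into the transcript, so the next Thought is grounded in real tool output rather than the Model’s guess.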

Pattern 3: Plan-and-Execute (Separate Planning from Execution)

The Model first generates a complete plan, which is then executed step by step.

class PlanAndExecuteAgent:
    def plan(self, task):
        # Model generates a step-by-step plan
        return self.model.plan(task)  # ["step1", "step2", "step3"]
    
    def execute(self, task):
        plan = self.plan(task)
        results = []
        while plan:
            step = plan.pop(0)
            result = self.execute_step(step)
            results.append(result)
            # Check the result, decide whether to adjust the remaining plan
            if self.should_replan(step, result):
                plan = self.replan(task, results)  # regenerate remaining steps
        return results

Best for: Complex tasks requiring global planning.

Pattern 4: Multi-Agent Orchestration

Multiple Agents collaborate, each specializing in one domain.

User → Orchestrator Agent
              ├── Code Agent (write code)
              ├── Research Agent (research)
              └── Review Agent (review)

Best for: Complex tasks requiring multiple specialties.

Key design question: How do Orchestrator and sub-Agents communicate?

Option A: Shared message queue
Option B: Hierarchical (Orchestrator directly calls sub-Agents)
Option C: Voting/consensus mechanism
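A hierarchical design (Option B) can be sketched as below. The agent names and the keyword-based routing rule are illustrative assumptions; a real Orchestrator would typically ask the Model to classify the task.

```python
# Sketch of hierarchical orchestration: the Orchestrator classifies the task
# and delegates to one specialist sub-Agent. Routing here is a keyword
# placeholder, not a real classifier.

class Orchestrator:
    def __init__(self, agents: dict):
        # e.g. {"code": code_agent, "research": research_agent, "review": review_agent}
        self.agents = agents

    def classify(self, task: str) -> str:
        # Placeholder routing; swap in an LLM-based classifier in practice
        if "review" in task:
            return "review"
        if "research" in task or "compare" in task:
            return "research"
        return "code"

    def run(self, task: str) -> str:
        agent = self.agents[self.classify(task)]
        return agent(task)
```

The tradeoff versus a shared queue: hierarchical calls are simple and traceable, but the Orchestrator becomes a single bottleneck and sub-Agents cannot talk to each other directly.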

Memory Implementation: Pitfalls and Tradeoffs

Memory is where most problems occur.

Context Window is Scarce

GPT-4o has a 128k context window and Claude 3.5 Sonnet has 200k. That sounds large, but:

  • 1k lines of code ≈ 4k tokens
  • A medium project codebase might be 10k-50k lines
  • Historical conversation accumulates fast

Common mistake: appending to the context without bound until the token budget blows up.

Solutions

  1. Semantic Search: Only retrieve relevant content into context
  2. Summarization: Periodically compress history
  3. Hierarchical Memory: Only put “important” facts in short-term
  4. Codebase Index: Codebase as separate vector index, not in conversation context

class HierarchicalMemory:
    MAX_SHORT_TERM = 20  # keep the most recent N turns verbatim

    def __init__(self):
        self.short_term = []  # recent turns, stored verbatim
        self.long_term = VectorStore()  # vector store of summarized facts
    
    def add(self, turn):
        self.short_term.append(turn)
        if len(self.short_term) > self.MAX_SHORT_TERM:
            # Compress the oldest turn into a summary before evicting it
            summary = self.summarize(self.short_term.pop(0))
            self.long_term.add(summary)
    
    def get_context(self, query):
        # Verbatim short-term turns + semantically relevant long-term facts
        return self.short_term + self.long_term.search(query)

Production Considerations

1. Cost Control

LLM calls are the main cost source.

  • Set max_tokens per request
  • Monitor token consumption per user session
  • Consider smaller models for simple tasks
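The third point can be made concrete with a routing heuristic. Model tiers, prices, and thresholds below are illustrative assumptions, not recommendations:

```python
# Sketch of cost-aware model routing: short, tool-free tasks go to a cheap
# model. Prices and the length threshold are made-up placeholders.

PRICES = {"small": 0.15, "large": 2.50}  # hypothetical $ per 1M input tokens

def pick_model(task: str, needs_tools: bool) -> str:
    # Heuristic: only escalate when the task needs tools or is long
    if not needs_tools and len(task) < 200:
        return "small"
    return "large"

def estimated_cost(model: str, input_tokens: int) -> float:
    return PRICES[model] * input_tokens / 1_000_000
```

Even a crude router like this can cut spend substantially, since most production traffic skews toward simple requests.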

2. Error Handling

Model output is unpredictable. Production must handle:

  • Invalid JSON
  • Tool call params not matching schema
  • Tool execution timeout/failure
  • Model “hallucinating” tool names that don’t exist

from json import JSONDecodeError

def safe_execute(tool_call, tools, max_retries=2):
    for attempt in range(max_retries):
        try:
            return tools[tool_call.name].execute(**tool_call.args)
        except KeyError:
            # Hallucinated tool name -- retrying won't help
            return f"Error: Tool '{tool_call.name}' not found"
        except JSONDecodeError:
            # Malformed arguments; retry, then give up
            if attempt == max_retries - 1:
                return "Error: Invalid tool arguments"

3. Observability

Agent execution chain must be traceable.

  • Input/output of every tool call
  • Duration of each step
  • Source of final output (which tool’s result)
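The first two points can be covered by wrapping every tool in a tracing layer. This is a minimal sketch: `TRACE` is a process-local list here, where production would ship records to a tracing backend.

```python
# Minimal tool-call tracing sketch: record input, output, and duration of
# every call. TRACE is an in-memory stand-in for a real tracing backend.
import time

TRACE: list[dict] = []

def traced(name, fn):
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": name,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "duration_s": time.monotonic() - start,
        })
        return result
    return wrapper
```

Registering tools through `traced(...)` means the third point (which tool produced the final output) falls out for free: the last trace record before the final response names it.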

Conclusion

Core Agent architecture is Model + Harness + Memory:

  • Model provides reasoning
  • Harness bridges Model and real world
  • Memory provides persistence and context

Which pattern to choose depends on task complexity:

  • Simple tasks → Tool Calling Agent
  • Needs reasoning → ReAct
  • Complex multi-step → Plan-and-Execute
  • Ultra complex → Multi-Agent

No silver bullet. Production systems often combine patterns, even switching dynamically based on task type.