AI Agent Architecture Patterns: Model + Harness + Memory
Rethinking “Agent”
The word “Agent” is overused. Everyone claims to have built an Agent, but look at the implementation and 90% of the time it's just:

```python
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_input}]
)
return response.choices[0].message.content
```

This isn’t an Agent. It’s an API wrapper.
Real Agents have these characteristics:
- Tool Use: Can call tools, not just generate text
- Long-term Memory: Retains memory across sessions
- Planning/Reasoning: Can decompose tasks, plan steps
- Autonomous Action: Makes decisions within constraints
Only when these capabilities are combined does it become an Agent.
Three Core Components
1. Model — Reasoning Engine
The Model is the Agent’s “brain”: it understands input, plans, and generates output.
Selection considerations:
- Capability: Reasoning ability, context length
- Cost: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro
- Tool calling: Some models have native tool calling support (GPT-4o, Claude 3.5 Sonnet)
2. Harness — Execution Environment
The Harness bridges the Model and the real world. It handles:
- Managing tool definitions and execution
- Processing tool returns, injecting into context
- Controlling token budget (avoiding context overflow)
- Error handling and retry logic
Harness quality directly determines Agent stability.
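As a concrete illustration of the first two responsibilities, here is a minimal sketch of a harness-side tool registry. The `Tool` and `Harness` shapes are assumptions for illustration, not the API of any particular framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., str]

class Harness:
    def __init__(self, tools):
        self.tools = {t.name: t for t in tools}

    def execute(self, name, **args):
        if name not in self.tools:
            # Surface the error back to the Model instead of crashing the loop
            return f"Error: unknown tool '{name}'"
        try:
            return self.tools[name].fn(**args)
        except TypeError as e:
            # Arguments didn't match the tool's signature
            return f"Error: bad arguments for '{name}': {e}"
        except Exception as e:
            return f"Error: '{name}' failed: {e}"
```

Note that every failure path returns a string rather than raising: the Model can read the error and retry, which is exactly the stability the Harness is responsible for.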
3. Memory — Knowledge and State
Memory has several layers:
| Type | Duration | Purpose |
|---|---|---|
| Context | Single session | Current conversation window |
| Short-term | 24-48h | Recent interaction history |
| Long-term | Persistent | Cross-session facts and preferences |
Common Architecture Patterns
Pattern 1: Tool Calling Agent (Simplest)
Model has native tool calling support. Harness manages tool registration and execution.
```
User → Harness: "check the weather"
        ↓
Model: needs to call weather_tool
        ↓
Harness: execute tool, return result
        ↓
Model: generate final response
        ↓
Harness → User
```

Best for: Single-step tool calls, no complex planning needed.
Code example:

```python
class ToolCallingAgent:
    def __init__(self, model, tools):
        self.model = model
        self.tools = {t.name: t for t in tools}
        # Assumes each tool exposes a JSON schema for the model
        self.tool_schemas = [t.schema for t in tools]

    def run(self, user_input, max_turns=10):
        messages = [{"role": "user", "content": user_input}]
        for _ in range(max_turns):
            response = self.model.chat(messages, tools=self.tool_schemas)
            if response.finish_reason == "tool_calls":
                # Record the assistant's tool-call message before the results
                messages.append(response.message)
                # Execute all requested tools and feed results back
                for tool_call in response.tool_calls:
                    result = self.tools[tool_call.name].execute(**tool_call.args)
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": result
                    })
            else:
                return response.content
        raise RuntimeError("No final response within max_turns")
```

Pattern 2: ReAct Agent (Reason + Act)
ReAct = Reasoning + Acting. Model outputs thought (reasoning), then action, observes result, then thinks again.
```
Thought: User wants weather info, need to call weather API
Action: weather_tool(location=Beijing)
Observation: Sunny, 25°C
Thought: User asked for clothing advice, combine with weather
Action: null (direct response)
```

Best for: Multi-step reasoning, needs environmental feedback.
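The thought/action/observation cycle can be sketched as a loop. The `model.step` and `model.answer` interfaces here are illustrative assumptions, not a real client API:

```python
def react_loop(model, tools, task, max_steps=8):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Model reads the history and produces (thought, action, args)
        thought, action, args = model.step("\n".join(history))
        history.append(f"Thought: {thought}")
        if action is None:
            # No more actions needed: generate the final answer
            return model.answer("\n".join(history))
        observation = tools[action](**args)
        history.append(f"Action: {action}({args})")
        history.append(f"Observation: {observation}")
    raise RuntimeError("ReAct loop did not finish within max_steps")
```

The key design point is that observations are appended back into the history, so each new thought is grounded in real environmental feedback rather than the Model's guess about what a tool would return.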
Pattern 3: Plan-and-Execute (Separate Planning from Execution)
The Model first generates a complete plan, then executes it step by step.
```python
class PlanAndExecuteAgent:
    def plan(self, task):
        # Model generates a step-by-step plan, e.g. ["step1", "step2", "step3"]
        return self.model.plan(task)

    def execute(self, plan):
        results = []
        remaining = list(plan)
        while remaining:
            step = remaining.pop(0)
            result = self.execute_step(step)
            results.append(result)
            # Check the result; re-plan the remaining steps if needed
            if self.should_replan(step, result):
                remaining = self.plan(results)
        return results
```

Best for: Complex tasks requiring global planning.
Pattern 4: Multi-Agent Orchestration
Multiple Agents collaborate, each specializing in one domain.
```
User → Orchestrator Agent
        ├── Code Agent (write code)
        ├── Research Agent (research)
        └── Review Agent (review)
```

Best for: Complex tasks requiring multiple specialties.
Key design question: How do Orchestrator and sub-Agents communicate?
- Option A: Shared message queue
- Option B: Hierarchical (Orchestrator directly calls sub-Agents)
- Option C: Voting/consensus mechanism
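A minimal sketch of Option B, hierarchical orchestration. Keyword-based routing stands in for a model-driven routing decision, and all class and agent names are illustrative:

```python
class Orchestrator:
    def __init__(self, agents):
        # e.g. {"code": CodeAgent(), "research": ResearchAgent(), "review": ReviewAgent()}
        self.agents = agents

    def route(self, task):
        # In practice a model picks the specialist; keywords keep this runnable
        for name in self.agents:
            if name in task.lower():
                return name
        return "research"  # default specialist

    def run(self, task):
        result = self.agents[self.route(task)].run(task)
        # Every result passes through review before reaching the user
        return self.agents["review"].run(result)
```

The hierarchical option keeps control flow simple and debuggable, at the cost of the Orchestrator becoming a bottleneck; a shared queue (Option A) trades that for looser coupling.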
Memory Implementation: Pitfalls and Tradeoffs
Memory is where most problems occur.
Context Window is Scarce
GPT-4o offers a 128k-token context window, Claude 3.5 Sonnet 200k. That sounds large, but:
- 1k lines of code ≈ 4k tokens
- A medium project codebase might be 10k-50k lines
- Historical conversation accumulates fast
Common mistake: stuffing unbounded content into the context until the token count explodes.
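One minimal guard is to trim the oldest history until the estimated token count fits the budget. The 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```python
def trim_to_budget(messages, max_tokens=8000):
    def estimate(msg):
        # Rough heuristic: ~4 characters per token
        return len(msg["content"]) // 4 + 1

    trimmed = list(messages)
    # Keep the first (system/task) message; drop from the oldest history
    while len(trimmed) > 1 and sum(estimate(m) for m in trimmed) > max_tokens:
        trimmed.pop(1)
    return trimmed
```

This is the bluntest of the strategies below; summarization and semantic retrieval preserve more information for the same budget.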
Solutions
- Semantic Search: Only retrieve relevant content into context
- Summarization: Periodically compress history
- Hierarchical Memory: Keep recent turns verbatim in short-term; promote only “important” facts to long-term
- Codebase Index: Codebase as separate vector index, not in conversation context
For example, hierarchical memory might look like:

```python
MAX_SHORT_TERM = 20  # keep at most N recent turns verbatim

class HierarchicalMemory:
    def __init__(self):
        self.short_term = []            # recent turns, kept verbatim
        self.long_term = VectorStore()  # important facts as embeddings

    def add(self, turn):
        self.short_term.append(turn)
        if len(self.short_term) > MAX_SHORT_TERM:
            # Compress the oldest turn and move it to long-term memory
            summary = self.summarize(self.short_term.pop(0))
            self.long_term.add(summary)

    def get_context(self, query):
        # Verbatim short-term turns + semantically relevant long-term facts
        return self.short_term + self.long_term.search(query)
```

Production Considerations
1. Cost Control
LLM calls are the main cost source.
- Set max_tokens per request
- Monitor token consumption per user session
- Consider smaller models for simple tasks
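A sketch of per-session token accounting; the per-1k prices are illustrative placeholders, not current list prices:

```python
class CostTracker:
    # Illustrative input prices per 1k tokens, NOT current list prices
    PRICE_PER_1K = {"gpt-4o": 0.005, "gpt-4o-mini": 0.00015}

    def __init__(self):
        self.usage = {}  # session_id -> total tokens consumed

    def record(self, session_id, model, tokens):
        # Accumulate usage and return the cost of this call
        self.usage[session_id] = self.usage.get(session_id, 0) + tokens
        return tokens / 1000 * self.PRICE_PER_1K[model]

    def session_tokens(self, session_id):
        return self.usage.get(session_id, 0)
```

Feeding `session_tokens` into an alerting threshold is usually enough to catch runaway loops before the bill does.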
2. Error Handling
Model output is unpredictable. Production must handle:
- Invalid JSON
- Tool call params not matching schema
- Tool execution timeout/failure
- Model “hallucinating” tool names that don’t exist
```python
from json import JSONDecodeError

def safe_execute(tool_call, tools, max_retries=2):
    for attempt in range(max_retries):
        try:
            return tools[tool_call.name].execute(**tool_call.args)
        except KeyError:
            # Model hallucinated a tool name that doesn't exist
            return f"Error: Tool '{tool_call.name}' not found"
        except JSONDecodeError:
            # Malformed arguments: retry, then give up
            if attempt == max_retries - 1:
                return "Error: Invalid tool arguments"
            continue
```

3. Observability
Agent execution chain must be traceable.
- Input/output of every tool call
- Duration of each step
- Source of final output (which tool’s result)
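All three can be captured with a simple tracing wrapper around each tool; the in-memory `traces` list stands in for a real tracing backend:

```python
import time

traces = []  # in production, replace with a real tracing/logging backend

def traced(name, fn):
    def wrapper(**kwargs):
        start = time.perf_counter()
        result = fn(**kwargs)
        # Record input, output, and duration for every tool call
        traces.append({
            "tool": name,
            "input": kwargs,
            "output": result,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper
```

Because the trace stores the output of each tool, answering “which tool did the final response come from?” becomes a lookup instead of guesswork.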
Conclusion
Core Agent architecture is Model + Harness + Memory:
- Model provides reasoning
- Harness bridges Model and real world
- Memory provides persistence and context
Which pattern to choose depends on task complexity:
- Simple tasks → Tool Calling Agent
- Needs reasoning → ReAct
- Complex multi-step → Plan-and-Execute
- Ultra complex → Multi-Agent
No silver bullet. Production systems often combine patterns, even switching dynamically based on task type.