AI Agent Architecture Patterns: Model + Harness + Memory

2024-08-15

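Rethinking the Agent

The word "Agent" has been worn out. Everyone claims to have built one, but look closely at the implementation and roughly 90% of them boil down to: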

        
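from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(user_input):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content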

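This is not an Agent. It is an API wrapper.

A real Agent has several characteristics:

Tool Use: it can call tools, not just generate text
Long-term Memory: it retains memory across sessions
Planning/Reasoning: it can decompose a task and plan the steps
Autonomous Action: it makes its own decisions within constraints

Only when these pieces are combined does the result deserve the name Agent.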

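The Three Components

1. Model: the reasoning engine

The Model is the Agent's "brain". It interprets input, plans, and generates output.

Selection criteria:

Capability: reasoning ability, context length
Cost: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro
Tool calling: some models support tool calling natively (e.g. GPT-4o, Claude 3.5 Sonnet)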

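2. Harness: the execution environment

The Harness is the bridge between the Model and the real world. It is responsible for:

Managing tool definitions and invocations
Processing tool results and injecting them into the context
Controlling the token budget (to avoid context overflow)
Error handling and retry logic

The quality of the Harness directly determines how stable the Agent is.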
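
As a rough illustration of the token-budget responsibility, the sketch below drops the oldest non-system messages until the conversation fits a budget. The names `estimate_tokens` and `trim_to_budget`, and the "about four characters per token" heuristic, are assumptions made for this example; a real Harness would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def trim_to_budget(messages, max_tokens):
    """Drop the oldest non-system messages until the total fits max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)

    # Always keep the most recent message so the Model sees the current request.
    while len(rest) > 1 and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "What is the weather in Beijing?"},
]
trimmed = trim_to_budget(history, max_tokens=50)
```

Here the two long early turns are dropped and only the system prompt and the latest user message remain. Dropping messages one at a time is the simplest policy; a Harness that uses tool calling would also need to remove a tool result together with the assistant turn that requested it.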

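3. Memory: knowledge and state

Memory comes in several layers:

Type	Duration	Purpose
Context	Single session	The current conversation window
Short-term	24-48h	Recent interaction history
Long-term	Long-lived	Facts and preferences that persist across sessions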
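
To make the short-term row of the table concrete, here is a minimal sketch of a store that forgets entries after a time-to-live. The class name `ShortTermMemory`, the 48-hour default, and the injectable `clock` parameter are all assumptions chosen for illustration, not a prescribed design.

```python
import time


class ShortTermMemory:
    """Keeps recent interactions and forgets them after a time-to-live."""

    def __init__(self, ttl_seconds=48 * 3600, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._items = []  # list of (timestamp, text)

    def add(self, text):
        self._items.append((self.clock(), text))

    def recent(self):
        # Drop expired entries, then return what is left in insertion order.
        cutoff = self.clock() - self.ttl_seconds
        self._items = [(ts, text) for ts, text in self._items if ts >= cutoff]
        return [text for _, text in self._items]


now = [0.0]  # fake clock, in seconds
memory = ShortTermMemory(ttl_seconds=48 * 3600, clock=lambda: now[0])
memory.add("user prefers metric units")
now[0] = 24 * 3600
memory.add("user asked about Beijing weather")
now[0] = 60 * 3600  # 60h after the first entry, 36h after the second
remaining = memory.recent()
```

After 60 hours only the second entry survives. In a fuller design, entries would typically be summarized into long-term memory before they expire rather than simply discarded.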

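Common Architecture Patterns

Pattern 1: Tool Calling Agent (the simplest)

The Model supports tool calling natively; the Harness registers the tools and executes them.

User → Harness: "Check the weather for me"
       ↓
    Model: needs to call weather_tool
       ↓
    Harness: executes the tool, returns the result
       ↓
    Model: generates the final reply
       ↓
    Harness → User

Best for: single-step tool calls that need no complex planning.

Code example: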
    
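class ToolCallingAgent:
    def __init__(self, model, tools):
        # `model` is any client exposing chat(messages, tools=...); each tool
        # is assumed to expose .name, .schema, and .execute(**kwargs).
        self.model = model
        self.tools = {t.name: t for t in tools}
        self.tool_schemas = [t.schema for t in tools]

    def run(self, user_input, max_turns=10):
        messages = [{"role": "user", "content": user_input}]

        for _ in range(max_turns):
            response = self.model.chat(messages, tools=self.tool_schemas)

            if response.finish_reason != "tool_calls":
                return response.content

            # Record the assistant turn that requested the tools; without it
            # the tool results that follow have nothing to refer back to.
            messages.append({
                "role": "assistant",
                "content": response.content,
                "tool_calls": response.tool_calls,
            })
            # Execute every requested tool and feed the results back.
            for tool_call in response.tool_calls:
                result = self.tools[tool_call.name].execute(**tool_call.args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(result),
                })

        raise RuntimeError(f"No final answer after {max_turns} turns")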

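Pattern 2: ReAct Agent (planning + execution)

ReAct = Reasoning + Acting. The Model emits a thought, then an action, observes the result, and thinks again.

Thought: The user wants the weather, so I need to call the weather API
Action: weather_tool(location=Beijing)
Observation: Sunny, 25 degrees
Thought: The user is asking what to wear; combine that with the weather and give advice
Action: null (reply directly)

Best for: multi-step reasoning that depends on feedback from the environment.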
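
The Thought/Action/Observation cycle can be sketched as a small text-parsing loop. This is one possible shape, not a reference implementation: the `Final Answer:` line used to end the loop, the `key=value` argument syntax, and the `scripted_model` stand-in for a real LLM are all assumptions made for this example.

```python
import re

ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.*)$", re.MULTILINE)


def parse_args(raw):
    # Parses "key=value, key2=value2" into a dict of strings.
    args = {}
    for part in filter(None, (p.strip() for p in raw.split(","))):
        key, _, value = part.partition("=")
        args[key.strip()] = value.strip()
    return args


def react_loop(model, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = model(transcript)
        transcript += output + "\n"
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(output)
        if action is None:
            observation = "Error: no valid Action found"
        elif action.group(1) not in tools:
            observation = f"Error: unknown tool '{action.group(1)}'"
        else:
            observation = tools[action.group(1)](**parse_args(action.group(2)))
        # The observation is appended so the next Thought can build on it.
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached"


def scripted_model(transcript):
    # Stand-in for an LLM: decides based on whether an Observation exists yet.
    if "Observation:" not in transcript:
        return "Thought: I need the weather.\nAction: weather_tool(location=Beijing)"
    return "Thought: Sunny and warm.\nFinal Answer: Sunny, 25 degrees. A T-shirt is fine."


tools = {"weather_tool": lambda location: f"{location}: sunny, 25 degrees"}
answer = react_loop(scripted_model, tools, "What should I wear in Beijing today?")
```

Errors are returned to the Model as observations instead of being raised, which gives it a chance to correct a malformed action on the next step.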

Pattern 3: Plan-and-Execute (planning separated from execution)

The Model first generates a complete plan, and the steps are then executed one by one.
    
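class PlanAndExecuteAgent:
    def __init__(self, model):
        self.model = model

    def plan(self, task):
        # The Model generates a step plan, e.g. ["step1", "step2", "step3"].
        return self.model.plan(task)

    def execute(self, task, plan, max_replans=3):
        # execute_step() and should_replan() are left to the concrete Agent.
        remaining = list(plan)
        results = []
        while remaining:
            step = remaining.pop(0)
            result = self.execute_step(step)
            results.append(result)
            # Check the result and decide whether to adjust the remaining plan.
            # A queue is used because reassigning the list inside a for loop
            # would not change what the loop iterates over.
            if max_replans > 0 and self.should_replan(step, result):
                max_replans -= 1
                # Replan from the task plus what has been done so far.
                remaining = list(self.plan(f"{task}\nCompleted so far: {results}"))
        return results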

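Best for: complex tasks that need global planning.

Pattern 4: Multi-Agent Orchestration

Several Agents collaborate, each focused on one domain.

User → Orchestrator Agent
              ├── Code Agent (writes code)
              ├── Research Agent (looks things up)
              └── Review Agent (reviews)

Best for: complex tasks that need several kinds of expertise working together.

Key design question: how do the Orchestrator and the sub-Agents communicate?

Option A: a shared message queue
Option B: Hierarchical (the Orchestrator calls the sub-Agents directly)
Option C: a voting/consensus mechanism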
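
Two of these options are small enough to sketch. The code below shows a hierarchical orchestrator that calls sub-agents directly, plus a simple majority vote. The `Orchestrator` class, the fixed research → code → review pipeline, and the lambda stand-ins for real agents are assumptions made for illustration only.

```python
from collections import Counter


class Orchestrator:
    """Hierarchical option: the orchestrator calls sub-agents directly."""

    def __init__(self, agents):
        self.agents = agents  # maps a role name to a callable(task) -> str

    def dispatch(self, role, task):
        if role not in self.agents:
            raise ValueError(f"No agent registered for role '{role}'")
        return self.agents[role](task)

    def run(self, task):
        # Fixed pipeline for illustration: research, then code, then review.
        notes = self.dispatch("research", task)
        code = self.dispatch("code", f"{task}\nNotes: {notes}")
        return self.dispatch("review", code)


def majority_vote(answers):
    """Voting option: return the most common answer; ties go to the earliest."""
    return Counter(answers).most_common(1)[0][0]


agents = {
    "research": lambda task: "use the requests library",
    "code": lambda task: "import requests",
    "review": lambda code: f"approved: {code}",
}
result = Orchestrator(agents).run("Fetch a web page")
winner = majority_vote(["A", "B", "A"])
```

Direct calls keep control flow easy to trace but couple the Orchestrator to every sub-Agent; a shared queue trades that simplicity for looser coupling.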

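Memory Implementation: Pitfalls and Trade-offs

Memory is where things most easily go wrong.

The context window is a scarce resource

GPT-4o has a 128k context and Claude 3.5 has 200k. That sounds large, but:

1k lines of code ≈ 4k tokens (a rough figure; it varies with line length and tokenizer)
A mid-sized project's codebase can run to 10k-50k lines
Conversation history accumulates quickly

A common mistake: stuffing content into the context without limit until the token count blows up.

Approaches

Semantic Search: retrieve only the relevant content into the context
Summarization: compress the history periodically
Layered Memory: keep only the "important" facts in short-term
Codebase Index: build a separate vector index for the codebase so it does not occupy conversation context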
    
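class HierarchicalMemory:
    def __init__(self, vector_store, summarize, max_short_term=20):
        # vector_store needs add(text) and search(query); summarize maps a
        # turn to a short string. Both are injected, not defined here.
        # The default of 20 turns is an arbitrary placeholder.
        self.short_term = []  # the most recent N turns
        self.long_term = vector_store  # vector storage for important facts
        self.summarize = summarize
        self.max_short_term = max_short_term

    def add(self, turn):
        self.short_term.append(turn)
        if len(self.short_term) > self.max_short_term:
            # Compress the oldest turn and move it to long-term storage.
            summary = self.summarize(self.short_term.pop(0))
            self.long_term.add(summary)

    def get_context(self, query):
        # All short-term turns plus semantically retrieved long-term facts.
        return self.short_term + self.long_term.search(query)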
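
For experimenting with the memory class above without a real vector database, a toy store with the same add/search interface can be enough. In this sketch, word counts and cosine similarity stand in for a real embedding model; the class name `ToyVectorStore` and the `top_k` parameter are assumptions made for this example.

```python
import math
import re
from collections import Counter


class ToyVectorStore:
    """In-memory store exposing add(text) and search(query)."""

    def __init__(self):
        self._docs = []  # list of (text, term-count vector)

    @staticmethod
    def _embed(text):
        # Assumption: word counts stand in for a real embedding model.
        return Counter(re.findall(r"\w+", text.lower()))

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values())
        )
        return dot / norm if norm else 0.0

    def add(self, text):
        self._docs.append((text, self._embed(text)))

    def search(self, query, top_k=3):
        q = self._embed(query)
        scored = [(self._cosine(q, vec), text) for text, vec in self._docs]
        # Discard documents that share no words with the query.
        scored = [(s, t) for s, t in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = ToyVectorStore()
store.add("The user prefers Python over Java")
store.add("The deployment target is Kubernetes")
hits = store.search("Which language does the user prefer?", top_k=1)
```

Word overlap misses synonyms and paraphrases, which is exactly what real embeddings are for; the point here is only the shape of the interface.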

Production Considerations

1. Cost Control

LLM calls are the main source of cost.

Set max_tokens on every request
Monitor token consumption per user session
Consider using a smaller model for simple tasks
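
The second and third points can be sketched as a per-session budget tracker and a naive model router. Everything here is hypothetical: the class `SessionBudget`, the model names `small-model` and `large-model`, and the length-based routing rule are placeholders, not recommendations.

```python
class SessionBudget:
    """Tracks token usage per session and refuses calls past a limit."""

    def __init__(self, max_tokens_per_session):
        self.max_tokens = max_tokens_per_session
        self.used = {}  # session_id -> tokens consumed so far

    def record(self, session_id, prompt_tokens, completion_tokens):
        self.used[session_id] = (
            self.used.get(session_id, 0) + prompt_tokens + completion_tokens
        )

    def remaining(self, session_id):
        return max(0, self.max_tokens - self.used.get(session_id, 0))

    def allow(self, session_id, requested_tokens):
        return requested_tokens <= self.remaining(session_id)


def pick_model(user_input, small="small-model", large="large-model", threshold=200):
    # Assumption: input length is a stand-in for task difficulty. A real router
    # might use a classifier or let the small model escalate when unsure.
    return small if len(user_input) < threshold else large


budget = SessionBudget(max_tokens_per_session=1000)
budget.record("s1", prompt_tokens=300, completion_tokens=500)
can_small_call = budget.allow("s1", 150)
can_large_call = budget.allow("s1", 250)
```

The token counts passed to `record` would normally come from the usage figures that the model provider returns with each response.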

2. Error Handling

Model output is unpredictable. In production you need to handle:

Invalid JSON
Tool call arguments that do not match the schema
Tool execution timeouts and failures
The Model hallucinating tool names that do not exist
    
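import json

def safe_execute(tool_call, tools, max_retries=2):
    # Hallucinated tool name: report it so the Model can correct itself.
    tool = tools.get(tool_call.name)
    if tool is None:
        return f"Error: Tool '{tool_call.name}' not found"
    # Malformed arguments: retrying the same payload cannot help, so fail fast.
    # args may arrive as a raw JSON string or as an already-parsed dict.
    try:
        args = tool_call.args
        if isinstance(args, str):
            args = json.loads(args)
    except json.JSONDecodeError:
        return "Error: Invalid tool arguments"
    # Timeouts and other execution failures may be transient, so retry those.
    attempts = max(1, max_retries)
    for attempt in range(attempts):
        try:
            return tool.execute(**args)
        except Exception as exc:
            if attempt == attempts - 1:
                return f"Error: Tool '{tool_call.name}' failed: {exc}"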
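
Arguments that parse as JSON can still fail to match the tool's schema. A minimal check might look like the sketch below. The schema format used here, `{"name": (type, required)}`, is a simplified, hypothetical one invented for this example; production code would more likely validate against the JSON Schema that was sent to the model.

```python
def validate_args(args, schema):
    """Return a list of problems; an empty list means the arguments are valid.

    `schema` is a simplified, hypothetical format: {"name": (type, required)}.
    """
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required argument '{name}'")
        elif not isinstance(args[name], expected_type):
            problems.append(
                f"argument '{name}' should be {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument '{name}'")
    return problems


weather_schema = {"location": (str, True), "days": (int, False)}
ok = validate_args({"location": "Beijing"}, weather_schema)
bad = validate_args({"days": "3", "city": "Beijing"}, weather_schema)
```

Returning the list of problems as the tool result, rather than raising, lets the Model see exactly what was wrong and retry with corrected arguments.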

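3. Observability

The Agent's execution chain must be traceable.

Inputs and outputs of every tool call
Time spent in each step
Provenance of the final output (which tool's result it came from)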
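
One lightweight way to capture the first two items is a decorator around each tool. The sketch below records into an in-memory list; the names `TRACE` and `traced` are assumptions for this example, and a real deployment would export the records to a logging or tracing backend instead.

```python
import functools
import time

TRACE = []  # in-memory trace; a real system would export to a tracing backend


def traced(tool_name):
    """Decorator that records input, output, error, and duration of a tool call."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            start = time.perf_counter()
            record = {"tool": tool_name, "input": kwargs, "output": None, "error": None}
            try:
                record["output"] = fn(**kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so every call is recorded.
                record["duration_s"] = time.perf_counter() - start
                TRACE.append(record)

        return wrapper

    return decorator


@traced("weather_tool")
def weather_tool(location):
    return f"{location}: sunny, 25 degrees"


weather_tool(location="Beijing")
```

Because each record carries the tool name and output, the final answer can be traced back to the tool results that fed it by inspecting the trace for that request.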

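Summary

The core of AI Agent architecture is the combination of Model + Harness + Memory.

The Model provides reasoning ability
The Harness bridges the Model and the real world
Memory provides persistence and context

Which pattern to choose depends on task complexity:

Simple tasks → Tool Calling Agent
Reasoning required → ReAct
Complex, multi-step → Plan-and-Execute
Very complex → Multi-Agent

There is no silver bullet. In production you often need to combine patterns, or even switch between them dynamically based on the task type.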
