AI Application Testing Strategy: How to Test AI-Generated Code

Simi, filed under AI

2025-12-26

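You had an AI write a feature, let it write the tests as well, ran them, and everything came up green. After release, users report behavior that does not match expectations. What happened? AI-generated tests and AI-generated implementation code share the same set of assumptions: if the requirement was misunderstood, the tests and the implementation are wrong together, and they pass together.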

Core Challenge: The Testability of AI Code

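Testing AI code, whether it was generated by an AI or calls an LLM at runtime, raises three challenges that ordinary code does not: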

Challenge 1: Non-determinism

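LLM output is sampled, so it is inherently random. Code that calls an LLM can therefore return different outputs for the same input, and the traditional assertion that a result equals one fixed expected value stops working.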
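One way to cope, sketched below, is to assert properties that every acceptable output must satisfy rather than one exact string. `fake_llm_summary` is a hypothetical stand-in for a real LLM call, and the three properties checked are illustrative assumptions, not a complete quality bar.

```python
import random

def fake_llm_summary(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: the wording varies between calls."""
    templates = ["Summary: {head}", "In short, {head}", "Key point: {head}"]
    head = text.split(".")[0].strip()
    return rng.choice(templates).format(head=head)

def check_summary_properties(source: str, summary: str) -> None:
    """Assert properties that hold for every acceptable output."""
    assert summary.strip(), "summary must not be empty"
    assert len(summary) <= len(source) + 20, "summary must not balloon"
    assert "refund" in summary.lower(), "summary must keep the key term"

source = "Refund requests are processed within 5 days. Contact support for details."
rng = random.Random()  # deliberately unseeded: outputs differ from run to run

# Collect the distinct outputs of 50 calls and check each one.
outputs = {fake_llm_summary(source, rng) for _ in range(50)}
for out in outputs:
    check_summary_properties(source, out)
```

An exact-match assertion would fail intermittently here, while the property checks pass for every variant the stand-in can produce.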

Challenge 2: Requirement mismatch

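An AI can produce code that is internally consistent but does not meet the requirement. Take "calculate the discount": the AI may implement a discount algorithm that is mathematically sound, yet whose rules differ from the ones in the business documentation.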
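A small sketch of how such a mismatch shows up once the examples come from the business document rather than from the code. The rule used here (coupon and member discounts add up, capped at 30%) and both functions are hypothetical, invented only for illustration.

```python
# Hypothetical business rule: coupon and member discounts ADD UP,
# and the combined discount is capped at 30%.

def price_ai(price: float, coupon_pct: float, member_pct: float) -> float:
    """Plausible AI version: compounds the two discounts (self-consistent, but not the rule)."""
    return price * (1 - coupon_pct / 100) * (1 - member_pct / 100)

def price_spec(price: float, coupon_pct: float, member_pct: float) -> float:
    """Version that follows the hypothetical rule."""
    total = min(coupon_pct + member_pct, 30)
    return price * (1 - total / 100)

# Worked examples copied from the (hypothetical) business document.
SPEC_EXAMPLES = [
    # (price, coupon %, member %, expected price)
    (100.0, 10, 10, 80.0),
    (100.0, 20, 20, 70.0),
]

def failing_examples(fn):
    """Return the spec examples that fn gets wrong (within a float tolerance)."""
    return [ex for ex in SPEC_EXAMPLES if abs(fn(*ex[:3]) - ex[3]) > 1e-9]

print(len(failing_examples(price_ai)))    # both examples fail
print(len(failing_examples(price_spec)))  # none fail
```

The compounding version would pass any test derived from its own logic; only examples taken from the requirement expose the difference.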

Challenge 3: Shared blind spots

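When one AI generates both the implementation and the tests, the two share the same misunderstanding. Suppose the AI wrongly believes an empty array is invalid input: the implementation it writes rejects empty arrays, and the test it writes asserts that an empty array raises an exception. The test passes, but the requirement says an empty array should return an empty result.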
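The empty-array case can be reproduced in a few lines. This is a minimal sketch: `top_scores_ai` and both test functions are hypothetical names, and the requirement (empty input returns an empty result) is the one described above.

```python
def top_scores_ai(scores: list[int], k: int) -> list[int]:
    """Hypothetical AI-written implementation that wrongly treats [] as invalid."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sorted(scores, reverse=True)[:k]

def ai_written_test() -> bool:
    """Mirrors the implementation's misunderstanding, so it passes."""
    try:
        top_scores_ai([], 3)
    except ValueError:
        return True
    return False

def requirement_derived_test() -> bool:
    """Written from the requirement: an empty input yields an empty result."""
    try:
        return top_scores_ai([], 3) == []
    except ValueError:
        return False

print("AI-written test passes:", ai_written_test())
print("Requirement-derived test passes:", requirement_derived_test())
```

Only the test written independently of the implementation, from the requirement itself, reports the defect.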

Strategy 1: Boundary Tests + Property-Based Tests

Do not test only the happy path. Test the code's invariants: the properties that must hold no matter what the input is.

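Hypothesis is the most established property-based testing library for Python, and it generates inputs, including boundary values, automatically: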

from hypothesis import given, strategies as st, settings
import pytest

def calculate_discounted_price(price: float, discount_pct: float) -> float:
    if price < 0:
        raise ValueError("Price cannot be negative")
    discount = min(max(discount_pct, 0), 100)  # clamp to 0-100
    return price * (1 - discount / 100)

# Property-based test: define the invariants
@given(
    price=st.floats(min_value=0, max_value=10000, allow_nan=False),
    discount=st.floats(min_value=0, max_value=100, allow_nan=False),
)
@settings(max_examples=500)
def test_discount_invariants(price, discount):
    result = calculate_discounted_price(price, discount)

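    # Invariant 1: the result never exceeds the original price
    assert result <= price + 1e-9

    # Invariant 2: the result is never negative
    assert result >= 0

    # Invariant 3: a 0% discount leaves the price unchanged
    if discount == 0:
        assert abs(result - price) < 1e-9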

# Edge cases: test boundary conditions explicitly
@pytest.mark.parametrize("price,discount,expected", [
    (0, 10, 0),      # zero price
    (100, 0, 100),   # zero discount
    (100, 100, 0),   # 100% discount
    (100, 150, 0),   # discount above 100% (should be clamped)
])
def test_discount_edge_cases(price, discount, expected):
    assert calculate_discounted_price(price, discount) == expected

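Hypothesis deliberately probes boundary values inside the ranges you give it (0, the range endpoints, very small and very large values) and shrinks any failing input to a minimal example, which usually covers more ground than hand-written edge cases. In the test above NaN is excluded by allow_nan=False and the inputs are bounded, so negative prices and out-of-range discounts still need explicit cases such as the parametrized ones.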

Strategy 2: Mock LLM Responses

When testing code that calls an LLM, do not send real API requests: they are expensive, slow, and give unstable results. Replace them with mocks:

import pytest
import httpx
import respx

# Code under test
async def summarize_text(text: str, client: httpx.AsyncClient) -> str:
    response = await client.post(
        "https://api.openai.com/v1/chat/completions",
        json={
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
        },
    )
    return response.json()["choices"][0]["message"]["content"]

# Stub the HTTP endpoint with respx (the asyncio marker requires pytest-asyncio)
@pytest.mark.asyncio
@respx.mock
async def test_summarize_text():
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={"choices": [{"message": {"content": "This is a summary."}}]},
        )
    )

    async with httpx.AsyncClient() as client:
        result = await summarize_text("Long text here...", client)
        assert result == "This is a summary."

# Manage mock state with a pytest fixture (the mocker fixture comes from pytest-mock)
@pytest.fixture
def mock_llm_response(mocker):
    mock = mocker.patch("myapp.llm_client.complete")
    mock.return_value = {"content": "mocked response", "usage": {"tokens": 100}}
    return mock

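Record real responses: on the first run, call the real API and save the response as a fixture file; later test runs replay the recorded responses. This gives you realistic data without depending on an external service.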
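The record-then-replay idea can be sketched with the standard library alone. `replay_or_record`, the hashed file naming, and `fake_api` are assumptions made for this illustration; a real project would pass in its actual client call and keep the fixture directory under version control.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

def replay_or_record(prompt: str, call_api: Callable[[str], dict], fixture_dir: Path) -> dict:
    """Return the recorded response if one exists; otherwise call the API once and save it."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    path = fixture_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = call_api(prompt)
    path.write_text(json.dumps(response, ensure_ascii=False, indent=2), encoding="utf-8")
    return response

# Demo with a hypothetical stand-in for the real client.
calls = []

def fake_api(prompt: str) -> dict:
    calls.append(prompt)
    return {"content": f"summary of: {prompt}"}

with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp)
    first = replay_or_record("Long text here...", fake_api, fixtures)   # records
    second = replay_or_record("Long text here...", fake_api, fixtures)  # replays

print(first == second, len(calls))  # prints: True 1
```

Keying the fixture on a hash of the prompt means a changed prompt triggers a fresh recording instead of silently replaying a stale response.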

Strategy 3: Snapshot Testing (Golden File Testing)

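For content that is hard to assert on precisely, such as LLM output, use snapshot tests: capture one known-good output, then compare later runs against it.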

# pip install pytest-snapshot

def generate_report(data: dict) -> str:
    # Some AI logic builds the report text (elided here)
    ...
    return report_text

def test_report_format(snapshot):
    data = {"sales": 1000, "period": "2024-Q4"}
    report = generate_report(data)

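    # First run: `pytest --snapshot-update` writes the snapshot file
    # Later runs: the output is compared against the stored snapshot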
    snapshot.assert_match(report, "report_q4.txt")

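When to update a snapshot:

✅ The product requirement changed, so the output format is supposed to change
✅ A bug was fixed, and the new output is more correct than the old snapshot
❌ The test fails for an unknown reason: investigate first, do not update blindly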

Commit snapshot files to git so that output changes are visible during PR review.
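To make the mechanism concrete, here is a standard-library sketch of a golden-file check. `check_golden` and its `update` flag are hypothetical and are not the pytest-snapshot API; the point is that a mismatch produces a readable diff and that rewriting the golden file is always an explicit decision.

```python
import difflib
import tempfile
from pathlib import Path

def check_golden(actual: str, golden_path: Path, update: bool = False) -> list[str]:
    """Compare actual with the golden file; return unified-diff lines (empty means match)."""
    if update:
        # Explicit opt-in: overwrite the golden file with the new output.
        golden_path.write_text(actual, encoding="utf-8")
        return []
    # Raises FileNotFoundError if no golden file has been recorded yet.
    expected = golden_path.read_text(encoding="utf-8")
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="golden", tofile="actual", lineterm="",
    ))

with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "report_q4.txt"
    check_golden("Sales: 1000\nPeriod: 2024-Q4\n", golden, update=True)  # record
    same = check_golden("Sales: 1000\nPeriod: 2024-Q4\n", golden)        # matches
    changed = check_golden("Sales: 1200\nPeriod: 2024-Q4\n", golden)     # differs

print(same)
print("\n".join(changed))
```

The diff lines printed for the changed output are the same kind of change a reviewer sees in a pull request when the committed snapshot file is updated.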

Strategy 4: Contract Testing

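AI-generated code can break interface contracts (function signatures, return types, field structure). Protect them with the type system and schema validation: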

from typing import TypedDict

class SearchResult(TypedDict, total=True):
    id: str
    title: str
    score: float
    metadata: dict

# The AI-generated function must conform to this interface
def search_documents(query: str, top_k: int = 5) -> list[SearchResult]:
    ...  # AI-generated implementation

# Contract test
def test_search_returns_correct_schema():
    results = search_documents("test query")
    assert isinstance(results, list)
    for r in results:
        assert "id" in r and isinstance(r["id"], str)
        assert "title" in r and isinstance(r["title"], str)
        assert "score" in r and isinstance(r["score"], float)
        assert 0.0 <= r["score"] <= 1.0

Combine this with mypy's strict mode and use type checking as a compliance gate in CI:

mypy --strict src/ --ignore-missing-imports

For LLM tool calls, validate the output against a JSON Schema:

import jsonschema

TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool_name", "parameters"],
    "properties": {
        "tool_name": {"type": "string", "enum": ["search", "calculate", "lookup"]},
        "parameters": {"type": "object"},
    },
}

def test_tool_call_schema():
    tool_call = get_llm_tool_call(query="search for X")
    jsonschema.validate(tool_call, TOOL_CALL_SCHEMA)  # raises ValidationError if the schema is violated

Strategy 5: Human Review Gate

AI-generated tests must pass human review before they count as trusted tests.

Flag AI-generated files in CI. The step below assumes such files carry an "# AI-generated" comment:

# .github/workflows/ci.yml
- name: Check for AI-generated test files
  run: |
    ai_tests=$(grep -rl "# AI-generated" tests/ || true)
    if [ -n "$ai_tests" ]; then
      echo "::warning::AI-generated test files detected, require human review:"
      echo "$ai_tests"
    fi

Red-team testing: actively try to make code pass the tests without meeting the requirement:

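1. Read through all the tests and look for gaps between what the tests say and what the requirement says.
2. Write a "cheating implementation" that does only what the tests demand, not what the requirement demands.
3. If the cheating implementation passes every test, the tests do not cover enough.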

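Example: if a test only checks that the function returns a non-empty string, a cheating implementation can return "a". That shows the test does not cover the quality of the content.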
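The example above can be run as a quick experiment. Everything in this sketch is hypothetical: the two test functions, the cheating and honest implementations, and the content rule used by the stronger test (keep the key term, do not exceed the source length).

```python
TEXT = "Refund requests are processed within 5 days."

def weak_test(summarize) -> bool:
    """The test under review: only checks for a non-empty string."""
    result = summarize(TEXT)
    return isinstance(result, str) and len(result) > 0

def stronger_test(summarize) -> bool:
    """Adds content checks derived from the (hypothetical) requirement."""
    result = summarize(TEXT)
    return (
        isinstance(result, str)
        and "refund" in result.lower()   # keeps the key term
        and len(result) <= len(TEXT)     # no longer than the source
    )

def cheating_summarize(text: str) -> str:
    """Does only what the weak test demands."""
    return "a"

def honest_summarize(text: str) -> str:
    """Stand-in for a real implementation: keeps the first sentence."""
    return text.split(".")[0] + "."

print(weak_test(cheating_summarize))      # True: the weak test is fooled
print(stronger_test(cheating_summarize))  # False: the cheat is caught
print(stronger_test(honest_summarize))    # True
```

If a cheat like this survives the whole suite, the gap it slipped through is the next test to write.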
