AI Application Testing Strategy: How to Test AI-Generated Code

Simi included in AI

2025-12-26 962 words 5 minutes

You used AI to write a feature and let it write the tests too. Everything passes green. Then in production, users report unexpected behavior. What happened? AI-generated tests and AI-generated implementation share the same assumptions — if the requirement was misunderstood, both the code and the tests are wrong together, and they pass together.

Core Challenge: Testability of AI Code

Testing AI-generated code poses three challenges that testing ordinary code does not:

Challenge 1: Non-determinism

LLMs are stochastic, so code that calls an LLM may produce different outputs for the same input. The traditional “assert the result equals a fixed expected value” pattern breaks down.
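A minimal sketch of the problem, using a hypothetical fake_llm_summarize function that stands in for a real model call (it simply picks one of several phrasings at random). An exact-match assertion would pass or fail depending on the draw, while assertions on properties of the output hold for every draw. The specific properties checked here are assumptions chosen for illustration.

```python
import random

# Hypothetical stand-in for an LLM call: same input, varying output.
def fake_llm_summarize(text: str, rng: random.Random) -> str:
    templates = ["Summary: {t}", "In short: {t}", "TL;DR: {t}"]
    return rng.choice(templates).format(t=text[:20])

def check_summary_properties(text: str, summary: str) -> bool:
    """Properties that should hold for any acceptable output."""
    return (
        isinstance(summary, str)
        and 0 < len(summary) <= len(text) + 20  # bounded length
        and text[:20] in summary                # key content is preserved
    )

text = "Quarterly sales rose 12 percent"
# Collect the distinct outputs produced across 20 differently seeded runs.
outputs = {fake_llm_summarize(text, random.Random(seed)) for seed in range(20)}

# Brittle: an assertion like `output == "Summary: ..."` holds only for some draws.
print("distinct outputs:", len(outputs))
# Robust: the property check holds for every draw.
print("all pass property check:", all(check_summary_properties(text, o) for o in outputs))
```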

Challenge 2: Requirement Mismatch

AI can generate logically self-consistent code that doesn’t match requirements. For “calculate discount”, the AI may implement a mathematically correct algorithm — but with discount rules that don’t match the business document.

Challenge 3: Shared Blind Spots

When you ask AI to generate both implementation and tests simultaneously, they share the same misunderstanding. If AI thinks “empty array is an illegal input”, its implementation rejects empty arrays, and its tests assert that empty arrays should throw — tests pass, but the requirement was that empty arrays should return an empty result.
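The empty-array scenario can be sketched as follows. The function top_scores_ai and both tests are hypothetical, written only to illustrate the failure mode: the AI-written test agrees with the AI-written code, and only a test derived from the requirement text exposes the bug.

```python
# Assumed requirement for this sketch: an empty list yields an empty result.

def top_scores_ai(scores: list[int], n: int) -> list[int]:
    """AI-written implementation that wrongly treats [] as illegal input."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sorted(scores, reverse=True)[:n]

def ai_written_test() -> bool:
    """AI-written test: encodes the same misunderstanding, so it passes."""
    try:
        top_scores_ai([], 3)
    except ValueError:
        return True
    return False

def requirement_derived_test() -> bool:
    """Test written from the requirement: fails, which exposes the bug."""
    try:
        return top_scores_ai([], 3) == []
    except ValueError:
        return False

print("AI-written test passes:", ai_written_test())
print("Requirement-derived test passes:", requirement_derived_test())
```

Writing at least some tests directly from the requirement document, before looking at the generated code, is one way to break the shared assumption.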

Strategy 1: Boundary Testing + Property Testing

Don’t just test the happy path. Test invariants — properties that must hold regardless of input.

Hypothesis is a widely used property-based testing library for Python. It generates inputs automatically, including boundary values:

from hypothesis import given, strategies as st, settings
import pytest

def calculate_discounted_price(price: float, discount_pct: float) -> float:
    if price < 0:
        raise ValueError("Price cannot be negative")
    discount = min(max(discount_pct, 0), 100)  # clamp to 0-100
    return price * (1 - discount / 100)

@given(
    price=st.floats(min_value=0, max_value=10000, allow_nan=False),
    discount=st.floats(min_value=0, max_value=100, allow_nan=False),
)
@settings(max_examples=500)
def test_discount_invariants(price, discount):
    result = calculate_discounted_price(price, discount)

    # Invariant 1: result cannot exceed original price
    assert result <= price + 1e-9

    # Invariant 2: result cannot be negative
    assert result >= 0

    # Invariant 3: zero discount means price unchanged
    if discount == 0:
        assert abs(result - price) < 1e-9

@pytest.mark.parametrize("price,discount,expected", [
    (0, 10, 0),      # zero price
    (100, 0, 100),   # zero discount
    (100, 100, 0),   # full discount
    (100, 150, 0),   # discount over 100% (should be clamped)
])
def test_discount_edge_cases(price, discount, expected):
    # pytest.approx avoids brittle exact equality on floats
    assert calculate_discounted_price(price, discount) == pytest.approx(expected)

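Hypothesis draws many inputs from the declared ranges, including boundary values such as 0 and the range limits, and shrinks any failing input to a minimal counterexample. This is usually more thorough than hand-written edge cases. NaN is excluded here by allow_nan=False; if the function must handle NaN or infinity, widen the strategy and add assertions for those cases.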

Strategy 2: Mock LLM Responses

When testing code that calls LLM APIs, don’t make real API calls — they’re expensive, slow, and non-deterministic. Use mocks:

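# Requires: httpx, respx, pytest-asyncio (and pytest-mock for the fixture further down)
import pytest
import httpx
import respx

async def summarize_text(text: str, client: httpx.AsyncClient) -> str:
    """Ask the chat completions endpoint for a summary and return its text."""
    response = await client.post(
        "https://api.openai.com/v1/chat/completions",
        json={
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
        },
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["choices"][0]["message"]["content"]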

@pytest.mark.asyncio
@respx.mock
async def test_summarize_text():
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={"choices": [{"message": {"content": "This is a summary."}}]},
        )
    )

    async with httpx.AsyncClient() as client:
        result = await summarize_text("Long text here...", client)
        assert result == "This is a summary."

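# Alternative: patch your own client wrapper instead of the HTTP layer.
# `mocker` comes from pytest-mock; "myapp.llm_client.complete" is a placeholder path.
@pytest.fixture
def mock_llm_response(mocker):
    mock = mocker.patch("myapp.llm_client.complete")
    mock.return_value = {"content": "mocked response", "usage": {"tokens": 100}}
    return mock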

Record real responses: On first run, call the real API and save responses as fixture files. Subsequent tests use the recorded responses — you get real data without depending on external services.
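The record-then-replay idea can be sketched with the standard library alone. The helper name replay_or_record, the fixture layout, and the fake_api stand-in are all assumptions made for this illustration; tools such as VCR.py implement the same idea at the HTTP layer.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

def replay_or_record(prompt: str, call_api: Callable[[str], str], fixture_dir: Path) -> str:
    """Return a recorded response if one exists; otherwise call the API and save it."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    path = fixture_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call_api(prompt)  # only reached when no recording exists yet
    fixture_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"prompt": prompt, "response": response}), encoding="utf-8")
    return response

# Demo with a fake API so the sketch runs offline.
calls: list[str] = []

def fake_api(prompt: str) -> str:
    calls.append(prompt)
    return f"summary of: {prompt}"

with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp) / "fixtures"
    first = replay_or_record("Summarize: hello", fake_api, fixtures)   # records
    second = replay_or_record("Summarize: hello", fake_api, fixtures)  # replays

print("API calls made:", len(calls))
```

In a real project the fixture directory would live in the repository rather than a temporary directory, so that CI replays the same recordings.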

Strategy 3: Snapshot Testing (Golden Files)

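For LLM output that’s hard to assert precisely, use snapshot testing: capture a “known good” output once, then compare against it in future runs. This works when the output is reproducible, for example when the LLM call is mocked or replayed from a recording.
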
# pip install pytest-snapshot

def generate_report(data: dict) -> str:
    # some AI logic that generates a report
    ...

def test_report_format(snapshot):
    data = {"sales": 1000, "period": "2024-Q4"}
    report = generate_report(data)

    # Create or refresh the snapshot with: pytest --snapshot-update
    # Normal runs compare against the stored file and fail on any difference
    snapshot.assert_match(report, "report_q4.txt")

When to update snapshots:

✅ Product requirements changed — output format should change
✅ Bug fix — new output is more correct than old snapshot
❌ Test fails for unknown reason — investigate first, don’t blindly update

Commit snapshot files to git so PR reviews show output changes.
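To show the mechanism without any plugin, here is a minimal golden-file helper built on the standard library. The names assert_matches_golden and render_report are hypothetical, and render_report is a deterministic stand-in for the report generator. Updating is an explicit flag, so refreshing a snapshot is always a deliberate decision.

```python
import tempfile
from pathlib import Path

def assert_matches_golden(actual: str, golden_path: Path, update: bool = False) -> None:
    """Compare `actual` to the golden file; write the file only when `update` is set."""
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    if not golden_path.exists():
        raise AssertionError(f"No golden file at {golden_path}; rerun with update=True")
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, f"Output differs from {golden_path.name}"

def render_report(data: dict) -> str:
    # Deterministic stand-in for the AI-backed report generator
    return f"Sales report {data['period']}\nTotal: {data['sales']}\n"

with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "snapshots" / "report_q4.txt"
    data = {"sales": 1000, "period": "2024-Q4"}
    assert_matches_golden(render_report(data), golden, update=True)  # capture once
    assert_matches_golden(render_report(data), golden)               # later runs compare
```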

Strategy 4: Contract Testing

AI-generated code may break interface contracts (function signatures, return types, field structure). Use type systems and schema validation:

from typing import TypedDict

class SearchResult(TypedDict, total=True):
    id: str
    title: str
    score: float
    metadata: dict

# AI-generated function must conform to this interface
def search_documents(query: str, top_k: int = 5) -> list[SearchResult]:
    ...

def test_search_returns_correct_schema():
    results = search_documents("test query")
    assert isinstance(results, list)
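    for r in results:
        assert "id" in r and isinstance(r["id"], str)
        assert "title" in r and isinstance(r["title"], str)
        assert "score" in r and isinstance(r["score"], float)
        assert 0.0 <= r["score"] <= 1.0  # assumes scores are normalized to [0, 1]
        assert "metadata" in r and isinstance(r["metadata"], dict)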

Run mypy in strict mode in CI as a compliance gate:

mypy --strict src/ --ignore-missing-imports
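Alongside static checking, the signature itself can be asserted at test time, which catches a regenerated function whose parameters or return annotation drifted. The following is a sketch using only the standard library; signature_violations is a hypothetical helper, and the search_documents body is a stub standing in for the AI-generated implementation.

```python
import inspect
from typing import TypedDict, get_type_hints

class SearchResult(TypedDict):
    id: str
    title: str
    score: float
    metadata: dict

def search_documents(query: str, top_k: int = 5) -> list[SearchResult]:
    # Hypothetical stub standing in for the AI-generated implementation
    return [{"id": "doc-1", "title": query, "score": 0.9, "metadata": {}}][:top_k]

def signature_violations(func, expected_params: dict, expected_return) -> list[str]:
    """List the ways `func` deviates from the agreed signature."""
    hints = get_type_hints(func)
    params = list(inspect.signature(func).parameters)
    problems = []
    if params != list(expected_params):
        problems.append(f"parameters {params} != {list(expected_params)}")
    for name, annotation in expected_params.items():
        if hints.get(name) is not annotation:
            problems.append(f"{name}: expected {annotation}, got {hints.get(name)}")
    if hints.get("return") != expected_return:
        problems.append(f"return: expected {expected_return}, got {hints.get('return')}")
    return problems

violations = signature_violations(
    search_documents, {"query": str, "top_k": int}, list[SearchResult]
)
print("violations:", violations)
```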

For LLM tool calls, validate output against JSON Schema:

import jsonschema

TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool_name", "parameters"],
    "properties": {
        "tool_name": {"type": "string", "enum": ["search", "calculate", "lookup"]},
        "parameters": {"type": "object"},
    },
}

def test_tool_call_schema():
    # get_llm_tool_call is the application function under test (not defined here)
    tool_call = get_llm_tool_call(query="search for X")
    jsonschema.validate(tool_call, TOOL_CALL_SCHEMA)  # raises ValidationError on mismatch

Strategy 5: Human Review Gate

AI-generated tests must be reviewed by humans before they become trusted tests.

Flag AI-generated test files in CI. The check assumes such files carry a “# AI-generated” marker comment:

# .github/workflows/ci.yml
- name: Check for AI-generated test files
  run: |
    ai_tests=$(grep -rl "# AI-generated" tests/ || true)
    if [ -n "$ai_tests" ]; then
      echo "::warning::AI-generated test files detected, require human review:"
      echo "$ai_tests"
    fi

Red-team your own tests:

1. Read all tests and identify the gap between “what tests say” and “what requirements say”.
2. Write a “cheater implementation”: do only what the tests require, not what the requirements require.
3. If the cheater implementation passes all tests, test coverage is insufficient.

Example: if your test only checks “function returns a non-empty string”, a cheater implementation can return "a". That reveals the tests don’t cover content quality.
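A small sketch of that exercise. Every name here is hypothetical, and the rules in stronger_test (a length range and a required keyword) are assumed requirements chosen only for illustration.

```python
TOPIC = "How to test AI-generated code"

def weak_test(generate_title) -> bool:
    """Only checks that the result is a non-empty string."""
    result = generate_title(TOPIC)
    return isinstance(result, str) and len(result) > 0

def stronger_test(generate_title) -> bool:
    """Adds checks derived from the (assumed) requirements."""
    result = generate_title(TOPIC)
    return (
        isinstance(result, str)
        and 10 <= len(result) <= 80   # assumed length requirement
        and "test" in result.lower()  # assumed: title must mention the topic
    )

def cheater_title(topic: str) -> str:
    """Does the least work needed to satisfy weak_test."""
    return "a"

def honest_title(topic: str) -> str:
    return f"Guide: {topic}"

print("cheater passes weak test:", weak_test(cheater_title))
print("cheater passes stronger test:", stronger_test(cheater_title))
print("honest passes stronger test:", stronger_test(honest_title))
```

If the cheater passes, tighten the tests until it fails while the honest implementation still passes.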