Evaluating AI-Generated Code Quality: How to Judge if AI Wrote Good Code

Context

AI-generated code has been mainstream for over a year. The shift from novelty toys to daily-driver tools happened faster than most expected.

But one question remains inadequately answered: how do you judge if AI-generated code is good quality?

Simply saying “have a human review it” isn’t enough. Reviewers often don’t know what standards to apply to AI code, which frequently looks correct while hiding subtle problems.

This article describes the evaluation framework I actually use.

Evaluation Dimensions

1. Correctness (Most Important)

The biggest problem with AI code: it looks right, but the results are wrong.

Evaluation method:

# Task given to AI: write quicksort
def quicksort(arr):
    """
    Returns sorted array using quicksort.
    """
    # AI generated code, looks correct
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Correctness tests
assert quicksort([3, 1, 4, 1, 5, 9, 2, 6]) == [1, 1, 2, 3, 4, 5, 6, 9]
assert quicksort([1]) == [1]
assert quicksort([]) == []
assert quicksort([2, 2, 2]) == [2, 2, 2]

Core problem: AI frequently makes mistakes on edge cases. Your test cases must cover:

  • Empty array
  • Single element
  • All identical elements
  • Already sorted
  • Reverse sorted
  • Very large array
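Hand-written asserts cover the cases you thought of; a randomized check against Python’s built-in sorted() also catches the ones you didn’t. A sketch (quicksort copied from above so it runs standalone):

```python
import random

def quicksort(arr):
    """Quicksort, as generated above."""
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Start from the edge cases above, then add random arrays of varying size.
cases = [[], [1], [2, 2, 2], list(range(10)), list(range(10, 0, -1))]
cases += [[random.randint(-100, 100) for _ in range(random.randint(0, 50))]
          for _ in range(200)]

for case in cases:
    # sorted() is the trusted reference implementation.
    assert quicksort(case) == sorted(case), f"mismatch on {case}"
```

The same pattern works for any function with a trusted reference: generate random inputs, compare outputs. It costs a few lines and routinely finds edge-case bugs a handful of handwritten asserts miss.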

2. Complexity

AI often writes code with suboptimal time/space complexity.

# AI wrote: O(n²)
def find_duplicates_bad(arr):
    duplicates = []
    for i in range(len(arr)):
        for j in range(i+1, len(arr)):
            if arr[i] == arr[j] and arr[i] not in duplicates:
                duplicates.append(arr[i])
    return duplicates

# Better: O(n)
def find_duplicates_good(arr):
    seen = set()
    duplicates = set()
    for num in arr:
        if num in seen:
            duplicates.add(num)
        seen.add(num)
    return list(duplicates)

Ask AI: “What’s this algorithm’s complexity?” — if AI can’t explain clearly, the code is likely not well thought out.
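A complexity claim is also easy to check empirically. A quick timeit sketch (the two functions copied from above so the snippet runs standalone) makes the O(n²) vs O(n) gap visible:

```python
import timeit

# Functions from above, repeated so the snippet is self-contained.
def find_duplicates_bad(arr):
    duplicates = []
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j] and arr[i] not in duplicates:
                duplicates.append(arr[i])
    return duplicates

def find_duplicates_good(arr):
    seen = set()
    duplicates = set()
    for num in arr:
        if num in seen:
            duplicates.add(num)
        seen.add(num)
    return list(duplicates)

data = list(range(1000)) * 2  # every value appears exactly twice

# Same answer, very different cost.
assert sorted(find_duplicates_bad(data)) == sorted(find_duplicates_good(data))

t_bad = timeit.timeit(lambda: find_duplicates_bad(data), number=1)
t_good = timeit.timeit(lambda: find_duplicates_good(data), number=1)
print(f"O(n^2): {t_bad:.4f}s  O(n): {t_good:.4f}s")
```

If the AI’s complexity answer and the measured scaling disagree, trust the measurement.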

3. Readability

AI code often “runs but nobody understands it.”

Evaluation questions:

# AI wrote: concise but cryptic
def p(x): return x and (x > 0) and (x % 2 == 0)

# Better:
def is_positive_even(x: int) -> bool:
    """Check if x is a positive even number."""
    return x > 0 and x % 2 == 0

Evaluation checklist:

  • Do variable names make sense?
  • Is there a docstring?
  • Complex logic annotated?
  • Functions single-responsibility?

4. Maintainability

# AI wrote: tightly coupled, hard to test
class UserService:
    def __init__(self, db_connection_string):
        self.db = connect(db_connection_string)
        self.emailer = EmailService()
        self.logger = Logger()
    
    def register(self, email, password):
        # all dependencies hardcoded, can't test register logic in isolation
        ...

# Better: dependency injection
class UserService:
    def __init__(self, db, emailer, logger):
        self.db = db
        self.emailer = emailer
        self.logger = logger

Evaluation questions:

  • Are dependencies explicitly injected?
  • Does function/class follow single responsibility?
  • Will changing one thing accidentally break another?
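With injected dependencies, the register logic can be exercised using in-memory fakes, no database or SMTP server required. A sketch (the fake classes and method names like save_user and send_welcome are hypothetical, chosen for illustration):

```python
class FakeDB:
    """In-memory stand-in for the real database (hypothetical)."""
    def __init__(self):
        self.users = {}
    def save_user(self, email, password):
        self.users[email] = password

class FakeEmailer:
    """Records outgoing mail instead of sending it (hypothetical)."""
    def __init__(self):
        self.sent = []
    def send_welcome(self, email):
        self.sent.append(email)

class FakeLogger:
    def info(self, msg):
        pass

class UserService:
    def __init__(self, db, emailer, logger):
        self.db = db
        self.emailer = emailer
        self.logger = logger

    def register(self, email, password):
        self.db.save_user(email, password)
        self.emailer.send_welcome(email)
        self.logger.info(f"registered {email}")

# The register logic is now testable in isolation.
db, emailer = FakeDB(), FakeEmailer()
service = UserService(db, emailer, FakeLogger())
service.register("a@example.com", "hunter2")
assert "a@example.com" in db.users
assert emailer.sent == ["a@example.com"]
```

If you can’t write a test like this without spinning up real infrastructure, the dependencies are too tightly coupled.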

Actual Workflow

My current AI code evaluation process:

1. AI generates code
   ↓
2. Run test suite (must have tests)
   ↓
3. Ask AI: what's this code's complexity?
   ↓
4. Ask AI: what's this function/class's responsibility? Does it follow single responsibility?
   ↓
5. Code review (focus on edge cases and dependency injection)
   ↓
6. Manually run a few edge cases

Conclusion

Core principles for evaluating AI code:

  1. Correctness: must have tests, covering edge cases
  2. Complexity: ask AI to explain complexity—if unclear, don’t use
  3. Readability: can you explain what this code does in your own words?
  4. Maintainability: dependency injection? single responsibility?

AI is a tool, not a replacement. AI-generated code still needs human quality assessment.

Treat AI as a junior engineer: can do the work, but must be reviewed.