Evaluating AI-Generated Code Quality: How to Judge if AI Wrote Good Code
Context
AI-generated code has been mainstream for over a year, and the transition from initial toys to daily-driver tools happened faster than expected.
But one question remains inadequately answered: how do you judge if AI-generated code is good quality?
Simply saying “have a human review it” isn’t enough: reviewers often don’t know what standards to apply to AI code, and AI code frequently looks correct while hiding subtle problems.
This article presents the evaluation framework I actually use.
Evaluation Dimensions
1. Correctness (Most Important)
The biggest problem with AI code: it looks right, but the results are wrong.
Evaluation method:
```python
# Task given to AI: write quicksort
def quicksort(arr):
    """Return a sorted array using quicksort."""
    # AI-generated code, looks correct
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Correctness tests
assert quicksort([3, 1, 4, 1, 5, 9, 2, 6]) == [1, 1, 2, 3, 4, 5, 6, 9]
assert quicksort([1]) == [1]
assert quicksort([]) == []
assert quicksort([2, 2, 2]) == [2, 2, 2]
```
Core problem: AI frequently makes mistakes on edge cases. Your test cases must cover:
- Empty array
- Single element
- All identical elements
- Already sorted
- Reverse sorted
- Very large array
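The checklist above can be bundled into a reusable harness so every AI-generated sorting function gets the same treatment. A minimal sketch (the helper name `edge_case_suite` is my own, not from any library):

```python
import random

def edge_case_suite(sort_fn):
    """Run a sorting function against common edge cases.

    Raises AssertionError on the first failure; returns True if all pass.
    """
    cases = [
        [],                                      # empty array
        [1],                                     # single element
        [2, 2, 2],                               # all identical elements
        [1, 2, 3, 4],                            # already sorted
        [4, 3, 2, 1],                            # reverse sorted
        random.sample(range(100_000), 10_000),   # very large array
    ]
    for case in cases:
        assert sort_fn(case) == sorted(case), f"failed on input starting {case[:5]}"
    return True
```

Usage: `edge_case_suite(quicksort)` — one call per candidate implementation, instead of hand-writing the same asserts every time.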
2. Complexity
AI often writes code with suboptimal time/space complexity.
```python
# AI wrote: O(n²)
def find_duplicates_bad(arr):
    duplicates = []
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j] and arr[i] not in duplicates:
                duplicates.append(arr[i])
    return duplicates

# Better: O(n)
def find_duplicates_good(arr):
    seen = set()
    duplicates = set()
    for num in arr:
        if num in seen:
            duplicates.add(num)
        seen.add(num)
    return list(duplicates)
```
Ask AI: “What’s this algorithm’s complexity?” If it can’t explain clearly, the code was likely not well thought out.
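Beyond asking the AI, you can sanity-check complexity empirically: time the function at two input sizes. An O(n²) implementation should slow down roughly 4× when n doubles, while an O(n) one should only take about 2× as long. A sketch using the two functions from this section (timings will vary by machine):

```python
import timeit

def find_duplicates_bad(arr):
    # O(n²): nested loops over the array
    duplicates = []
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j] and arr[i] not in duplicates:
                duplicates.append(arr[i])
    return duplicates

def find_duplicates_good(arr):
    # O(n): single pass with set lookups
    seen, duplicates = set(), set()
    for num in arr:
        if num in seen:
            duplicates.add(num)
        seen.add(num)
    return list(duplicates)

for n in (1_000, 2_000):
    data = list(range(n)) + list(range(n // 10))  # ~10% duplicates
    t_bad = timeit.timeit(lambda: find_duplicates_bad(data), number=3)
    t_good = timeit.timeit(lambda: find_duplicates_good(data), number=3)
    print(f"n={n}: quadratic {t_bad:.3f}s vs linear {t_good:.3f}s")
```

If doubling n roughly quadruples the runtime, you have empirical confirmation of the O(n²) behavior, regardless of what the AI claims.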
3. Readability
AI code often falls into the category of “runs, but nobody understands it.”
Compare:
```python
# AI wrote: concise but cryptic
def p(x): return x and (x > 0) and (x % 2 == 0)

# Better:
def is_positive_even(x: int) -> bool:
    """Check if x is a positive even number."""
    return x > 0 and x % 2 == 0
```
Evaluation checklist:
- Do variable names make sense?
- Is there a docstring?
- Complex logic annotated?
- Functions single-responsibility?
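Part of this checklist can be automated. As one illustration, the standard-library `ast` module can flag functions that lack a docstring; the helper name `functions_missing_docstrings` below is my own:

```python
import ast

def functions_missing_docstrings(source: str) -> list[str]:
    """Return names of functions in `source` that lack a docstring."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing

code = '''
def p(x): return x > 0 and x % 2 == 0

def is_positive_even(x: int) -> bool:
    """Check if x is a positive even number."""
    return x > 0 and x % 2 == 0
'''
print(functions_missing_docstrings(code))  # ['p']
```

Naming quality and single responsibility still need human judgment, but a mechanical pass like this catches missing documentation before review even starts.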
4. Maintainability
```python
# AI wrote: tightly coupled, hard to test
class UserService:
    def __init__(self, db_connection_string):
        self.db = connect(db_connection_string)
        self.emailer = EmailService()
        self.logger = Logger()

    def register(self, email, password):
        # All dependencies hardcoded; can't test register logic in isolation
        ...

# Better: dependency injection
class UserService:
    def __init__(self, db, emailer, logger):
        self.db = db
        self.emailer = emailer
        self.logger = logger
```
Evaluation questions:
- Are dependencies explicitly injected?
- Does function/class follow single responsibility?
- Will changing one thing accidentally break another?
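The payoff of dependency injection is that the registration logic becomes testable with hand-rolled fakes, with no real database, SMTP server, or log file involved. A sketch under assumptions: the `register` body and the fake classes below are hypothetical illustrations, not from the article.

```python
class UserService:
    def __init__(self, db, emailer, logger):
        self.db = db
        self.emailer = emailer
        self.logger = logger

    def register(self, email, password):
        # Hypothetical registration logic, for illustration only
        if self.db.find_user(email):
            raise ValueError("email already registered")
        self.db.save_user(email, password)
        self.emailer.send_welcome(email)
        self.logger.info(f"registered {email}")

# Hand-rolled fakes standing in for the real dependencies
class FakeDB:
    def __init__(self):
        self.users = {}
    def find_user(self, email):
        return self.users.get(email)
    def save_user(self, email, password):
        self.users[email] = password

class FakeEmailer:
    def __init__(self):
        self.sent = []
    def send_welcome(self, email):
        self.sent.append(email)

class FakeLogger:
    def info(self, msg):
        pass

service = UserService(FakeDB(), FakeEmailer(), FakeLogger())
service.register("a@example.com", "secret")
assert service.emailer.sent == ["a@example.com"]
```

With the hardcoded version, none of this is possible: constructing `UserService` would open a real database connection before any test logic runs.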
Actual Workflow
My current AI code evaluation process:
1. AI generates code
↓
2. Run test suite (must have tests)
↓
3. Ask AI: what's this code's complexity?
↓
4. Ask AI: what's this function/class's responsibility? Does it follow single responsibility?
↓
5. Code review (focus on edge cases and dependency injection)
↓
6. Manually run a few edge cases
Conclusion
Core principles for evaluating AI code:
- Correctness: must have tests, covering edge cases
- Complexity: ask AI to explain complexity—if unclear, don’t use
- Readability: can you explain what the code does in your own words?
- Maintainability: dependency injection? single responsibility?
AI is a tool, not a replacement. AI-generated code still needs human quality assessment.
Treat AI as a junior engineer: can do the work, but must be reviewed.