o3 Real Performance on Engineering Tasks: Not Every Problem Is Worth the Wait
Where o3 Actually Excels
o3 truly excels at abstract reasoning: on SWE-bench Lite (a software engineering benchmark) it scored 49.3%, about 20 percentage points above Claude 3.7.
Concretely, that means: hand o3 a real GitHub issue, and it fixes the code outright about 49% of the time.
49% sounds low, but this is the first time any AI model has approached 50% on a software engineering benchmark; every previous model lingered at 20-30%.
Real Engineering Tests
Tested o3 on real engineering tasks:
1. Bug Location
Task: FastAPI project, /users endpoint occasionally returns 500 errors
Tool: o3
Result: o3 located the issue through log analysis + code review—async session lifecycle under concurrent requests.
Time: 8 minutes.
Accuracy: ✅ Correct

```python
# Root cause:
# async def get_db():
#     db = Session()
#     yield db
#     await db.close()  # under concurrent requests, close may execute before commit

# Fix o3 found:
from sqlalchemy.ext.asyncio import AsyncSession

async def get_db():
    async with AsyncSession() as session:
        yield session
```

2. Code Refactoring
Task: Split a 3000-line Django view function into service layer
Tool: o3
Result: o3 gave a reasonable split plan, preserved original business logic.
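To make the shape of such a split concrete, here is a minimal sketch; all names (`OrderService`, `FakeRepo`, `place_order`) are hypothetical illustrations, not o3's actual output. The idea is that business logic moves out of the view into a service class with injected persistence:

```python
# Before: one Django view doing validation, business logic, and persistence.
# After: a thin view delegating to a service layer. Names are illustrative.

class OrderService:
    """Business logic extracted from the view."""
    def __init__(self, repo):
        self.repo = repo  # persistence is injected, so the service is testable

    def place_order(self, user_id, items):
        if not items:
            raise ValueError("empty order")
        total = sum(price * qty for price, qty in items)
        return self.repo.save(user_id=user_id, total=total)

class FakeRepo:
    """Test double standing in for the real ORM-backed repository."""
    def save(self, **kwargs):
        return kwargs

service = OrderService(FakeRepo())
print(service.place_order(1, [(10.0, 2), (5.0, 1)]))  # {'user_id': 1, 'total': 25.0}
```

Because the service takes its repository as a constructor argument, it can be unit-tested without a database, which is the main payoff of this kind of split.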
Time: 12 minutes.
Output quality: ✅ Usable

3. System Design
Task: Design a real-time chat system supporting 1M concurrent users
Tool: o3
Result: gave WebSocket + Redis pub/sub + sharding plan.
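The core of that plan (WebSocket connections fanned out through pub/sub) can be sketched without any infrastructure. The `Broker` below is an in-memory stand-in for Redis, used here only to show the fan-out pattern; in production each WebSocket connection would hold one Redis subscription instead:

```python
import asyncio
from collections import defaultdict

class Broker:
    """In-memory stand-in for Redis pub/sub, sketching the fan-out pattern."""
    def __init__(self):
        self.channels = defaultdict(list)  # channel -> subscriber queues

    def subscribe(self, channel):
        q = asyncio.Queue()
        self.channels[channel].append(q)
        return q

    async def publish(self, channel, message):
        # Fan the message out to every subscriber of this channel
        for q in self.channels[channel]:
            await q.put(message)

async def demo():
    broker = Broker()
    a = broker.subscribe("room:1")  # two clients in the same chat room
    b = broker.subscribe("room:1")
    await broker.publish("room:1", "hello")
    return await a.get(), await b.get()

print(asyncio.run(demo()))  # ('hello', 'hello')
```

Sharding in the real design would partition rooms across Redis instances; this sketch deliberately ignores that, which is exactly the cost-and-ops territory the o3 answer also skipped.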
Depth: textbook-level; it didn't consider actual cost and ops complexity.

o3’s Limitations
1. Thinking Time Too Long
Even on simple tasks, o3 thinks for 30+ seconds, which makes it unsuitable for daily high-frequency use.
Test: writing a simple Python function
- GPT-4o: 2 seconds
- o3: 45 seconds (most of it spent thinking)
- Answer: identical

2. High Cost
o3's cost is 10x o1's:
- Input: $15/M tokens
- Output: $60/M tokens

One complex task:
- Input: 5,000 tokens
- Output: 3,000 tokens (including reasoning)
- Total: $0.075 + $0.18 = $0.255, roughly ¥2 RMB
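That arithmetic checks out; a two-line sanity check at the quoted rates:

```python
# Verify the per-task cost at the quoted rates: $15/M input, $60/M output.
INPUT_PER_M = 15.0
OUTPUT_PER_M = 60.0

def task_cost(input_tokens, output_tokens):
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

print(f"${task_cost(5000, 3000):.3f}")  # $0.255
```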
Used 20 times a day, that's ¥40/day, or about ¥1,200/month.

3. Not for Simple Tasks
o3 is designed to optimize complex tasks. If the task is simple:
- "write me a hello world"
- "explain this code"
- "translate to Chinese"
→ o3 and GPT-4o produce the same result, but o3 is 50x more expensive.

What Scenarios Are Worth o3
Worth it:
- Complex bug location (multi-module interaction, concurrency issues)
- Architecture design review
- Codebase-level refactoring plans
- Formal verification
- Hard algorithms (competition level)
Not worth it:
- Simple function writing
- Code completion
- Documentation generation
- Daily chat
- Tasks needing fast iteration

Practical Workflow Recommendation
Daily tasks:
→ GPT-4o / Claude 3.7 Sonnet (fast, cheap)
Complex tasks (worth the wait):
→ o3 (accurate, expensive, slow)
Combined approach:
- use o3 for planning
- use Sonnet for quick implementation
- use o3 for critical code review

o3 isn’t a daily tool; it’s expert consultation for hard problems.
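That layering can be written down as a trivial dispatcher. The model names are real, but the task categories and the routing table are this article's assumptions, not any official API:

```python
# Route tasks to models by difficulty, following the layering above.
# Categories and table are illustrative assumptions, not an official API.
ROUTES = {
    "daily": "gpt-4o",                     # fast, cheap
    "implementation": "claude-3.7-sonnet", # quick coding work
    "planning": "o3",                      # slow, expensive, accurate
    "critical-review": "o3",
}

def pick_model(task_kind: str) -> str:
    # Default to the cheap tier for anything unclassified
    return ROUTES.get(task_kind, "gpt-4o")

print(pick_model("planning"))  # o3
print(pick_model("daily"))     # gpt-4o
```

The point of the default branch is the same as the article's advice: unless a task clearly earns the expensive tier, route it to the cheap one.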
Conclusion
o3’s significance: proves the ceiling for LLM capability on engineering tasks is still rising.
49% on SWE-bench sounds low, but it means AI can now independently handle nearly half of software engineering tasks. That number could reach 70% next year.
But o3 isn’t for every scenario. Daily development still uses Sonnet/GPT-4o—save o3 for truly worthy hard problems.
Tool chain layering matters more than blindly chasing the newest model.