o3 Real Performance on Engineering Tasks: Not Every Problem Is Worth the Wait
Where o3 Actually Excels
o3 truly excels at abstract reasoning: on SWE-bench Lite (a software engineering benchmark) it scored 49.3%, about 20 percentage points above Claude 3.7.
Concretely, that means: hand o3 a real GitHub issue, and it fixes the code outright about 49% of the time.
49% sounds low, but this is the first time any AI model has approached 50% on a software engineering benchmark; every previous model lingered at 20-30%.
Real Engineering Tests
Tested o3 on real engineering tasks:
1. Bug Location
Task: FastAPI project, /users endpoint occasionally returns 500 errors
Tool: o3
Result: o3 located the issue through log analysis + code review—async session lifecycle under concurrent requests.
Time: 8 minutes.
Accuracy: ✅ Correct

```python
# Root cause:
# async def get_db():
#     db = Session()
#     yield db
#     await db.close()  # under concurrent requests, close may execute before commit

# Fix o3 found:
from sqlalchemy.ext.asyncio import AsyncSession

async def get_db():
    async with AsyncSession() as session:
        yield session
```

2. Code Refactoring
Task: Split a 3000-line Django view function into service layer
Tool: o3
Result: o3 gave a reasonable split plan, preserved original business logic.
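To make the shape of such a split concrete, here is a minimal sketch; all names (`OrderService`, `FakeRepo`, `place_order`) are hypothetical illustrations, not o3's actual output. The idea is that business logic moves out of the view into a service class with injected persistence:

```python
# Before: one Django view doing validation, business logic, and persistence.
# After: a thin view delegating to a service layer. Names are illustrative.

class OrderService:
    """Business logic extracted from the view."""
    def __init__(self, repo):
        self.repo = repo  # persistence is injected, so the service is testable

    def place_order(self, user_id, items):
        if not items:
            raise ValueError("empty order")
        total = sum(price * qty for price, qty in items)
        return self.repo.save(user_id=user_id, total=total)

class FakeRepo:
    """Test double standing in for the real ORM-backed repository."""
    def save(self, **kwargs):
        return kwargs

service = OrderService(FakeRepo())
print(service.place_order(1, [(10.0, 2), (5.0, 1)]))  # {'user_id': 1, 'total': 25.0}
```

Because the service takes its repository as a constructor argument, it can be unit-tested without a database, which is the main payoff of this kind of split.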
Time: 12 minutes.
Output quality: ✅ Usable

3. System Design
Task: Design a real-time chat system supporting 1M concurrent users
Tool: o3
Result: gave WebSocket + Redis pub/sub + sharding plan.
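The core of that plan (WebSocket connections fanned out through pub/sub) can be sketched without any infrastructure. The `Broker` below is an in-memory stand-in for Redis, used here only to show the fan-out pattern; in production each WebSocket connection would hold one Redis subscription instead:

```python
import asyncio
from collections import defaultdict

class Broker:
    """In-memory stand-in for Redis pub/sub, sketching the fan-out pattern."""
    def __init__(self):
        self.channels = defaultdict(list)  # channel -> subscriber queues

    def subscribe(self, channel):
        q = asyncio.Queue()
        self.channels[channel].append(q)
        return q

    async def publish(self, channel, message):
        # Fan the message out to every subscriber of this channel
        for q in self.channels[channel]:
            await q.put(message)

async def demo():
    broker = Broker()
    a = broker.subscribe("room:1")  # two clients in the same chat room
    b = broker.subscribe("room:1")
    await broker.publish("room:1", "hello")
    return await a.get(), await b.get()

print(asyncio.run(demo()))  # ('hello', 'hello')
```

Sharding in the real design would partition rooms across Redis instances; this sketch deliberately ignores that, which is exactly the cost-and-ops territory the o3 answer also skipped.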
Depth: textbook-level; it didn't consider actual cost and ops complexity.

o3’s Limitations
1. Thinking Time Too Long
Even on simple tasks, o3 thinks for 30+ seconds, which makes it unsuitable for daily high-frequency use.
Test: writing a simple Python function
- GPT-4o: 2 seconds
- o3: 45 seconds (most of it spent thinking)
- Answer: identical

2. High Cost
o3's cost is 10x o1's:
- Input: $15/M tokens
- Output: $60/M tokens

One complex task:
- Input: 5,000 tokens
- Output: 3,000 tokens (including reasoning)
- Total: $0.075 + $0.18 = $0.255, roughly ¥2 RMB
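That arithmetic checks out; a two-line sanity check at the quoted rates:

```python
# Verify the per-task cost at the quoted rates: $15/M input, $60/M output.
INPUT_PER_M = 15.0
OUTPUT_PER_M = 60.0

def task_cost(input_tokens, output_tokens):
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

print(f"${task_cost(5000, 3000):.3f}")  # $0.255
```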
Used 20 times a day, that's ¥40/day, or about ¥1,200/month.

3. Not for Simple Tasks
o3 is designed to optimize complex tasks. If the task is simple:
- "write me a hello world"
- "explain this code"
- "translate to Chinese"
→ o3 and GPT-4o produce the same result, but o3 is 50x more expensive.

What Scenarios Are Worth o3
Worth it:
- Complex bug location (multi-module interaction, concurrency issues)
- Architecture design review
- Codebase-level refactoring plans
- Formal verification
- Hard algorithms (competition level)
Not worth it:
- Simple function writing
- Code completion
- Documentation generation
- Daily chat
- Tasks needing fast iteration

Practical Workflow Recommendation
Daily tasks:
→ GPT-4o / Claude 3.7 Sonnet (fast, cheap)
Complex tasks (worth the wait):
→ o3 (accurate, expensive, slow)
Combined approach:
- use o3 for planning
- use Sonnet for quick implementation
- use o3 for critical code review

o3 isn’t a daily tool; it’s expert consultation for hard problems.
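That layering can be written down as a trivial dispatcher. The model names are real, but the task categories and the routing table are this article's assumptions, not any official API:

```python
# Route tasks to models by difficulty, following the layering above.
# Categories and table are illustrative assumptions, not an official API.
ROUTES = {
    "daily": "gpt-4o",                     # fast, cheap
    "implementation": "claude-3.7-sonnet", # quick coding work
    "planning": "o3",                      # slow, expensive, accurate
    "critical-review": "o3",
}

def pick_model(task_kind: str) -> str:
    # Default to the cheap tier for anything unclassified
    return ROUTES.get(task_kind, "gpt-4o")

print(pick_model("planning"))  # o3
print(pick_model("daily"))     # gpt-4o
```

The point of the default branch is the same as the article's advice: unless a task clearly earns the expensive tier, route it to the cheap one.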
Conclusion
o3’s significance: proves the ceiling for LLM capability on engineering tasks is still rising.
49% on SWE-bench sounds low, but it means AI can now independently handle nearly half of software engineering tasks. That number could reach 70% next year.
But o3 isn’t for every scenario. Daily development still uses Sonnet/GPT-4o—save o3 for truly worthy hard problems.
Tool chain layering matters more than blindly chasing the newest model.