# LLM Reasoning Models: What o1/o3/C4 Actually Solve
## Bottom Line First
Reasoning models aren't "smarter LLMs"; they're LLMs that spend more inference-time compute to produce more accurate answers.
```
Regular LLM:     input → single pass → output
Reasoning model: input → multi-step reasoning chain → output
```
The core difference is allowing the model to “think” before answering. But that “thinking” costs time and money.
## How Reasoning Models Work

### Chain-of-Thought Made Explicit
```python
# Regular LLM inference
prompt = "What is SQL injection?"
response = llm.generate(prompt)  # direct answer

# Reasoning model inference
prompt = """
Question: What is SQL injection? How to prevent it?
Please reason step by step:
Step 1: ...
Step 2: ...
Final answer: ...
"""
response = reasoning_model.generate(prompt)
```

Reasoning models do something similar internally, but deeper than explicit CoT: they explore multiple reasoning paths, then pick the best one.
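The "explore multiple paths, pick the best" idea can be approximated from the outside with self-consistency sampling: draw several chain-of-thought completions at a nonzero temperature and majority-vote the final answers. A minimal sketch; the `llm.generate(prompt, temperature=...)` interface is an assumption, not any specific vendor's SDK:

```python
from collections import Counter

def self_consistency(llm, prompt, n=5):
    """Sample n chain-of-thought answers and return the majority final answer.

    `llm.generate(prompt, temperature=...)` is a hypothetical interface;
    swap in your provider's actual SDK call.
    """
    answers = []
    for _ in range(n):
        text = llm.generate(
            prompt + "\nReason step by step, then end with 'Final answer: ...'",
            temperature=0.8,  # diversity is the point: identical samples vote identically
        )
        # Keep only the text after the last "Final answer:" marker
        if "Final answer:" in text:
            answers.append(text.rsplit("Final answer:", 1)[1].strip())
    # Majority vote over the sampled final answers
    return Counter(answers).most_common(1)[0][0] if answers else None
```

This is strictly weaker than what o1/o3 do internally (no search, no learned verifier), but it captures the same trade: more samples, more cost, higher accuracy.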
## Test-time Compute
Traditional LLMs spend a fixed amount of compute per token once trained. Reasoning models dynamically allocate compute during inference:

```python
# Simple question  → fewer reasoning steps
# Complex question → more reasoning steps, verify intermediate results
```

This is why o1/o3 pricing is driven by output token volume: the hidden reasoning process consumes extra, billed tokens.
## Real Performance Data
Tested across several scenarios (success rates; improvement in percentage points):

| Task | Regular LLM | Reasoning Model | Improvement |
|---|---|---|---|
| Simple code completion | 92% | 93% | +1% |
| Medium algorithm problems | 65% | 81% | +16% |
| Hard algorithm problems | 23% | 67% | +44% |
| Math proofs | 41% | 78% | +37% |
| Bug location | 55% | 72% | +17% |
Conclusion: reasoning models show significant improvement on multi-step reasoning tasks, but minimal advantage on simple tasks.
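One way to read the table is cost per correct answer rather than cost per call: at 23% vs 67% on hard problems, the expensive model needs far fewer attempts. A minimal helper, assuming attempts are independent and you can detect wrong answers (e.g. via tests); both are simplifications:

```python
def cost_per_correct(cost_per_call: float, success_rate: float) -> float:
    """Expected spend to obtain one correct answer via retries.

    Assumes independent attempts and detectable failures -- both
    simplifying assumptions, not properties of real LLM workflows.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate
```

For example, `cost_per_correct(0.05, 0.67)` is about $0.075 per solved hard problem; run the same arithmetic with your cheap model's price and its 23% rate before deciding which is actually cheaper.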
## When to Use Reasoning Models

### Good fit
```python
# 1. Complex algorithmic problems
#    Need multi-step derivation and verification of intermediate results
task = "Implement an LFU cache with O(1) operations"
# Reasoning model success rate is much higher here

# 2. Math and science problems
#    o3 reached expert (PhD) level on the GPQA science benchmark
#    Gains in physics, chemistry, and engineering calculations

# 3. Complex bug location
#    Requires analyzing call chains and multi-module interactions
#    Reasoning models track the root cause better

# 4. Deep code review
#    Requires understanding the architecture and finding latent issues
```

### Not good fit
```python
# 1. Simple translation, formatting
#    A regular LLM is sufficient: faster and cheaper

# 2. Real-time interaction
#    High latency; reasoning models don't fit chat interfaces

# 3. High-frequency calls
#    Cost is 10-100x a regular LLM; spend it on things that matter
```

## The Cost Problem
o1's cost structure:

```python
# Input tokens:  $15/M (same as Opus)
# Output tokens: $60/M (4x the input price)

# Real example: one complex algorithm problem
#   Input:  2k tokens
#   Output: 800 tokens (includes billed reasoning tokens)
#   Cost:   $0.03 + $0.048 ≈ $0.08

# Regular LLM for the same task:
#   Cost: $0.00006 (orders of magnitude cheaper)
```

Reasoning models cost 10-100x more. Use them only on tasks where that accuracy gain is worth it.
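The per-call arithmetic is worth automating so estimates stay honest (2k input at $15/M is $0.03; 800 output at $60/M is $0.048). A tiny helper:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# The example above at o1 rates: 2k in, 800 out
o1_cost = call_cost(2_000, 800, 15, 60)  # 0.03 + 0.048 = 0.078
```

Prices change; treat the $15/$60 figures as a snapshot and plug in current rates.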
## Claude 3.5 Sonnet's Approach
Claude 3.5 Sonnet doesn’t have an explicit “reasoning mode”—Anthropic’s approach embeds reasoning capability through pretraining and RLHF.
Real testing:

```python
# On tasks requiring multi-step reasoning,
# Claude 3.5 Sonnet performs close to o1-preview,
# but with shorter response time (no waiting for "thinking").
# Good for: tasks that need reasoning but also speed
```

## Engineering Recommendation
```python
# Recommended tiered strategy:
#
# L1: Claude 3.5 Sonnet / GPT-4o
#     - Daily tasks: chat, completion, translation
#     - Latency-sensitive
#
# L2: o1 / o3-mini
#     - Complex algorithms, bug location, math problems
#     - Willing to wait, have budget
#
# L3: o3 (full)
#     - Extremely hard tasks (competition problems, formal proofs)
#     - Willing to pay more
```

Don't throw every task at L2/L3; costs will explode.
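The tiered strategy can start as a plain lookup table routing by task type. The task labels below mirror the tiers above but are illustrative, not a standard taxonomy; the model names are the ones this post discusses:

```python
# Illustrative routing table: task label -> model tier
ROUTES = {
    # L1: latency-sensitive daily work
    "chat": "claude-3.5-sonnet",
    "completion": "claude-3.5-sonnet",
    "translation": "claude-3.5-sonnet",
    # L2: hard but routine reasoning
    "algorithm": "o3-mini",
    "bug_location": "o3-mini",
    "math": "o3-mini",
    # L3: extreme difficulty
    "competition": "o3",
    "formal_proof": "o3",
}

def route(task_type: str) -> str:
    """Pick a model for a task; default to the cheap tier so unknown
    task types never silently burn the reasoning-model budget."""
    return ROUTES.get(task_type, "claude-3.5-sonnet")
```

Defaulting unknown tasks to L1 is the design choice that keeps costs bounded: escalation to L2/L3 must be explicit.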
## Conclusion
Reasoning models' value: on "think-before-answer" tasks, they raise success rates from roughly 30-50% to 70-80%.
Not a replacement for regular LLMs—a complement. In your toolchain, reasoning models should be a specialized tool for “hard problems,” not daily workhorses.