# LLM Reasoning Models: What o1/o3/C4 Actually Solve
## Bottom Line First
Reasoning models aren't "smarter LLMs"; they're LLMs that spend more inference-time compute to produce more accurate answers.
```
Regular LLM:     input → single pass → output
Reasoning model: input → multi-step reasoning chain → output
```
The core difference is allowing the model to “think” before answering. But that “thinking” costs time and money.
## How Reasoning Models Work

### Chain-of-Thought Made Explicit
```python
# Regular LLM inference
prompt = "What is SQL injection?"
response = llm.generate(prompt)  # direct answer

# Reasoning model inference
prompt = """
Question: What is SQL injection? How to prevent it?
Please reason step by step:
Step 1: ...
Step 2: ...
Final answer: ...
"""
response = reasoning_model.generate(prompt)
```

Reasoning models do something similar internally, but deeper than explicit CoT: they explore multiple reasoning paths, then pick the best one.
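The "explore multiple paths, pick the best" idea can be approximated from the outside with self-consistency sampling: draw several chain-of-thought completions at a nonzero temperature and majority-vote the final answers. A minimal sketch; the `llm.generate(prompt, temperature=...)` interface is an assumption, not any specific vendor's SDK:

```python
from collections import Counter

def self_consistency(llm, prompt, n=5):
    """Sample n chain-of-thought answers and return the majority final answer.

    `llm.generate(prompt, temperature=...)` is a hypothetical interface;
    swap in your provider's actual SDK call.
    """
    answers = []
    for _ in range(n):
        text = llm.generate(
            prompt + "\nReason step by step, then end with 'Final answer: ...'",
            temperature=0.8,  # diversity is the point: identical samples vote identically
        )
        # Keep only the text after the last "Final answer:" marker
        if "Final answer:" in text:
            answers.append(text.rsplit("Final answer:", 1)[1].strip())
    # Majority vote over the sampled final answers
    return Counter(answers).most_common(1)[0][0] if answers else None
```

This is strictly weaker than what o1/o3 do internally (no search, no learned verifier), but it captures the same trade: more samples, more cost, higher accuracy.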
## Test-time Compute
Traditional LLMs spend a fixed amount of compute per token once trained. Reasoning models dynamically allocate compute during inference:

```python
# Simple question  → fewer reasoning steps
# Complex question → more reasoning steps, verify intermediate results
```

This is why o1/o3 pricing is driven by output token volume: the hidden reasoning process consumes extra, billed tokens.
## Real Performance Data
Tested across several scenarios (success rates; improvement in percentage points):

| Task | Regular LLM | Reasoning Model | Improvement |
|---|---|---|---|
| Simple code completion | 92% | 93% | +1% |
| Medium algorithm problems | 65% | 81% | +16% |
| Hard algorithm problems | 23% | 67% | +44% |
| Math proofs | 41% | 78% | +37% |
| Bug location | 55% | 72% | +17% |
Conclusion: reasoning models show significant improvement on multi-step reasoning tasks, but minimal advantage on simple tasks.
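One way to read the table is cost per correct answer rather than cost per call: at 23% vs 67% on hard problems, the expensive model needs far fewer attempts. A minimal helper, assuming attempts are independent and you can detect wrong answers (e.g. via tests); both are simplifications:

```python
def cost_per_correct(cost_per_call: float, success_rate: float) -> float:
    """Expected spend to obtain one correct answer via retries.

    Assumes independent attempts and detectable failures -- both
    simplifying assumptions, not properties of real LLM workflows.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate
```

For example, `cost_per_correct(0.05, 0.67)` is about $0.075 per solved hard problem; run the same arithmetic with your cheap model's price and its 23% rate before deciding which is actually cheaper.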
## When to Use Reasoning Models

### Good fit
```python
# 1. Complex algorithmic problems
#    Need multi-step derivation and verification of intermediate results
task = "Implement an LFU cache with O(1) operations"
# Reasoning model success rate is much higher here

# 2. Math and science problems
#    o3 reached expert (PhD) level on the GPQA science benchmark
#    Gains in physics, chemistry, and engineering calculations

# 3. Complex bug location
#    Requires analyzing call chains and multi-module interactions
#    Reasoning models track the root cause better

# 4. Deep code review
#    Requires understanding the architecture and finding latent issues
```

### Not good fit
```python
# 1. Simple translation, formatting
#    A regular LLM is sufficient: faster and cheaper

# 2. Real-time interaction
#    High latency; reasoning models don't fit chat interfaces

# 3. High-frequency calls
#    Cost is 10-100x a regular LLM; spend it on things that matter
```

## The Cost Problem
o1's cost structure:

```python
# Input tokens:  $15/M (same as Opus)
# Output tokens: $60/M (4x the input price)

# Real example: one complex algorithm problem
#   Input:  2k tokens
#   Output: 800 tokens (includes billed reasoning tokens)
#   Cost:   $0.03 + $0.048 ≈ $0.08

# Regular LLM for the same task:
#   Cost: $0.00006 (orders of magnitude cheaper)
```

Reasoning models cost 10-100x more. Use them only on tasks where that accuracy gain is worth it.
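The per-call arithmetic is worth automating so estimates stay honest (2k input at $15/M is $0.03; 800 output at $60/M is $0.048). A tiny helper:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# The example above at o1 rates: 2k in, 800 out
o1_cost = call_cost(2_000, 800, 15, 60)  # 0.03 + 0.048 = 0.078
```

Prices change; treat the $15/$60 figures as a snapshot and plug in current rates.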
## Claude 3.5 Sonnet's Approach
Claude 3.5 Sonnet doesn’t have an explicit “reasoning mode”—Anthropic’s approach embeds reasoning capability through pretraining and RLHF.
Real testing:

```python
# On tasks requiring multi-step reasoning,
# Claude 3.5 Sonnet performs close to o1-preview,
# but with shorter response time (no waiting for "thinking").
# Good for: tasks that need reasoning but also speed
```

## Engineering Recommendation
```python
# Recommended tiered strategy:
#
# L1: Claude 3.5 Sonnet / GPT-4o
#     - Daily tasks: chat, completion, translation
#     - Latency-sensitive
#
# L2: o1 / o3-mini
#     - Complex algorithms, bug location, math problems
#     - Willing to wait, have budget
#
# L3: o3 (full)
#     - Extremely hard tasks (competition problems, formal proofs)
#     - Willing to pay more
```

Don't throw every task at L2/L3; costs will explode.
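The tiered strategy can start as a plain lookup table routing by task type. The task labels below mirror the tiers above but are illustrative, not a standard taxonomy; the model names are the ones this post discusses:

```python
# Illustrative routing table: task label -> model tier
ROUTES = {
    # L1: latency-sensitive daily work
    "chat": "claude-3.5-sonnet",
    "completion": "claude-3.5-sonnet",
    "translation": "claude-3.5-sonnet",
    # L2: hard but routine reasoning
    "algorithm": "o3-mini",
    "bug_location": "o3-mini",
    "math": "o3-mini",
    # L3: extreme difficulty
    "competition": "o3",
    "formal_proof": "o3",
}

def route(task_type: str) -> str:
    """Pick a model for a task; default to the cheap tier so unknown
    task types never silently burn the reasoning-model budget."""
    return ROUTES.get(task_type, "claude-3.5-sonnet")
```

Defaulting unknown tasks to L1 is the design choice that keeps costs bounded: escalation to L2/L3 must be explicit.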
## Conclusion
Reasoning models' value: on "think-before-answer" tasks, they raise success rates from roughly 30-50% to 70-80%.
Not a replacement for regular LLMs—a complement. In your toolchain, reasoning models should be a specialized tool for “hard problems,” not daily workhorses.