# Gemini 3 Deep Think: 84.6% on ARC-AGI-2, 0.4 Points Shy of the AGI Signal Threshold
## What ARC-AGI Measures
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is one of the benchmarks closest to testing “general intelligence.”
It doesn’t test knowledge; it tests the ability to solve novel problems under unknown rules. Given a visual puzzle, the model must infer the transformation rule and apply it to new figures.
This makes it far more indicative of general intelligence than knowledge tests like MMLU.
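To make the task format concrete, here is a minimal toy sketch of an ARC-style puzzle. Grids are 2D arrays of color indices, the solver sees a few input/output demonstration pairs, and it must infer the hidden transformation and apply it to a fresh input. The rule here (mirror each row) is an invented toy, not an actual ARC-AGI-2 task:

```python
# Toy sketch of an ARC-style task. The hidden rule (horizontal flip)
# is hypothetical; real ARC-AGI-2 rules are far more varied.

def flip_horizontal(grid):
    """The hidden transformation: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# Demonstration pairs the solver is shown (input -> output):
demos = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0]], [[0, 5, 5]]),
]

# A solver that has correctly inferred the rule reproduces every demo...
assert all(flip_horizontal(inp) == out for inp, out in demos)

# ...and can then apply it to an unseen test grid.
test_input = [[7, 0, 0], [0, 7, 0]]
print(flip_horizontal(test_input))  # [[0, 0, 7], [0, 7, 0]]
```

The point of the format is that the rule is never stated: success depends on abstraction from a handful of examples, not on recalling facts.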
## What Happened on February 13, 2026
Google released Gemini 3’s dedicated reasoning mode, Deep Think, which scored 84.6% on ARC-AGI-2.
What this number means:
| Threshold | Score | Implication |
|---|---|---|
| Average human | ~60% | baseline for comparison |
| ARC Prize “strong AGI signal” | ≥85% | model exhibits genuine general reasoning |
| Gemini 3 Deep Think | 84.6% | 0.4 percentage points below the threshold |
The previous best performers were Claude Opus 4.6 and GPT-5.2, both around 75%. Gemini 3 Deep Think pushed that to 84.6%.
## Technical Approach
Deep Think gives the model more “thinking time”: rather than outputting an answer directly, the model performs multi-step internal verification before committing to a final answer.
This is test-time compute pushed to its limit: spending more inference compute to buy more accurate results.
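The test-time compute idea can be sketched as a sample-verify-vote loop. Note that `attempt_solution` and `verify` below are hypothetical stand-ins for a model's stochastic reasoning pass and its self-check; this is a generic pattern, not Google's actual Deep Think mechanism:

```python
from collections import Counter

# Hedged sketch of test-time compute scaling: spend budget on many
# reasoning attempts, keep only those that pass internal verification,
# then majority-vote. All functions here are hypothetical placeholders.

def attempt_solution(puzzle, attempt_idx):
    # Stand-in for one stochastic reasoning pass by the model;
    # here it just cycles through candidate answers deterministically.
    candidates = puzzle["candidate_answers"]
    return candidates[attempt_idx % len(candidates)]

def verify(puzzle, answer):
    # Stand-in for a self-check, e.g. replaying the inferred rule
    # against the demonstration pairs.
    return answer == puzzle["demo_consistent_answer"]

def solve_with_thinking_budget(puzzle, n_attempts=16):
    attempts = [attempt_solution(puzzle, i) for i in range(n_attempts)]
    verified = [a for a in attempts if verify(puzzle, a)]
    if not verified:
        return None  # budget exhausted without a verified answer
    # Majority vote over answers that survived verification.
    return Counter(verified).most_common(1)[0][0]

puzzle = {"candidate_answers": ["A", "B", "C"],
          "demo_consistent_answer": "B"}
print(solve_with_thinking_budget(puzzle))  # B
```

The trade-off is explicit: `n_attempts` is the knob that converts extra inference compute into extra reliability.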
## Why It Still Didn’t Hit 85%
Google didn’t explicitly explain it, but the industry suspects:
- Some ARC-AGI-2 problems involve real-world physical intuition—even with correct reasoning, models can stall on commonsense knowledge
- The gap between 84.6% and 85% might reflect noise in the evaluation set or question distribution rather than a genuine capability gap
ARC Prize’s official position: exceeding 85% means the model demonstrates genuine general reasoning capability. Gemini 3 Deep Think is now the closest challenger.
## Practical Meaning for Developers
However high the score, ARC-AGI still measures one specific type of reasoning task.
For product developers, the signal from Deep Think mode is this: on complex multi-step inference tasks, AI is approaching human-expert level.
Code review, mathematical proofs, scientific hypothesis validation: for tasks like these, teams can now seriously consider AI as the primary solver rather than a backup.
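As a workflow pattern, “AI as primary solver” typically means the model proposes, automated checks gate, and humans review only what passes. The sketch below is a hypothetical pipeline; every function name is invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical "AI as primary solver" triage pipeline: the model proposes
# a fix, automated checks gate it, and a human reviews only what passes.

@dataclass
class Proposal:
    patch: str
    checks_passed: bool

def ai_propose_fix(bug_report):
    # Stand-in for a model call with a large thinking budget.
    return Proposal(patch=f"fix for: {bug_report}", checks_passed=True)

def run_checks(proposal):
    # Stand-in for the test suite / linters verifying the patch.
    return proposal.checks_passed

def triage(bug_report):
    proposal = ai_propose_fix(bug_report)
    if run_checks(proposal):
        return ("queue_for_human_review", proposal.patch)
    return ("fall_back_to_human_solver", None)

print(triage("off-by-one in pagination"))
```

The design choice worth noting: the human moves from first solver to final reviewer, which only works when an automated verification step sits between the model and the human.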