# Gemini 3 Deep Think: 84.6% on ARC-AGI-2, 0.4 Points Shy of the AGI Signal Threshold
## What ARC-AGI Measures
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is one of the benchmarks closest to testing “general intelligence.”
It doesn’t test knowledge; it tests the ability to solve novel problems under unknown rules. Given a visual puzzle, the model must infer the transformation rule and apply it to new figures.
This makes it far more indicative of general intelligence than knowledge tests like MMLU.
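To make the task format concrete, here is a minimal toy sketch of an ARC-style puzzle. Grids are 2D arrays of color indices, the solver sees a few input/output demonstration pairs, and it must infer the hidden transformation and apply it to a fresh input. The rule here (mirror each row) is an invented toy, not an actual ARC-AGI-2 task:

```python
# Toy sketch of an ARC-style task. The hidden rule (horizontal flip)
# is hypothetical; real ARC-AGI-2 rules are far more varied.

def flip_horizontal(grid):
    """The hidden transformation: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# Demonstration pairs the solver is shown (input -> output):
demos = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0]], [[0, 5, 5]]),
]

# A solver that has correctly inferred the rule reproduces every demo...
assert all(flip_horizontal(inp) == out for inp, out in demos)

# ...and can then apply it to an unseen test grid.
test_input = [[7, 0, 0], [0, 7, 0]]
print(flip_horizontal(test_input))  # [[0, 0, 7], [0, 7, 0]]
```

The point of the format is that the rule is never stated: success depends on abstraction from a handful of examples, not on recalling facts.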
## What Happened on February 13, 2026
Google released Gemini 3’s dedicated reasoning mode, Deep Think, which scored 84.6% on ARC-AGI-2.
What this number means:
| Threshold | Score | Implication |
|---|---|---|
| Average human | ~60% | baseline for comparison |
| ARC Prize “strong AGI signal” | ≥85% | model exhibits genuine general reasoning |
| Gemini 3 Deep Think | 84.6% | 0.4 percentage points below the threshold |
The previous best performers were Claude Opus 4.6 and GPT-5.2, both around 75%. Gemini 3 Deep Think pushed that to 84.6%.
## Technical Approach
Deep Think gives the model more “thinking time”: rather than outputting an answer directly, the model performs multi-step internal verification before committing to a final answer.
This is test-time compute pushed to its limit: spending more inference compute to buy more accurate results.
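The test-time compute idea can be sketched as a sample-verify-vote loop. Note that `attempt_solution` and `verify` below are hypothetical stand-ins for a model's stochastic reasoning pass and its self-check; this is a generic pattern, not Google's actual Deep Think mechanism:

```python
from collections import Counter

# Hedged sketch of test-time compute scaling: spend budget on many
# reasoning attempts, keep only those that pass internal verification,
# then majority-vote. All functions here are hypothetical placeholders.

def attempt_solution(puzzle, attempt_idx):
    # Stand-in for one stochastic reasoning pass by the model;
    # here it just cycles through candidate answers deterministically.
    candidates = puzzle["candidate_answers"]
    return candidates[attempt_idx % len(candidates)]

def verify(puzzle, answer):
    # Stand-in for a self-check, e.g. replaying the inferred rule
    # against the demonstration pairs.
    return answer == puzzle["demo_consistent_answer"]

def solve_with_thinking_budget(puzzle, n_attempts=16):
    attempts = [attempt_solution(puzzle, i) for i in range(n_attempts)]
    verified = [a for a in attempts if verify(puzzle, a)]
    if not verified:
        return None  # budget exhausted without a verified answer
    # Majority vote over answers that survived verification.
    return Counter(verified).most_common(1)[0][0]

puzzle = {"candidate_answers": ["A", "B", "C"],
          "demo_consistent_answer": "B"}
print(solve_with_thinking_budget(puzzle))  # B
```

The trade-off is explicit: `n_attempts` is the knob that converts extra inference compute into extra reliability.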
## Why It Still Didn’t Hit 85%
Google didn’t explicitly explain it, but the industry suspects:
- Some ARC-AGI-2 problems involve real-world physical intuition—even with correct reasoning, models can stall on commonsense knowledge
- The gap between 84.6% and 85% might reflect noise in the evaluation set or question distribution rather than a genuine capability gap
ARC Prize’s official position: exceeding 85% means the model demonstrates genuine general reasoning capability. Gemini 3 Deep Think is now the closest challenger.
## Practical Meaning for Developers
However high the score, ARC-AGI still measures one specific type of reasoning task.
For product developers, the signal from Deep Think mode is this: on complex multi-step inference tasks, AI is approaching human-expert level.
Code review, mathematical proofs, scientific hypothesis validation: for tasks like these, teams can now seriously consider AI as the primary solver rather than a backup.
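As a workflow pattern, “AI as primary solver” typically means the model proposes, automated checks gate, and humans review only what passes. The sketch below is a hypothetical pipeline; every function name is invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical "AI as primary solver" triage pipeline: the model proposes
# a fix, automated checks gate it, and a human reviews only what passes.

@dataclass
class Proposal:
    patch: str
    checks_passed: bool

def ai_propose_fix(bug_report):
    # Stand-in for a model call with a large thinking budget.
    return Proposal(patch=f"fix for: {bug_report}", checks_passed=True)

def run_checks(proposal):
    # Stand-in for the test suite / linters verifying the patch.
    return proposal.checks_passed

def triage(bug_report):
    proposal = ai_propose_fix(bug_report)
    if run_checks(proposal):
        return ("queue_for_human_review", proposal.patch)
    return ("fall_back_to_human_solver", None)

print(triage("off-by-one in pagination"))
```

The design choice worth noting: the human moves from first solver to final reviewer, which only works when an automated verification step sits between the model and the human.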