Gemini 3.1 Pro: 77.1% on ARC-AGI-2, Hallucination Rate Dropped from 88% to 44%
Release
On February 20, 2026, Google officially launched Gemini 3.1 Pro.
It is the non-reasoning-mode counterpart to Gemini 3 Deep Think, released on February 13, and it prioritizes practicality and generalization over specialized reasoning breakthroughs.
Core Numbers
ARC-AGI-2 Benchmark
| Model | ARC-AGI-2 Score | Relative Change vs Previous Gen |
|---|---|---|
| Gemini 3.0 Pro | 31.1% | — |
| Gemini 3.1 Pro | 77.1% | +148% |
The score more than doubled, to roughly 2.5x the previous generation's result. ARC-AGI-2 is considered one of the best benchmarks for measuring general reasoning, and 77.1% was the highest score recorded for a non-reasoning-mode model at the time.
Hallucination Rate Improvement
| Model | Hallucination Rate |
|---|---|
| Gemini 3.0 Pro | 88% |
| Gemini 3.1 Pro | 44% |
The rate was cut in half, a drop of 44 percentage points. Google didn’t disclose the exact evaluation methodology, and different benchmarks define hallucination differently, but the magnitude of the change is notable.
Tool Use (APEX-Agents)
| Model | APEX-Agents |
|---|---|
| Gemini 3.0 Pro | 18.4% |
| Gemini 3.1 Pro | 33.5% |
Tool call success rate improved by about 82% in relative terms (from 18.4% to 33.5%, a gain of 15.1 percentage points).
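The changes quoted in this section mix two kinds of figures: percentage-point differences and relative changes. The short Python sketch below re-derives both from the table values. The helper name `deltas` is just an illustrative choice, and the scores are copied from the tables as reported, not measured independently.

```python
def deltas(old, new):
    """Return (percentage-point change, relative change in percent)."""
    points = new - old
    relative = (new - old) / old * 100
    return points, relative


# Scores copied from the tables above, in percent (previous gen, 3.1 Pro).
scores = {
    "ARC-AGI-2": (31.1, 77.1),
    "Hallucination rate": (88.0, 44.0),  # lower is better
    "APEX-Agents": (18.4, 33.5),
}

for name, (old, new) in scores.items():
    points, relative = deltas(old, new)
    print(f"{name}: {points:+.1f} points, {relative:+.0f}% relative")

# ARC-AGI-2: +46.0 points, +148% relative
# Hallucination rate: -44.0 points, -50% relative
# APEX-Agents: +15.1 points, +82% relative
```

Reading both columns together avoids a common mix-up: the "+148%" and "~82%" figures are relative to the previous score, not the number of points gained.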
How It Differs from Deep Think Mode
Deep Think (February 13) is a dedicated reasoning mode that gives the model maximum thinking time, with high latency as the tradeoff.
3.1 Pro is the practical mode, with normal latency. It uses some distillation techniques to transfer Deep Think’s reasoning capability, retaining most of the reasoning gains while returning to acceptable latency.
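Google has not described which distillation techniques it used, so the following is only a generic illustration of the idea. It sketches the classic soft-label distillation loss, in which a student model is trained to match a teacher's temperature-softened output distribution. The function names, the temperature value, and the logits are all hypothetical and are not taken from any Gemini documentation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-way choice; not real model outputs.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different answer

print(distillation_loss(teacher, close_student))  # small loss
print(distillation_loss(teacher, far_student))    # much larger loss
```

Minimizing a loss of this kind pushes a faster student toward the slower teacher's answers, which matches the general tradeoff described here: most of the reasoning gains at normal latency.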
Meaning for Developers
A 77.1% score on ARC-AGI-2 indicates that the model is quite capable on complex abstract reasoning tasks, which makes it a viable choice for products that require understanding complex rules and multi-step inference.
A 44% hallucination rate is still not low, but for use cases involving long-document understanding and multi-step reasoning, 3.1 Pro is substantially more usable than the previous generation.