Gemini 3.1 Pro: 77.1% on ARC-AGI-2, Hallucination Rate Dropped from 88% to 44%

Release

On February 20, 2026, Google officially launched Gemini 3.1 Pro.

It is the non-reasoning-mode counterpart to Gemini 3 Deep Think, released on February 13, and it prioritizes practicality and generalization over specialized reasoning breakthroughs.

Core Numbers

ARC-AGI-2 Benchmark

| Model | ARC-AGI-2 Score | vs Previous Gen |
| --- | --- | --- |
| Gemini 3.0 Pro | 31.1% | - |
| Gemini 3.1 Pro | 77.1% | +148% |

The score more than doubled, reaching roughly 2.5 times the previous generation’s result. ARC-AGI-2 is considered one of the best benchmarks for measuring general reasoning, and 77.1% was the highest score recorded for a non-reasoning-mode model at the time.
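The "+148%" figure and the "more than doubled" claim describe the same change in two ways. A minimal sketch of the arithmetic, using only the two scores from the table above (the function name `relative_change` is illustrative, not from the source):

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old

# ARC-AGI-2 scores from the table above, in percent.
old_score, new_score = 31.1, 77.1

ratio = new_score / old_score                    # how many times larger the new score is
change = relative_change(old_score, new_score)   # gain relative to the old score

print(f"ratio: {ratio:.2f}x")
print(f"relative change: {change:+.0%}")
```

A ratio above 2 is what "more than doubled" means; the relative change is simply that ratio minus 1, expressed as a percentage.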

Hallucination Rate Improvement

| Model | Hallucination Rate |
| --- | --- |
| Gemini 3.0 Pro | 88% |
| Gemini 3.1 Pro | 44% |

The rate was cut in half. Google didn’t disclose the exact evaluation methodology, and different benchmarks define hallucination differently, so the figures should be read with that caveat. Even so, the magnitude of the drop is notable.

Tool Use (APEX-Agents)

| Model | APEX-Agents |
| --- | --- |
| Gemini 3.0 Pro | 18.4% |
| Gemini 3.1 Pro | 33.5% |

The tool-call success rate rose by 15.1 percentage points, a relative improvement of about 82%.
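Percentage-point differences and relative changes are easy to confuse when both are written with a percent sign. The sketch below separates the two for the hallucination and APEX-Agents figures quoted above; the helper `compare` and the dictionary layout are assumptions made for illustration.

```python
def compare(old_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (percentage-point difference, relative change as a fraction)."""
    points = new_pct - old_pct
    relative = points / old_pct
    return points, relative

# (Gemini 3.0 Pro, Gemini 3.1 Pro) values from the tables above, in percent.
results = {
    "Hallucination rate": (88.0, 44.0),
    "APEX-Agents": (18.4, 33.5),
}

for name, (old, new) in results.items():
    points, relative = compare(old, new)
    print(f"{name}: {points:+.1f} points, {relative:+.0%} relative")
```

For hallucination, a lower number is better, so the negative change is the improvement; for APEX-Agents, a higher number is better.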

How It Differs from Deep Think Mode

Deep Think (February 13) is a dedicated reasoning mode that gives the model maximum thinking time; the tradeoff is high latency.

Gemini 3.1 Pro is the practical mode, with normal latency. It uses distillation techniques to transfer Deep Think’s reasoning capability, retaining most of the reasoning gains while keeping latency acceptable.

Meaning for Developers

A score of 77.1% on ARC-AGI-2 indicates that the model is quite capable on complex abstract reasoning tasks. That makes it a viable choice for products that require understanding complex rules and multi-step inference.

A 44% hallucination rate is still not low, but for use cases involving long-document understanding and multi-step reasoning, 3.1 Pro is substantially more usable than the previous generation.