Gemini 3.1 Pro: 77.1% on ARC-AGI-2, Hallucination Rate Dropped from 88% to 44%

Simi, filed under AI

2026-02-20

Release

On February 20, 2026, Google officially launched Gemini 3.1 Pro.

It is the non-reasoning-mode counterpart to Gemini 3 Deep Think, released on February 13, and it prioritizes practicality and generalization over specialized reasoning breakthroughs.

Core Numbers

ARC-AGI-2 Benchmark

Model	ARC-AGI-2 Score	vs Previous Gen
Gemini 3.0 Pro	31.1%	—
Gemini 3.1 Pro	77.1%	+148%

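The score more than doubled, from 31.1% to 77.1% (a gain of 46 percentage points, or about 148% in relative terms). ARC-AGI-2 is considered one of the best benchmarks for measuring general reasoning, and 77.1% was the highest score ever recorded for a non-reasoning-mode model at the time.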

Hallucination Rate Improvement

Model	Hallucination Rate
Gemini 3.0 Pro	88%
Gemini 3.1 Pro	44%

The rate was cut in half, a drop of 44 percentage points. Google didn’t disclose the exact evaluation methodology, and different benchmarks define hallucination differently, but the magnitude is notable.

Tool Use (APEX-Agents)

Model	APEX-Agents
Gemini 3.0 Pro	18.4%
Gemini 3.1 Pro	33.5%

Tool call success rate improved by about 82% in relative terms (18.4% to 33.5%, a gain of 15.1 percentage points).
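The three tables above mix two ways of describing a change: the difference in percentage points and the relative change against the old value. The following is a minimal sketch of that arithmetic, using only the figures quoted in this post; the function names `point_change` and `relative_change` are illustrative and not taken from any benchmark tooling.

```python
def point_change(old: float, new: float) -> float:
    """Absolute change in percentage points (new minus old)."""
    return new - old


def relative_change(old: float, new: float) -> float:
    """Change relative to the old value, expressed in percent."""
    return (new - old) / old * 100.0


# (old, new) pairs copied from the tables in this post:
# Gemini 3.0 Pro vs Gemini 3.1 Pro.
scores = {
    "ARC-AGI-2": (31.1, 77.1),
    "Hallucination rate": (88.0, 44.0),
    "APEX-Agents": (18.4, 33.5),
}

for name, (old, new) in scores.items():
    print(
        f"{name}: {point_change(old, new):+.1f} points, "
        f"{relative_change(old, new):+.0f}% relative"
    )
```

Run as written, this gives +148% for ARC-AGI-2, -50% for the hallucination rate, and +82% for APEX-Agents, matching the figures quoted in the text. For the hallucination rate a negative change is the improvement, since lower is better.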

How It Differs from Deep Think Mode

Deep Think (released February 13) is a dedicated reasoning mode that gives the model maximum thinking time, with high latency as the tradeoff.

3.1 Pro is the practical mode, with normal latency. It uses distillation techniques to transfer Deep Think’s reasoning capability, retaining most of the reasoning gains while keeping latency acceptable.

Meaning for Developers

A score of 77.1% on ARC-AGI-2 means this model is quite capable on complex abstract reasoning tasks. It is a viable choice for products that require understanding complex rules and multi-step inference.

A 44% hallucination rate is still not low, but for use cases involving long-document understanding and multi-step reasoning, 3.1 Pro is substantially more usable than the previous generation.