Gemini Reasoner: First Model to Surpass Human Average on Complex Reasoning

What Happened

Google DeepMind released Gemini Reasoner on January 5, 2026.

This isn’t a routine Gemini version bump. It’s a model specifically optimized for complex reasoning, with the core breakthrough being cross-modal logical reasoning—the ability to simultaneously understand text, images, and audio, then perform deep semantic inference across these modalities.

The Numbers

From the official benchmark results:

Task                             | Gemini Reasoner | Human Average
Scientific hypothesis generation | 92.3%           | ~75%
Causal inference                 | 92.3%           | ~70%
Long-horizon planning            | 92.3%           | ~65%

All three tasks share a single evaluation set, so the 92.3% is effectively one aggregate number repeated three times. But it's enough for Google to claim "surpassing human average."

Real-World Use Case

The interesting part is the case disclosed in the Nature Methods preprint:

Using Gemini Reasoner to assist in discovering three potential anti-aging compound targets.

The pipeline: the model takes candidate molecules' bioactivity data and known protein-interaction networks as input, outputs hypotheses about protein-compound relationships, and human researchers then verify those hypotheses experimentally. All 3 out of 3 hypotheses were validated.

This isn’t a toy demo—it’s a real paper.
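The paper doesn't publish code, so the following is only a minimal sketch of the "model proposes, humans verify" loop described above. Every name in it (the scoring stub, the compound and protein labels, `generate_hypotheses`, `verify`) is invented for illustration; the real model call and wet-lab validation are stubbed out.

```python
# Hypothetical sketch of the human-in-the-loop pipeline: the model ranks
# candidate protein-compound relationships, humans verify each one.

def generate_hypotheses(bioactivity, ppi_network, top_k=3):
    """Stand-in for the model: score and rank protein-compound pairs.

    bioactivity: {compound: activity score}
    ppi_network: {compound: [proteins it plausibly interacts with]}
    """
    scored = []
    for compound, activity in bioactivity.items():
        for protein in ppi_network.get(compound, []):
            scored.append((activity, compound, protein))
    scored.sort(reverse=True)  # highest-activity candidates first
    return [(compound, protein) for _, compound, protein in scored[:top_k]]

def verify(hypothesis):
    """Placeholder for experimental validation by human researchers."""
    compound, protein = hypothesis
    return True  # in the reported case, 3 of 3 hypotheses held up

# Toy inputs; the compound and protein names are made up.
bioactivity = {"cmpd_A": 0.91, "cmpd_B": 0.84, "cmpd_C": 0.77}
ppi_network = {"cmpd_A": ["SIRT1"], "cmpd_B": ["mTOR"], "cmpd_C": ["AMPK"]}

hypotheses = generate_hypotheses(bioactivity, ppi_network)
validated = [h for h in hypotheses if verify(h)]
```

The point of the structure, not the stub logic, is what matters: the model narrows a large hypothesis space to a short ranked list, and humans remain the final filter.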

How It Differs from o3

OpenAI’s o3 (released December 2024) is also a reasoning model, but the approaches differ:

o3: pure text chain-of-thought reasoning, RL-driven
Gemini Reasoner: native multimodal reasoning, DeepMind's world models approach

o3 is stronger on math and coding tasks; Gemini Reasoner has the edge on scientific reasoning requiring cross-modal understanding.

Significance

This is the first time AI has systematically outperformed human average on complex cross-modal reasoning tasks.

Not surpassing on a multiple-choice benchmark like MMLU, but on scientific hypothesis generation, causal inference, and long-horizon planning—tasks that genuinely test general-purpose reasoning ability.

The last comparable milestone was GPT-4o surpassing humans on MMLU in 2024. But MMLU is multiple choice; the reasoning depth isn’t comparable.

Caveats

Surpassing average doesn’t mean it can fully replace human researchers in actual scientific workflows. The current pipeline is “model generates hypotheses, humans verify”—humans are still in the loop.

Also, the 92.3% figure comes solely from DeepMind's self-reported benchmark; there is no third-party reproduction yet.