Gemini 2.5 Pro Beats Pokémon Blue, Marking AI Milestone in Strategic Gameplay and Reasoning


In a significant milestone for AI performance, Google’s most advanced model, Gemini 2.5 Pro, has completed the classic 1996 Game Boy title Pokémon Blue. Google CEO Sundar Pichai shared the news on social media, writing, “What a finish! Gemini 2.5 Pro just completed Pokémon Blue!” Google itself did not orchestrate the playthrough. It was the independent work of Joel Z, a software engineer who set up and streamed the AI’s gameplay.

Gemini’s Gameplay Sparks Interest at Google, Inspired by Claude’s Pokémon AI Challenge

Though Joel Z is not affiliated with Google, his project—Gemini Plays Pokémon—has gained the admiration and encouragement of top Google executives. Logan Kilpatrick, Google AI Studio’s product lead, had previously posted about Gemini’s steady progress through the game, noting the model’s acquisition of its fifth badge. This prompted playful commentary from Pichai, who quipped about building “API” — Artificial Pokémon Intelligence. This public attention from high-level Google figures underscores the tech giant’s interest in AI’s potential to solve complex, open-ended tasks.

The choice to use Pokémon Blue as a testing ground for AI reasoning isn’t arbitrary. Earlier this year, rival AI company Anthropic used Pokémon Red to demonstrate its Claude models’ progress in agent-based reasoning tasks. Classic Pokémon games require extended memory, strategic planning, and adaptability, making them ideal testbeds for evaluating the decision-making capabilities of large language models (LLMs). Joel Z credited the Claude Plays Pokémon Twitch channel as the inspiration for his Gemini-powered version.

Comparing AI Models is Complex: Different Tools, Inputs, and Evolving Frameworks Shape Outcomes

Despite Gemini’s victory, Joel Z cautioned against drawing direct comparisons between AI models. Both Claude and Gemini rely on “agent harnesses” — systems that feed them enriched game data (like annotated screenshots) and help them determine next steps in gameplay. These harnesses vary across models, making standardized benchmarking difficult. Joel emphasized that different models use distinct toolsets and receive different levels of support, rendering one-to-one comparisons unreliable.
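
To make the idea of an agent harness concrete, here is a minimal Python sketch of the general pattern: package the game state into a prompt, ask the model for a button press, and validate the answer. It is an illustration only, not Joel Z’s or Anthropic’s actual code. Every name in it (Observation, Harness, choose_action, VALID_BUTTONS) is hypothetical, and the model call is replaced by a plain function so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical button set; a real harness defines its own action space.
VALID_BUTTONS = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START", "SELECT"}


@dataclass
class Observation:
    """Enriched game data handed to the model (all fields are assumptions)."""
    screenshot_note: str        # stand-in for an annotated screenshot
    position: Tuple[int, int]   # stand-in for extra state read from the game


@dataclass
class Harness:
    """Wraps a model call: builds the prompt, validates the reply, keeps history."""
    choose_action: Callable[[str], str]   # stand-in for the LLM call
    history: List[str] = field(default_factory=list)
    max_history: int = 5

    def build_prompt(self, obs: Observation) -> str:
        # Only the most recent actions are included, to keep the prompt short.
        recent = ", ".join(self.history[-self.max_history:]) or "none"
        return (
            f"Screen: {obs.screenshot_note}\n"
            f"Position: {obs.position}\n"
            f"Recent actions: {recent}\n"
            "Which button should be pressed next?"
        )

    def step(self, obs: Observation) -> str:
        reply = self.choose_action(self.build_prompt(obs)).strip().upper()
        # Fall back to a harmless default if the reply is not a known button.
        action = reply if reply in VALID_BUTTONS else "A"
        self.history.append(action)
        return action


if __name__ == "__main__":
    # A fake "model" that always walks up, used only to exercise the loop.
    harness = Harness(choose_action=lambda prompt: "up")
    obs = Observation("Player faces a door to the north", (3, 7))
    print(harness.step(obs))
```

How much a harness puts into the prompt, and which tools it exposes, is exactly the kind of difference Joel Z says makes model-to-model comparisons unreliable.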

Joel Z acknowledged that he occasionally intervened to guide Gemini but insisted these interventions were not cheats. Rather than giving specific instructions or walkthroughs, he aimed to improve the model’s overall reasoning. The one notable exception was telling Gemini how to work around a game bug involving a Team Rocket character, which was later resolved in a different version of the game. He also stressed that Gemini Plays Pokémon is still an evolving framework, with further improvements expected as both the AI model and its supporting tools advance.