What Gemma 4 meant for the world

Gemma 4 runs on consumer hardware, beats models four times its size, and made local AI viable. But the version you download isn’t the one that produced the benchmarks, and “open weights” still doesn’t mean “open source.”

📋 Source: Google DeepMind, Hugging Face, llama.cpp

Where we’re headed

If you’ve ever tried running a decent AI model locally and given up, Gemma 4 deserves your attention. It’s not just a bump in specs - it reshapes what’s possible on consumer hardware and raises uncomfortable questions about what “open” really means in the AI world. We’ll walk through the performance numbers, the architecture behind them, what it means for local AI, and the gap between the benchmarks and what you actually get.

The numbers

The 31B Dense model reached #3 on the Arena AI leaderboard among open models. Few would have bet on that for a model of that size. But the direct comparisons with Gemma 3 tell the real story:

Benchmark	Gemma 3 27B	Gemma 4 31B	Relative change
AIME 2026	20.8%	89.2%	+329%
LiveCodeBench v6	29.1%	80.0%	+175%
Codeforces ELO	110	2,150	+1,855%
τ2-bench Retail	6.6%	86.4%	+1,209%
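The last column reads as relative change over the Gemma 3 score, not as a difference in percentage points. A minimal sketch of that arithmetic, using only the scores from the table (the function name is illustrative):

```python
def relative_change_pct(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0

# (Gemma 3 27B score, Gemma 4 31B score), copied from the table above
scores = {
    "AIME 2026": (20.8, 89.2),
    "LiveCodeBench v6": (29.1, 80.0),
    "Codeforces ELO": (110, 2150),
    "tau2-bench Retail": (6.6, 86.4),
}

for name, (old, new) in scores.items():
    print(f"{name}: {relative_change_pct(old, new):+.0f}%")
```

Relative change inflates quickly when the baseline is small: the AIME gain is 68.4 percentage points, but because Gemma 3 started near 20%, the relative figure is several hundred percent.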

On GPQA Diamond, the expert-level benchmark in physics, chemistry, and biology, Gemma 4 31B reaches 84.3%, surpassing even GPT-OSS 120B at 76.2%. That is a 31-billion-parameter model outperforming one roughly four times its size. On MMMLU, the multilingual benchmark, it reaches 85.2%.

These aren’t marginal gains. Something in the architecture and training worked in a fundamentally different way.

The local angle

The 31B Dense model runs in full, without quantization, on a single NVIDIA H100 80 GB. Quantized versions run on consumer GPUs with 24 GB of VRAM, like an RTX 4090 or RTX 3090: at Q4 the weights fit entirely in VRAM, while at Q8 they come to roughly 31 GB on their own and need partial offload to system RAM. The E2B model runs on embedded and IoT devices. And the 26B MoE needs only 6 GB of RAM on Apple Silicon, producing 49 tokens per second.
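A back-of-the-envelope estimate shows where these hardware tiers come from. The sketch below counts weight storage only and assumes 2 bytes per parameter for unquantized bf16, 1 byte for Q8, and 0.5 bytes for Q4. It ignores the KV cache, activations, and the per-block scale factors that real quantization formats add, so actual usage will be higher.

```python
# Assumed storage cost per parameter, in bytes (weights only)
BYTES_PER_WEIGHT = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_memory_gb(params_billion, fmt):
    """Approximate weight size in GB: 1e9 params * bytes each / 1e9 bytes per GB."""
    return params_billion * BYTES_PER_WEIGHT[fmt]

def fits(params_billion, fmt, vram_gb):
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gb(params_billion, fmt) <= vram_gb

for fmt in ("bf16", "q8", "q4"):
    size = weight_memory_gb(31, fmt)
    print(f"31B at {fmt}: ~{size:.1f} GB, fits in 24 GB: {fits(31, fmt, 24)}")
```

Under these assumptions the unquantized 31B model needs about 62 GB, which is why it takes an 80 GB card, and only the 4-bit build leaves headroom on a 24 GB consumer GPU.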

NVIDIA collaborated directly with Google to optimize Gemma 4 on RTX, DGX Spark, and Jetson Orin Nano. This isn’t coincidental: local AI is no longer a niche hobbyist area but a strategic market.

Gemma 4 is also compatible with local AI agent frameworks like OpenClaw, enabling autonomous agents that access local files, workflows, and tools. An open-weight model serving as the brain of a fully local, private AI agent isn’t theory anymore - it’s something you can set up today.

Try it yourself
- 31B Dense: 24 GB VRAM, run with `ollama run gemma4:31b`
- 26B MoE: 6 GB RAM on Apple Silicon, ~49 tokens/s on base Mac Mini M4
- E2B: runs on embedded and IoT devices
- Expected context: 256,000 tokens
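Once the model has been pulled, it can also be called from code instead of the interactive prompt. The sketch below assumes an Ollama server running locally on its default port (11434) and that the `gemma4:31b` tag from the command above is available. The helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server runs elsewhere
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="gemma4:31b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="gemma4:31b", timeout=300):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this README in three bullet points.")` returns the model's reply as a string. Nothing leaves the machine, which is the point of running locally.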

Open-weight, not open-source

Here’s where the narrative gets complicated. Gemma 4 ships under Apache 2.0, which permits commercial use without extra usage restrictions. The weights are downloadable. But “open weights” isn’t “open source,” and the distinction matters.

Three problems:

Training data isn’t public. We don’t know what datasets, filters, or biases were used. Without this, full reproducibility is impossible.
The thinking mode was distilled away. All Gemma 4 models have a structured reasoning mode that can generate over 4,000 tokens of thought before responding. This is the mode that produced the exceptional benchmark scores. But the full “thinking” version was removed before public release. The weights you download don’t include this capability.
The tooling is fragile. Within 72 hours of release, 47 crash reports accumulated in llama.cpp. Initial support was absent in llama.cpp, vLLM, and SGLang. The hybrid architecture, MoE, and native multimodality all required significant updates. The 26B MoE model “thought” even without the activation token, generating hidden reasoning that consumed tokens and confused applications.
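The hidden-reasoning problem is easier to picture with a defensive filter on the application side. The sketch below separates reasoning blocks from the visible answer. The `<think>` and `</think>` delimiters are placeholders chosen for illustration, not Gemma 4's actual markers, so pass whatever delimiters your runtime really emits.

```python
import re

def strip_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split model output into the visible answer and any reasoning blocks.

    The default tags are hypothetical placeholders, not Gemma 4's real tokens.
    """
    pattern = re.compile(
        re.escape(open_tag) + r".*?" + re.escape(close_tag), re.DOTALL
    )
    hidden = pattern.findall(text)             # reasoning blocks, delimiters included
    visible = pattern.sub("", text).strip()    # what the user should actually see
    return visible, hidden

raw = "<think>2 + 2 is basic arithmetic.\nThe sum is 4.</think>The answer is 4."
answer, reasoning = strip_reasoning(raw)
print(answer)
```

A filter like this hides the reasoning from the user but does not give the tokens back: they were still generated and still count against the output budget.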

Presenting scores obtained with a version of the model that isn’t the distributed one creates a gap between marketing and reality. A user downloading Gemma 4 expecting to replicate official benchmarks will be disappointed. And a model that crashes 47 times in three days isn’t production-ready, regardless of how impressive its benchmarks are.

The competitive landscape

Gemma 4 arrives in a crowded market. Qwen 3.5 with 201 languages, Kimi K2.5 with 256K context and agent swarms, GLM-5 with its anti-hallucination record, MiniMax M2.5 with the best quality-to-price ratio, GPT-OSS, Nemotron 3, OLMo 3: no shortage of choices.

But the real competition isn’t about benchmark scores. It’s about the ecosystem: tooling, fine-tuning, integration, community. Google plays a subtle game here. Making Gemma 4’s weights accessible isn’t philanthropy - it’s a funnel strategy toward Gemini. Gemma 4 is built on Gemini 3’s research: the open-weight release is a by-product of a much larger investment, designed to build a community and ecosystem that makes it harder for developers to leave Google’s orbit.

Meta does the same with Llama. Open-weight as a commercial strategy isn’t new, but it’s the first time competition on this front has been this intense. The winner won’t be whoever has the best model, but whoever builds the most robust ecosystem around it.

The Gemmaverse

Over 400 million downloads across all Gemma generations. More than 100,000 fine-tuned variants. Hugging Face, Google AI Studio, and Kaggle as distribution channels. It’s a mature ecosystem, with developers customizing models for medicine, finance, education, and cybersecurity.

But the scale raises a question: if so many developers depend on a model whose tooling is still fragile and whose “complete” version isn’t the distributed one, how resilient is this ecosystem really?

For the technically curious

From here on, we’re getting into the architecture. If you care about the idea more than the implementation, you can skip to the conclusion.

Hybrid Attention: 256K context on consumer hardware

Standard transformer attention is quadratic: double the context length and the number of token pairs to score quadruples, along with the memory a naive implementation needs to hold them. This kept local models stuck on short contexts for years. Gemma 4 adopts a hybrid architecture: local layers with a 512-token window, whose cost grows linearly with context length, alternating with sparse global layers that capture long-range dependencies. The result is a 256,000-token context that runs on consumer hardware, not HPC clusters.
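The scaling difference can be made concrete by counting how many query-key pairs one causal attention layer has to score. This is a rough sketch: it compares a full layer with a 512-token sliding-window layer and says nothing about how many layers of each kind Gemma 4 actually interleaves.

```python
def full_causal_pairs(n):
    """Pairs scored by full causal attention: token i attends to tokens 0..i."""
    return n * (n + 1) // 2

def sliding_window_pairs(n, window=512):
    """Pairs scored when each token attends to at most `window` recent tokens."""
    if n <= window:
        return full_causal_pairs(n)
    # The first `window` tokens behave like full attention; the rest see `window` each.
    return full_causal_pairs(window) + (n - window) * window

for n in (128_000, 256_000):
    full = full_causal_pairs(n)
    local = sliding_window_pairs(n)
    print(f"n={n}: full={full:.3e} pairs, window={local:.3e} pairs")
```

Doubling the context from 128,000 to 256,000 tokens roughly quadruples the work of the full layer but only doubles it for the windowed layer, which is why keeping most layers local makes long contexts affordable.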

For developers building local AI agents, this is a paradigm shift. Entire codebases, complete technical documentation, or very long conversations without running out of memory - all on a card with 24 GB of VRAM.

Native multimodality

Most “multimodal” open-weight models glue a vision encoder on top of a text model. It works, but it’s fragile and limited. Gemma 4 integrates SigLIP end-to-end from training. The E2B and E4B models support text, images, and audio. The 26B and 31B models support text, images, and video. Multimodality isn’t an accessory - it’s part of the model.

This changes practical applications. An AI agent that needs to analyze a screenshot, read a PDF, and understand a short audio clip can do it with a single model, without orchestrating three separate systems. For local AI, where resources are constrained, this integration is valuable.

MoE in the 26B: 97% of the performance at 12% of the cost

The 26B model uses a Mixture of Experts architecture with 16 experts, of which only 2 are activated per token. That leaves 3.8 billion active parameters out of 26 billion total, for approximately 97% of the 31B Dense’s performance at about 12% of the computational cost per token (3.8 billion active parameters against 31 billion).
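The routing step is what keeps the active parameter count low. Below is a toy sketch of top-2 routing over 16 experts in plain Python. The experts are stand-in scalar functions and the router logits are made up, so this shows the general technique rather than Gemma 4's actual router.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # numerically stable
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# 16 toy experts: expert i just multiplies its input by (i + 1)
experts = [lambda x, scale=i + 1: x * scale for i in range(16)]

logits = [0.0] * 16
logits[3] = 2.0   # made-up router scores favouring experts 3 and 7
logits[7] = 2.0

print(top_k_route(logits))
print(moe_forward(1.0, experts, logits))

# Cost ratio quoted in the text: active parameters versus the dense model
print(f"{3.8 / 31:.1%} of the 31B Dense's parameters are active per token")
```

The other 14 experts are never evaluated for this token, so their weights cost storage but no compute.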

On Apple Silicon with mmap, the 26B MoE requires only 6 GB of RAM and produces 49 tokens per second. On a base Mac Mini M4, this means having a model nearly as capable as the 31B without dedicated hardware. Democratization of high-level AI, made concrete.
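Memory mapping is what lets a sparse model occupy less RAM than its file size: the operating system loads a page from disk only when it is touched. The toy example below maps a file holding 16 fake "expert" blocks and reads just two of them. The file layout and block size are invented for illustration and do not reflect any real model format.

```python
import mmap
import os
import tempfile

EXPERT_BYTES = 4096   # invented block size, one block per fake expert
NUM_EXPERTS = 16

def write_fake_weights(path):
    """Write 16 blocks; block e is filled with the byte value e."""
    with open(path, "wb") as f:
        for e in range(NUM_EXPERTS):
            f.write(bytes([e]) * EXPERT_BYTES)

def read_expert(mm, expert_id):
    """Slice one expert's block out of the mapping; only its pages get touched."""
    start = expert_id * EXPERT_BYTES
    return mm[start:start + EXPERT_BYTES]

def demo(selected=(3, 7)):
    """Map the file and read only the selected experts' blocks."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_fake_weights(path)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return [read_expert(mm, e)[0] for e in selected]
    finally:
        os.remove(path)

print(demo())
```

With 2 of 16 experts active per token, most of the mapped file can stay on disk at any given moment, at the price of page faults whenever routing switches to an expert that is not yet resident.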

The bottom line

Key takeaways:

Gemma 4 runs on consumer hardware and competes with models four times its size - local AI is no longer a prototype
The version you download isn’t the one that produced the benchmarks - the thinking mode was distilled away before release
The open ecosystem around AI models is still fragile: 47 crashes in three days, missing framework support, hidden reasoning behavior
Open-weight is now a commercial strategy, not just philanthropy - Google, Meta, and others are building ecosystems to lock developers in

Local AI became real. But “open source” in the AI model world remains a partial promise: the weights are open, but the training data isn’t, and the tooling is still fragile. Benchmarks are impressive, but achieved with a version that isn’t the one you download. The hardware is accessible, but the surrounding ecosystem still needs work.