Google dropped Gemma 4 on April 3, 2026, and the headline isn’t the benchmark numbers — it’s the licence. After years of frustrating developers with a restrictive custom licence, Google switched Gemma 4 to Apache 2.0. That changes the commercial calculus entirely.

Four models, all running on your own hardware, free to use in production. If you’ve been watching the open-source AI space, here’s what you need to know.

Among the best AI models of 2026, open-source options have been closing the gap with proprietary ones faster than most people expected. Gemma 4 is the sharpest example of that trend yet.

What Is Gemma 4?

Gemma is Google’s line of open-weight models — the publicly available distillation of the research and architecture that powers Gemini. Gemma 4 is the fourth generation, built directly on Gemini 3 technology. That’s not marketing copy. It means the same research advances in Google’s flagship proprietary model are now available to run on your own hardware.

The developer community has downloaded Gemma models over 400 million times across the full family, producing more than 100,000 variants. There’s real adoption here, not just launch-day fanfare.

The Four Gemma 4 Models Explained

Gemma 4 comes in four sizes, each targeting different hardware:

| Model | Active Params | Target Hardware | Context Window | Audio Input |
|-------|---------------|-----------------|----------------|-------------|
| E2B (Effective 2B) | 2B | Smartphones, Raspberry Pi, IoT | 128K tokens | Yes |
| E4B (Effective 4B) | 4B | Smartphones, Raspberry Pi, Jetson Orin Nano | 128K tokens | Yes |
| 26B MoE | 3.8B active (26B total) | Developer workstations, consumer GPUs | 256K tokens | No |
| 31B Dense | 31B | Single H100 (unquantized) or consumer GPU (quantized) | 256K tokens | No |

The “effective” naming on the edge models reflects how they work: engineered for maximum memory efficiency, activating fewer parameters during inference to preserve RAM and battery. All four process images and video natively. The two edge models also handle audio for speech recognition.

The 26B MoE is architecturally interesting. It has 26 billion total parameters, but only activates 3.8 billion during inference — the Mixture of Experts design routes tokens to specialised parameter subsets on the fly. Result: fast tokens-per-second because you’re not running all 26B parameters on every token.
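
To make the routing idea concrete, here’s a minimal sketch of top-k Mixture of Experts gating in plain Python with NumPy. It’s illustrative only, not Gemma 4’s actual routing code; the expert count, hidden size, and top-k value are invented for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, router_w, experts, top_k=2):
    """Route one token through a toy Mixture of Experts layer."""
    # Score every expert, but only run the top_k best matches.
    scores = softmax(router_w @ token)
    chosen = np.argsort(scores)[-top_k:]

    # Renormalise the gate weights over the chosen experts only.
    gates = scores[chosen] / scores[chosen].sum()

    # Weighted sum of the selected experts' outputs. The other
    # experts' parameters are never touched for this token.
    return sum(g * experts[i](token) for g, i in zip(gates, chosen))

# Toy setup: 8 experts, 2 active per token, so only a quarter of the
# expert parameters do any work on a given token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [
    (lambda w: (lambda x: np.tanh(w @ x)))(rng.normal(size=(d, d)))
    for _ in range(n_experts)
]
router_w = rng.normal(size=(n_experts, d))
print(moe_forward(rng.normal(size=d), router_w, experts).shape)  # (16,)
```

That selective activation is the whole trick: per-token compute scales with the active parameters, not the total.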

The 31B Dense is the quality-first option. Slower, but all 31B parameters are always active. Google positions it as the better foundation for fine-tuning — richer model, more to adapt.

Hardware Requirements: What Do You Actually Need?

Concrete numbers:

  • 31B Dense, unquantized (bfloat16): Single 80GB NVIDIA H100, a $20,000+ GPU. Not for most people.
  • 31B Dense, quantized: Consumer GPU. An RTX 4090 or 3090 with 24GB VRAM is realistic at Q4/Q5; speed takes a hit, but it’s usable (see the loading sketch after this list).
  • 26B MoE: Same pattern as the 31B: H100 unquantized, consumer GPU quantized. The 3.8B active-parameter footprint means better tokens/sec than the 31B on the same hardware.
  • E4B: Android, Raspberry Pi, NVIDIA Jetson Orin Nano. Genuinely offline-capable on a $50 Pi 5.
  • E2B: Same edge targets; the pick for the most memory-constrained devices.
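
If you want to try the consumer-GPU path, the sketch below shows the general shape of a 4-bit load with Hugging Face transformers and bitsandbytes. The repo id is an assumption based on the gg-hf-gg/gemma-4-* naming mentioned later in this article; check the actual Hub listing before running.

```python
# Hypothetical 4-bit load of the 31B Dense on a 24GB consumer GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "gg-hf-gg/gemma-4-31b"  # assumed repo id, not confirmed

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Q4-class quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spill layers to CPU if VRAM runs short
)

inputs = tokenizer("Explain MoE routing in one paragraph.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```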

Google worked directly with Qualcomm and MediaTek to optimise the edge models for modern mobile chips, and they’re forward-compatible with the AICore Developer Preview on Android. This isn’t “runs on mobile eventually” — it was co-engineered with the chip makers from the start.

The Apache 2.0 Licence Switch — Why This Actually Matters

This deserves more coverage than the benchmarks.

Previous Gemma models shipped under a custom Google licence that made legal teams uneasy. The old terms let Google update the prohibited-use policy unilaterally. They required developers to enforce Google’s rules on any downstream projects. Some readings even extended licence obligations to other AI models trained on synthetic data from Gemma. For anything commercial or production-critical, those terms were a genuine barrier.

Gemma 4 ditches all of it for Apache 2.0. If you’ve shipped software before, you know Apache 2.0 — use it commercially, modify it, redistribute it, include it in proprietary products. No royalties, no overbearing terms. You own your infrastructure and your weights.

This removes the single biggest adoption barrier for enterprise use of Gemma. Teams that stayed on the sidelines because of licence risk can now build on it cleanly.

Gemma 4 Benchmark Performance

The “outcompetes models 20x its size” claim comes from the Arena AI leaderboard — and the data supports it:

  • Arena AI Elo: 31B = 1452, 26B = 1441. The 31B ranks #3 globally among open models; the 26B is #6. The models ahead of them (GLM-5, Kimi 2.5) are substantially larger.
  • AIME 2026 (maths): 31B = 89.2%, 26B = 88.3%
  • LiveCodeBench v6 (coding): 31B = 80.0%, 26B = 77.1%
  • GPQA Diamond (science): 31B = 84.3%, 26B = 82.3%
  • τ²-bench agentic tool use: 31B = 86.4%, 26B = 85.5%

For context: Gemma 3 27B scored 20.8% on AIME 2026 and 6.6% on the agentic benchmark. The generation-over-generation jump is significant, especially on reasoning and agentic tasks.

The edge models are more modest, but a model that scores 52% on LiveCodeBench while running offline on a smartphone is genuinely new territory. E4B hits 57.5% on the agentic benchmark. That’s real capability in a device footprint.

Agentic Workflow Support

Worth calling out specifically: Gemma 4 has native support for function calling, structured JSON output, and system instructions for common tools and APIs. Baked in, not bolted on.

For anyone building AI agents — models that plan and execute multi-step tasks using tools — having a local model that reliably calls APIs, parses responses, and chains actions changes your architecture options. You don’t need a cloud dependency for the orchestration layer.
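
Here’s what that can look like locally: a minimal tool-call loop using the ollama Python client and the gemma4 tag from the install section below. The get_weather tool is a hypothetical stub, and the exact tool-call format may differ in the shipped model, so treat this as a sketch rather than a recipe.

```python
# Minimal local function-calling loop via the ollama Python client.
# "gemma4" matches the pull command later in this article; the
# get_weather tool is a made-up example.
import ollama

def get_weather(city: str) -> str:
    return f"Sunny, 21°C in {city}"  # stub; a real tool would hit an API

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]
resp = ollama.chat(model="gemma4", messages=messages, tools=tools)
msg = resp["message"]

if msg.get("tool_calls"):
    # Run each requested tool and feed the results back to the model.
    messages.append(msg)
    for call in msg["tool_calls"]:
        result = get_weather(**call["function"]["arguments"])
        messages.append({"role": "tool", "content": result})
    final = ollama.chat(model="gemma4", messages=messages)
    print(final["message"]["content"])
else:
    print(msg["content"])
```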

How Gemma 4 Compares to the Competition

vs Llama 4 (Meta): Both target the open-source developer market with similar timing. Gemma’s edge-first story is stronger — Google co-engineered with Qualcomm and MediaTek. Llama has a larger ecosystem and more community tooling. For mobile or IoT deployments, Gemma 4 is the more compelling choice. Server-side, both are competitive; pick based on your existing stack.

vs Mistral: Mistral remains strong for European deployments and production fine-tuning, but Gemma 4’s 140+ language support is broader. The 26B MoE architecture is conceptually similar to Mistral’s approach. This one comes down to fine-tuning needs and what your team already knows.

vs cloud models (Gemini, Claude, GPT-4o): Not in the same league on raw context: Gemma 4’s 256K token window looks good for a local model, but cloud Gemini offers 1M+. The pitch for Gemma 4 isn’t “beats cloud AI.” It’s privacy, cost at scale, and offline access. If you’re processing sensitive data that can’t leave your infrastructure, or running inference at a volume where per-token cloud costs are a real line item, local Gemma 4 is a serious option. For general use where cloud is fine, proprietary options like Claude and ChatGPT still have context length and ecosystem advantages.
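
To make “a real line item” concrete, here’s a back-of-envelope comparison. Both figures are illustrative assumptions, not vendor quotes; plug in your own prices and volume.

```python
# Back-of-envelope cloud vs local cost. ILLUSTRATIVE numbers only:
# neither the token price nor the GPU cost is a vendor quote.
cloud_price_per_m_tokens = 3.00   # assumed $/1M output tokens
tokens_per_month = 500_000_000    # 500M tokens of monthly volume
gpu_cost = 2_000                  # assumed 24GB consumer card
amortisation_months = 24

cloud_monthly = tokens_per_month / 1_000_000 * cloud_price_per_m_tokens
local_monthly = gpu_cost / amortisation_months  # ignores power and ops time

print(f"cloud: ${cloud_monthly:,.0f}/month")  # $1,500/month
print(f"local: ${local_monthly:,.0f}/month")  # ~$83/month plus electricity
```

At that assumed volume the local option wins by an order of magnitude; at low volume, cloud wins on convenience.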

Who Should Use Gemma 4?

Mobile app developers: The E2B and E4B models are the most interesting thing in this release for mobile. Offline AI with real capability — OCR, speech recognition, multimodal understanding — co-engineered for Android’s AICore framework.

Local coding assistant builders: The 31B Dense on a quantized consumer GPU is a viable offline coding assistant. Useful if you can’t or don’t want to send code to an external API. Google explicitly positions this as a local-first alternative to cloud IDE integrations.

Agent and automation builders: Native function calling and JSON output make Gemma 4 a clean local backbone for agentic workflows — particularly where data sovereignty matters.

Researchers and fine-tuning teams: Apache 2.0 plus the 31B Dense architecture is a solid base. Google has examples: Yale used Gemma for cancer therapy discovery research; INSAIT built a Bulgarian-first language model on it.

Enterprise teams re-evaluating open-source AI: Apache 2.0 removes the licence risk that kept Gemma out of production environments. If your legal team said no to Gemma 3, that answer changes now.

How to Get Started

The models are live:

  • Hugging Face: Published at gg-hf-gg/gemma-4-* — all four variants
  • Ollama: ollama pull gemma4 is the simplest path for local testing (see the smoke-test sketch after this list)
  • Google AI Studio: Try Gemma 4 31B at aistudio.google.com without local hardware
  • Android AICore Developer Preview: Available now for forward-compatibility testing on mobile
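
Once ollama pull gemma4 finishes, a quick way to confirm the local server answers is to hit Ollama’s REST API directly. A minimal smoke test, assuming the default port:

```python
# Smoke test against Ollama's local REST API (default port 11434).
# Assumes `ollama pull gemma4` has already completed.
import json
import urllib.request

payload = json.dumps({
    "model": "gemma4",
    "prompt": "Say hello in one short sentence.",
    "stream": False,  # one JSON response instead of a token stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.loads(r.read())["response"])
```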

FAQ

What is the difference between Gemma 4 and Gemini?
Gemini is Google’s proprietary, cloud-hosted AI family. Gemma 4 is open-weight — you download the model weights and run them yourself. Gemma 4 is built on the same research as Gemini 3, but it’s smaller, has a shorter context window, and you host it yourself. Gemini is more capable for most tasks; Gemma is the option when you need local control, data privacy, or commercial freedom without per-token costs.

Can Gemma 4 run on a consumer GPU?
Yes, with quantization. The 31B and 26B models need an H100 to run unquantized. At 4-bit or 5-bit quantization, they’ll run on a 24GB consumer GPU like an RTX 4090 or 3090. The E2B and E4B edge models are designed for smartphones and single-board computers.

What licence does Gemma 4 use?
Apache 2.0. Commercial use is permitted, modification is permitted, redistribution is permitted. A significant change from the restrictive custom licence on Gemma 3 and earlier.

How does Gemma 4 compare to Llama 3 or 4?
Both are strong. Gemma 4’s edge model story is more developed for mobile and IoT, with direct chip manufacturer co-engineering. Llama has a larger established ecosystem. On server-side benchmarks, both are competitive. The choice typically comes down to deployment target and existing tooling.

Is Gemma 4 free to use commercially?
Yes. Apache 2.0 permits commercial use with no royalties or restrictions. Fine-tune it, ship it in a product, deploy it in a paid service — all fine.

What is the context window for Gemma 4?
128K tokens for E2B and E4B; 256K tokens for the 26B MoE and 31B Dense. Competitive for a local model, but shorter than cloud Gemini models (1M+ tokens).

Can Gemma 4 process images and audio?
All four models process images and video natively. The E2B and E4B edge models also support audio input for speech recognition; the 26B and 31B do not accept audio.

What This Means for Open-Source AI

Gemma 4 lands at a moment when the open-source AI gap is closing faster than most forecasts suggested. A 31B model ranking #3 globally among open models, released under Apache 2.0, with edge variants that run offline on a Raspberry Pi — that picture would have been optimistic two years ago.

The Apache 2.0 switch signals something about how Google is thinking about this market. They’re not using Gemma as a licensing moat. They want adoption. And adoption means giving developers something they can build on without legal uncertainty.

For IT teams evaluating their AI stack: Gemma 4 belongs in the conversation for workloads where data sovereignty matters, where cloud API costs at scale are a real concern, or where offline capability is a requirement. It’s not a replacement for frontier cloud models on every task — but for a lot of real-world use cases, “good enough plus local” beats “best but cloud-only.”