Google released Gemma 4 on April 3, 2026: four open-weight multimodal models under Apache 2.0, from the mobile-class E2B to a 31B dense flagship that ranks among the top open models for reasoning and agent tasks.
Google dropped Gemma 4 on April 3, 2026, and it is the most significant open-model release the company has made yet. Four new models - E2B, E4B, 26B MoE, and 31B dense - ship under Apache 2.0 licensing, run across everything from a Raspberry Pi to a data-center GPU, handle text, images, and audio, and the flagship 31B currently ranks as the #3 open model on the Arena AI text leaderboard. If you build AI-powered products and you are not using a cloud API for privacy or cost reasons, Gemma 4 is the most serious open-weight option available today.
Google DeepMind published the Gemma 4 family via its official blog on April 3, 2026. The models are built directly from research conducted for the proprietary Gemini 3 series and are available immediately on Hugging Face, Kaggle, Ollama, Google AI Studio, and the AI Edge Gallery. Both pre-trained base weights and instruction-tuned variants are included at no cost under Apache 2.0.
The Apache 2.0 license is not a minor footnote. Previous Gemma releases carried more restrictive usage terms that blocked certain commercial applications. Hugging Face co-founder Clément Delangue called the switch "a huge milestone" because it opens the door to enterprise and commercial deployments that were not possible before.
Google is shipping four distinct models, each targeting a different deployment context:
| Model | Type | Context Window | Modalities | Best Deployment |
|---|---|---|---|---|
| E2B (Effective 2B) | Dense | 128K tokens | Text, image, audio | Phones, Raspberry Pi |
| E4B (Effective 4B) | Dense | 128K tokens | Text, image, audio | Phones, Jetson, laptops |
| 26B MoE (A4B) | Mixture-of-Experts | 256K tokens | Text, image | Consumer GPUs, servers |
| 31B Dense | Dense | 256K tokens | Text, image | Workstation GPUs, cloud |
The naming convention for edge models reflects effective parameter count rather than raw parameter count - the "E" prefix signals these are optimized for efficiency, not just compressed versions of larger models. The 26B MoE activates roughly 4B parameters per token (hence the A4B designation), giving it speed closer to a small model with quality closer to the full 26B.
Why the License Change Matters
Open leaderboard rankings matter, but the license change may have longer-term impact on how businesses actually adopt these models. The prior Gemma terms created legal ambiguity for commercial products - particularly for agencies and SaaS builders embedding models into customer-facing applications. Apache 2.0 removes that friction entirely.
As CIO Dive noted in its analysis, "open source models can be more easily tailored to specific business use cases and allow for more control over data and infrastructure." That control is particularly valuable for industries handling sensitive data - healthcare, legal, finance - where routing data to a third-party cloud API is either risky or prohibited. A locally-run Gemma 4 31B on your own GPU means no data leaves your infrastructure.
The historical parallel is instructive: when Llama 3 launched with open weights (under Meta's community license, notably less permissive than Apache 2.0), fine-tune variants proliferated within days on Hugging Face. The Gemma ecosystem already has 100K+ variants from prior versions. Expect that number to accelerate.
The 31B dense model is now a credible replacement for cloud API calls in workflows where latency is tolerable and data privacy is paramount. With native function calling and JSON output, you can plug it directly into agentic pipelines - think automated document processing, code review bots, or multi-step research agents - without paying per-token fees. The 256K context window is large enough to handle most real-world document tasks in a single pass.
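A minimal sketch of what that wiring could look like, assuming the models are served locally through Ollama's chat API with its JSON output mode. The `gemma4:31b` tag is a placeholder - official tags were not listed in the launch materials - and the extraction schema is purely illustrative:

```python
# Sketch: plugging a locally served model into a structured-output pipeline
# via Ollama's /api/chat endpoint. `format: "json"` tells Ollama to constrain
# the reply to valid JSON. The model tag "gemma4:31b" is an assumption.
import json


def build_extraction_request(model: str, document: str) -> dict:
    """Build a chat request asking the model for machine-readable JSON only."""
    return {
        "model": model,
        "format": "json",   # constrain decoding to valid JSON
        "stream": False,
        "messages": [
            {"role": "system",
             "content": 'Extract {"title": str, "parties": [str], "deadline": str} '
                        "from the document. Reply with JSON only."},
            {"role": "user", "content": document},
        ],
    }


def parse_reply(raw_reply: str) -> dict:
    """Parse the model's JSON reply; raises ValueError on malformed output."""
    return json.loads(raw_reply)


# Live usage would POST build_extraction_request(...) to
# http://localhost:11434/api/chat and feed the reply to parse_reply().
```

The same request shape extends to multi-step agents: feed `parse_reply`'s output into the next tool call instead of paying per-token API fees at each hop.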
E2B and E4B running multimodal inference - including audio - on a phone is genuinely new territory for open models. If Google's self-reported 4x speed and 60% battery efficiency figures hold up under independent testing, these become viable for real-time on-device features: voice-to-action agents, image analysis, offline translation in 140+ languages. The Android AICore developer preview is available now for prototyping. Treat the efficiency claims as targets to verify, not guarantees - no third-party benchmarks have published results yet as of April 4, 2026.
The cost math is straightforward: Gemma 4 models are free to download and run. Your costs are hardware and electricity. A team running hundreds of thousands of API calls per month at $3-15 per million tokens (typical for comparable proprietary models) can recoup a consumer GPU purchase within weeks. The 26B MoE is particularly interesting here - activating only 4B parameters per token means you get near-26B quality at near-4B inference cost on your own hardware.
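As a sanity check on that claim, here is a back-of-envelope break-even calculation. Every input - call volume, tokens per call, API price, GPU cost - is an illustrative assumption, not a figure from the launch:

```python
# Break-even sketch: avoided API spend vs. a one-time GPU purchase.
# Assumed inputs: 300k calls/month, ~2,000 tokens/call, $8 per million
# tokens, a $2,000 consumer GPU. Electricity is ignored for simplicity.
WEEKS_PER_MONTH = 4.33


def monthly_api_cost(calls: int, tokens_per_call: int, usd_per_m_tokens: float) -> float:
    """What the same workload would cost on a per-token cloud API."""
    return calls * tokens_per_call / 1_000_000 * usd_per_m_tokens


def weeks_to_recoup(gpu_cost: float, monthly_cost: float) -> float:
    """Weeks until avoided API spend pays off the GPU."""
    return gpu_cost / monthly_cost * WEEKS_PER_MONTH


cost = monthly_api_cost(300_000, 2_000, 8.0)  # 600M tokens/month -> $4,800
payback = weeks_to_recoup(2_000.0, cost)      # under two weeks
```

Even halving the call volume or doubling the GPU price keeps the payback period inside the "weeks, not months" range claimed above.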
Quick Start via Ollama
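A minimal quick start, assuming the models land on Ollama under a tag like `gemma4:31b` (hypothetical - check the Ollama library page for the actual names once published). After `ollama pull <tag>`, the sketch below talks to Ollama's default local endpoint using only the standard library:

```python
# Hypothetical quick start against a locally running Ollama server.
# The model tag "gemma4:31b" is a guess; run `ollama list` to see what
# you actually have pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_generate_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the completion."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Live call (requires a running server and a pulled model):
#   print(generate("gemma4:31b", "Summarize Apache 2.0 in one sentence."))
```

Nothing here leaves your machine: the request goes to `localhost:11434`, which is the whole privacy argument for running the 31B locally in the first place.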
Benchmark Performance
According to Google's launch blog, the 31B dense model ranks #3 on the Arena AI text leaderboard among all open models - a human-preference ranking that is generally more meaningful than narrow academic benchmarks. The 26B MoE sits at #6. For reference, models like Llama 3.1 70B and Mistral Large 2 operate at significantly higher parameter counts to achieve comparable rankings.
Google DeepMind's own framing is that Gemma 4 delivers "an unprecedented level of intelligence-per-parameter." That claim is consistent with the leaderboard data but should be contextualized: Arena rankings measure general chat quality, not coding ability, tool use, or domain-specific tasks. Independent evaluations on reasoning chains, multi-step function calling, and code generation have not yet been published as of this writing. Expect the developer community to fill that gap over the coming week.
Gemma 4 arrives ahead of anticipated updates from Meta (Llama 4) and Mistral. The Apache 2.0 move goes a step further than Llama 3's community license, which still restricts some large-scale commercial use, and puts pressure on models released under non-commercial terms. When licensing friction disappears, adoption accelerates - and more adoption means more fine-tunes, more tooling support, and ultimately more real-world validation.
The edge story is particularly notable. Google confirmed hardware partnerships with Qualcomm and MediaTek for the E2B and E4B models, which means optimized drivers and on-device acceleration are part of the roadmap - not an afterthought. This positions Gemma 4 edge models against Apple's on-device AI work and Qualcomm's own AI platform, with the open-weight advantage that neither Apple nor Qualcomm can match.
The Gemini Nano 4 integration for Android - slated for later in 2026 - will determine whether the edge story reaches mainstream Android developers or stays in the prototyping tier. That timeline means the AICore developer preview available now is the opportunity for early movers to build and ship before broader platform support arrives.