Google Cloud Brings Gemma 4 to TPUs with GKE Agent Sandbox
Google Cloud now supports Gemma 4 on TPUs via GKE, GCE, and Vertex AI. Here's what operators and builders need to know about access, cost, and the new GKE Agent Sandbox.
Google Cloud made Gemma 4 available on its TPU infrastructure on April 4, 2026, opening up the company's latest open-weight model family for serving, pretraining, and fine-tuning on GKE, GCE, and Vertex AI. The announcement also references a GKE Agent Sandbox for secure multi-step AI planning - a feature that could matter for teams building agentic applications on Google Cloud, though details remain sparse at launch. Community discussion is essentially nonexistent as of publication, which is consistent with an announcement that landed just hours ago and skews heavily toward enterprise and infrastructure audiences.
What Happened
Google Cloud published a blog post on April 4, 2026, confirming that Gemma 4 - specifically the Gemma-4-31B dense and Gemma-4-26B-A4B MoE (Mixture-of-Experts) variants - is now available on Cloud TPUs through three infrastructure paths: Google Kubernetes Engine (GKE), Google Compute Engine (GCE), and Vertex AI. The MoE architecture uses only 4 billion active parameters despite the 26B total parameter count, which is relevant for inference efficiency on TPU pods.
Cloud Run also gets Gemma-4-31B-it support via NVIDIA RTX PRO 6000 GPUs, offering a serverless GPU path for teams that want Gemma 4 without managing TPU clusters directly.
What's New
Gemma-4-31B and Gemma-4-26B-A4B MoE are now available on Google Cloud TPUs via GKE, GCE, and Vertex AI.
Open-source TPU projects - including vLLM, JetStream, MaxText, and Saxml - are supported for serving, pretraining, and post-training workflows.
Gemma-4-31B-it available on Cloud Run with NVIDIA RTX PRO 6000 GPUs for serverless inference.
GKE Agent Sandbox referenced as a mechanism for secure, multi-step AI planning and sub-second code execution - specific throughput claims are unverified in the official source.
Sovereign deployment options implied through TPU-based data control in controlled environments - explicit air-gapped or sovereign cloud details are not published in the announcement.
On the GKE Agent Sandbox claims
Some early summaries cite '300 sandboxes/sec' and sub-second LLM code execution. These figures do not appear in Google's official announcement blog as of April 4, 2026. Treat them as unverified until Google publishes supporting documentation or independent benchmarks surface.
Why This Matters for Your Business
If you are already running workloads on Google Cloud and want to self-host a frontier-class open model, Gemma 4 on TPUs is now a real option. The two model variants address different use cases:
Gemma-4-31B (dense) - suited for high-quality generation tasks where you want consistent output quality across all parameters. Cloud Run with RTX PRO 6000 support makes this accessible without cluster management.
Gemma-4-26B-A4B MoE - the MoE design activates only 4B parameters per forward pass, which should yield lower inference latency and cost per token compared to equivalent dense models. For high-throughput, cost-sensitive workloads - document processing pipelines or agent loops making many short calls - this is the more interesting variant.
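The efficiency argument for the MoE variant comes down to simple arithmetic. A minimal sketch, using the standard approximation that decode-time compute scales with roughly 2 FLOPs per active parameter per token (this ignores attention, KV-cache, and router overhead, so it is an illustration rather than a benchmark):

```python
# Back-of-envelope comparison of per-token compute for the two Gemma 4
# variants. Uses the common ~2 * active_params FLOPs/token approximation;
# real latency and cost depend on hardware, batching, and serving stack.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token."""
    return 2.0 * active_params

dense_31b = flops_per_token(31e9)  # Gemma-4-31B: all 31B parameters active
moe_a4b = flops_per_token(4e9)     # Gemma-4-26B-A4B: ~4B active of 26B total

ratio = dense_31b / moe_a4b
print(f"Dense 31B does ~{ratio:.2f}x the per-token compute of the A4B MoE")
```

On this rough model the dense variant does nearly 8x the per-token compute, which is why the MoE is the natural pick for high-volume, short-call workloads.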
For teams in regulated industries - healthcare, financial services, legal tech - the prospect of running a 26-31B parameter model inside your own GKE cluster with TPUs, without data leaving your VPC, is genuinely valuable. Most API-based LLM providers cannot offer that level of data control. Google's positioning here aligns Gemma 4 with enterprise data governance requirements, even if the public announcement does not spell out sovereign cloud specifics.
The GKE Agent Sandbox: What We Know (and Don't)
The GKE Agent Sandbox is framed as a secure execution environment for multi-step AI planning - a container-isolated space where an LLM agent can run tool calls, write and execute code, and take multi-turn actions without escaping into broader infrastructure. This is an important safety primitive for agentic workflows. If an agent hallucinates a shell command or a malicious input triggers unexpected code, the sandbox contains the blast radius.
What we cannot confirm yet: how the sandbox integrates with Gemma 4 specifically, what the actual throughput numbers are, and whether this is available in preview or GA. Independent developer evaluations have not surfaced as of this writing. If you are building agentic pipelines on GKE, watch for Google's documentation updates and community tutorials before architecting around this feature.
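To make the containment idea concrete, here is a deliberately minimal sketch of the general pattern - running agent-generated code in a separate process with a hard time budget. This is emphatically not the GKE Agent Sandbox API (which is undocumented at launch); production isolation relies on containers or gVisor-style kernels, not just process timeouts.

```python
# Illustrative containment pattern only, NOT the GKE Agent Sandbox.
# Agent-generated code runs in a child process with a hard timeout so a
# runaway or hallucinated command cannot stall the host loop.
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 2.0) -> tuple[bool, str]:
    """Execute agent-generated Python in a separate process with a time budget."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user site-packages
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        return False, "killed: exceeded time budget"

ok, out = run_untrusted("print(2 + 2)")          # well-behaved code succeeds
bad, msg = run_untrusted("while True: pass", 0.5)  # runaway loop gets killed
```

A real sandbox adds filesystem, network, and syscall isolation on top of this; the timeout is just the simplest piece of the blast-radius story.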
Access and Cost Reality
Be honest with yourself before diving in. TPU-based deployment is not a weekend project for a solo operator. Here is a realistic breakdown of what access looks like:
Who Can Realistically Deploy Gemma 4 on TPUs
| Profile | Path | Barrier | Verdict |
| --- | --- | --- | --- |
| Solo builder / indie hacker | Cloud Run (serverless GPU) | Low - no cluster management | Start here with Gemma-4-31B-it |
| Small agency (2-10 people) | Vertex AI managed endpoint | Medium - needs GCP familiarity | Viable with Google Cloud credits |
| Mid-size team with MLOps | GKE + TPUv5e + JetStream | High - GKE cluster, TPU quota, regional availability | Strong fit if already on GKE |
| Enterprise / regulated industry | GKE + sovereign deployment | High - procurement, compliance review | Best data control option available |
TPU pricing is not published in the announcement. Historically, Cloud TPUv5e on-demand pricing has run cheaper than equivalent A100 GPU hours for sustained inference workloads, but the break-even depends heavily on utilization. Google Cloud credits for startups can make early experimentation essentially free - check the Google for Startups program if you qualify.
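The utilization point is worth making explicit. A minimal break-even sketch - every price and throughput number below is a placeholder, not a published Google rate; only the shape of the math is the point:

```python
# Hypothetical cost-per-token sketch. The hourly rate and tokens/sec
# figures are placeholders for illustration, NOT published TPU pricing.
# The takeaway: cost per token scales inversely with utilization.

def cost_per_million_tokens(hourly_rate_usd: float,
                            tokens_per_sec: float,
                            utilization: float) -> float:
    """USD per 1M generated tokens at a given sustained utilization."""
    effective_tps = tokens_per_sec * utilization
    return hourly_rate_usd / (effective_tps * 3600) * 1_000_000

busy = cost_per_million_tokens(1.20, 500, 0.9)  # well-utilized accelerator
idle = cost_per_million_tokens(1.20, 500, 0.1)  # mostly idle: ~9x the unit cost
print(f"${busy:.2f} vs ${idle:.2f} per 1M tokens")
```

The same instance at 10% utilization costs nine times as much per token as at 90%, which is why dedicated TPU capacity only beats per-token API pricing when traffic is sustained.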
TPU quota and regional availability
TPUs are not available in every GCP region. GKE TPU workloads currently require zonal clusters in supported zones, such as those in the us-west1 region. If your existing GKE setup runs in a different region, plan for a migration or a parallel cluster before committing to this path.
The Bigger Picture: TPUs as a Competitive Moat
Google's pattern here is consistent: open-source the model weights, then make TPUs the most friction-free path to run them at scale. Gemma models serve as a pull strategy for Google Cloud TPU consumption. It worked with earlier Gemma releases and JAX/TensorFlow workloads, and this announcement extends that playbook to a significantly more capable model family.
The competitive read: if you are serving open models on AWS Inferentia or Azure's ND series, Google is actively trying to lure you to TPUs with native model optimization, open-source serving tool support (vLLM, JetStream, MaxText, Saxml), and tighter MLOps integration through Vertex AI. For teams with no existing cloud preference, this is a compelling pitch. For teams already deep in AWS or Azure tooling, the switching costs are real.
No direct competitor response has been published as of April 4, 2026. This announcement dropped the same day as this article, so market reactions will take time to emerge.
How to Get Started Today
Google has published documentation for each deployment path. The fastest starting point depends on your infrastructure maturity:
Cloud Run (easiest): Deploy Gemma-4-31B-it on NVIDIA RTX PRO 6000 via Cloud Run. Serverless, no cluster provisioning. Best for teams prototyping or with inconsistent traffic.
Vertex AI (managed): Use Saxml recipes on Vertex AI for a more managed experience. Recommended for teams that want MLOps tooling without running raw GKE clusters.
GKE + JetStream (performant): Follow Google's GKE TPU tutorial to serve Gemma 4 on TPUv5e pods using JetStream. Highest throughput ceiling, highest setup complexity.
vLLM on TPUs: vLLM now supports Gemma 4 on both TPUs and GPUs. If your team already uses vLLM for OpenAI-compatible serving, this is the lowest-friction path to swap in Gemma 4.
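Because vLLM exposes an OpenAI-compatible HTTP API, swapping Gemma 4 into an existing pipeline is mostly a matter of pointing requests at your own endpoint. A standard-library sketch of the request shape - the endpoint URL and the model identifier below are assumptions; substitute whatever your `vllm serve` deployment actually exposes:

```python
# Sketch of an OpenAI-compatible /v1/chat/completions request, as served
# by vLLM. The base URL and model name are illustrative placeholders.
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request in the OpenAI-compatible shape vLLM serves."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical endpoint and model id; sending is left out so the sketch
# has no live dependency: urllib.request.urlopen(req) would dispatch it.
req = build_chat_request("http://localhost:8000", "google/gemma-4-26b-a4b",
                         "Summarize MoE routing in one sentence.")
```

Because the shape matches OpenAI's API, most existing client code needs only a base-URL and model-name change to target the self-hosted deployment.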
Risks and Caveats
Vendor lock-in: TPU-optimized serving is Google Cloud-specific. vLLM support on both GPUs and TPUs reduces this risk for teams willing to maintain dual configs.
No independent benchmarks: Latency, throughput, and cost-per-token numbers all come from Google. Third-party evaluations are not available yet.
GKE Agent Sandbox maturity: Feature completeness, GA status, and real-world performance for agentic workloads are unconfirmed. Do not architect a production agentic system around this until Google publishes documentation and the community validates it.
TPU quota constraints: TPU quota is not on-demand in the same way GPU instances are. Submit quota requests early if you plan to scale.
Sovereign cloud claims: The official announcement does not detail air-gapped or sovereign cloud configurations. Confirm specific data residency requirements with Google Cloud sales before relying on this for regulated workloads.
Frequently Asked Questions
Can I run Gemma 4 on Google Cloud without managing a GKE cluster?
Yes. Cloud Run now supports Gemma-4-31B-it on NVIDIA RTX PRO 6000 GPUs in a serverless configuration - no cluster management required. Vertex AI also offers a more managed path than raw GKE for teams that want MLOps tooling without infrastructure overhead.
How does the Gemma-4-26B-A4B MoE variant differ from the 31B dense model?
The MoE (Mixture-of-Experts) model has 26 billion total parameters but activates only about 4 billion per inference step. This typically results in lower latency and cost per token compared to a dense 31B model, making it a better fit for high-throughput workloads like document pipelines or agent loops that make many short LLM calls.
What is the GKE Agent Sandbox and is it ready for production?
The GKE Agent Sandbox is described as a secure, isolated execution environment for multi-step AI agent workflows - designed to contain actions taken by an LLM agent so they cannot escape into broader infrastructure. As of April 4, 2026, production readiness and detailed throughput specs are unconfirmed. Wait for official GA documentation and community validation before building production agentic systems on this feature.
Does Gemma 4 on Google Cloud TPUs support sovereign or air-gapped deployments?
Google's announcement implies data control through TPU-based deployment in controlled GKE environments, but the official blog does not specify air-gapped configurations or named sovereign cloud regions. If data residency and isolation are compliance requirements, contact Google Cloud directly to get specific availability and configuration details before committing to an architecture.
Is Gemma 4 available on open-source serving frameworks, or is it Google Cloud-only?
Gemma 4 is supported by vLLM on both TPUs and GPUs, which means you can run it outside of Google Cloud. JetStream, MaxText, and Saxml also support Gemma 4 TPU workloads. The model weights are openly available, so Google's TPU integration is an accelerant, not a gate - though TPU-specific optimizations are only available on Google Cloud infrastructure.