NVIDIA releases Gemma 4 multimodal models optimized for Blackwell, H100, and Jetson devices using NVFP4 quantization - cutting model size roughly in half for edge AI deployments.

NVIDIA has published optimized builds of Google's Gemma 4 multimodal model family for Blackwell GPUs, H100s, and Jetson edge devices - complete with a new 4-bit floating-point quantization format called NVFP4 that cuts the 31B model's memory footprint roughly in half while claiming near-8-bit accuracy. BF16 checkpoints are available on Hugging Face right now. NVFP4 quantization for the 31B variant is available via NVIDIA Model Optimizer for vLLM on Blackwell hardware, with the remaining variants listed as "coming soon."
Community discussion is essentially nonexistent at this point - no Reddit threads, no X takes, no Hacker News posts. That is consistent with a launch that just dropped and targets a relatively narrow audience: teams with actual Blackwell or Jetson hardware in hand. If you are in that group, here is what you need to know.
NVIDIA published four Gemma 4 variants with hardware-specific optimizations across its product line. All four are multimodal (text and vision), support 140+ languages, and ship under the Apache 2.0 license - meaning you can use them commercially without licensing fees.
| Model | Architecture | Target Hardware | NVFP4 Available |
|---|---|---|---|
| Gemma 4 31B | Dense | Single H100 / Blackwell | Yes (via Model Optimizer) |
| Gemma 4 26B-A4B | MoE (4B active) | Single H100 | Coming soon |
| Gemma 4 E4B | Edge | Jetson Orin Nano / Thor | Coming soon |
| Gemma 4 E2B | Edge | Jetson Orin Nano / mobile | Coming soon |
The MoE (Mixture of Experts) architecture on the 26B-A4B is worth flagging: the model has 26 billion total parameters but only activates roughly 4 billion per token. That means H100-class memory with a fraction of the compute cost at inference time.
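A quick back-of-envelope makes the MoE saving concrete. Using the standard approximation that a transformer forward pass costs about 2 FLOPs per active parameter per token (a rule of thumb, not a figure from the release):

```python
# Per-token compute, using the rough rule of ~2 FLOPs per active parameter.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_31b = flops_per_token(31e9)  # Gemma 4 31B: every weight is active
moe_a4b = flops_per_token(4e9)     # 26B-A4B: only ~4B of 26B params fire per token

print(f"dense 31B: {dense_31b:.2e} FLOPs/token")
print(f"MoE A4B  : {moe_a4b:.2e} FLOPs/token")
print(f"ratio    : {dense_31b / moe_a4b:.2f}x")  # ~7.75x less compute per token
```

Note the flip side: all 26 billion weights still have to sit in memory, which is why the MoE variant targets H100-class memory despite its small active-compute footprint.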
Most people who run local models have seen INT4 or INT8 quantization - both trade numeric precision for a smaller model size. NVFP4 is NVIDIA's take on 4-bit quantization, but it uses a floating-point representation rather than an integer one, which gives it more dynamic range for handling extreme values in model weights.
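To see the dynamic-range difference concretely, here is a small sketch enumerating the magnitudes an E2M1 float (the 4-bit layout NVFP4 uses) can represent, next to INT4's uniform levels:

```python
# Magnitudes representable by E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.
def e2m1_magnitudes() -> list[float]:
    vals = set()
    for exp in range(4):       # 2 exponent bits -> exponent field 0..3
        for man in range(2):   # 1 mantissa bit
            if exp == 0:
                vals.add(man * 0.5)                       # subnormals: 0, 0.5
            else:
                vals.add((1 + man / 2) * 2 ** (exp - 1))  # normals
    return sorted(vals)

print(e2m1_magnitudes())      # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
int4_levels = list(range(8))  # INT4 positive levels: 0, 1, 2, ..., 7 (uniform)
```

The float grid is dense near zero (steps of 0.5) and sparse near its maximum (a gap of 2 between 4 and 6), which matches how model weights cluster around zero with occasional outliers; INT4 spends its eight positive codes uniformly.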
The technical specifics: NVFP4 stores each weight as a 4-bit float (E2M1 format - 2 exponent bits, 1 mantissa bit) plus a two-level scaling system. Every 16 weights share an E4M3 (8-bit float) scaling factor at the block level, and a separate FP32 scaling factor applies across the entire tensor. This two-level approach is what NVIDIA says allows it to preserve near-8-bit accuracy despite working at 4-bit precision.
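A minimal NumPy sketch of the block-scaled round trip, as an illustration of the scheme described above rather than NVIDIA's actual kernel: the per-block scale is kept as a plain float here, where real NVFP4 would store it as an 8-bit E4M3 value under a per-tensor FP32 scale.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # |values| E2M1 encodes

def fake_nvfp4_roundtrip(w: np.ndarray) -> np.ndarray:
    """Quantize one 16-weight block NVFP4-style, then dequantize it.

    Simplified: the block scale stays full precision here, whereas real
    NVFP4 stores it in E4M3 with a second FP32 scale over the whole tensor.
    """
    assert w.size == 16, "NVFP4 blocks are 16 weights"
    scale = np.abs(w).max() / E2M1[-1]  # map the block's max onto E2M1's max (6.0)
    if scale == 0:
        return w.copy()
    # Snap each scaled magnitude to the nearest E2M1 grid point, keep the sign.
    idx = np.abs(np.abs(w / scale)[:, None] - E2M1[None, :]).argmin(axis=1)
    return np.sign(w) * E2M1[idx] * scale

rng = np.random.default_rng(7)
block = rng.normal(size=16)
deq = fake_nvfp4_roundtrip(block)
print("worst-case error in block:", np.abs(block - deq).max())
```

Because the scale pins the block maximum to 6.0 and the widest grid gap is 2, the round-trip error per weight is bounded by max|w|/6 for that block; the E4M3 block scale in the real format adds a little more error on top.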
Critically: NVFP4 is native to Blackwell Tensor Cores. On non-Blackwell hardware, you will not get the hardware acceleration benefits. You can still run BF16 models on H100, or use FP8 quantization on Hopper-era GPUs - but the NVFP4 speedup is a Blackwell-exclusive feature.
What the Performance Claims Are Worth
NVIDIA and third-party explainers have put forward several performance claims. These all originate from NVIDIA or sources citing NVIDIA's own data - no independent third-party benchmarks exist yet for Gemma 4 specifically.
The "50x energy efficiency vs. H100" figure stands out and warrants caution. NVIDIA has not published the specific workload, model size, or batch configuration behind that number. Treat it as a ceiling-case marketing claim until you can benchmark against your actual use case.
The practical value here depends entirely on your hardware situation. There are three distinct profiles worth separating:
If you run H100s today: you can pull the BF16 Gemma 4 31B and 26B-A4B checkpoints from Hugging Face and run them on a single H100 (80GB). Both models are confirmed to fit. This is a real advantage: a 31B dense multimodal model that handles 140 languages and vision tasks on one GPU is operationally useful for prototyping multilingual chatbots, document analysis pipelines, or vision-language applications without routing through a cloud API.
The cost comparison vs. Gemini API calls depends on your volume, but for latency-sensitive or privacy-sensitive use cases, on-prem on a rented H100 is worth running the math on.
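Rough weights-only memory math (my arithmetic, not NVIDIA's numbers; KV cache, activations, and runtime overhead come on top):

```python
# Weights-only footprint of Gemma 4 31B at each precision. NVFP4 adds one
# 8-bit E4M3 scale per 16-weight block, i.e. 0.5 extra bits per weight.
PARAMS = 31e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

bf16 = weight_gb(16)           # 62.0 GB -> fits an 80 GB H100 with KV-cache headroom
fp8 = weight_gb(8)             # 31.0 GB
nvfp4 = weight_gb(4 + 8 / 16)  # ~17.4 GB including block-scale overhead

print(f"BF16 {bf16:.1f} GB | FP8 {fp8:.1f} GB | NVFP4 {nvfp4:.1f} GB")
print(f"NVFP4 vs FP8: {fp8 / nvfp4:.2f}x smaller")
```

The NVFP4-vs-FP8 ratio works out to about 1.8x, which suggests the "roughly half" figure in the announcement is a comparison against FP8 rather than BF16; against BF16 the reduction is closer to 3.5x.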
If you run Blackwell: this is where the release is most meaningful. NVIDIA Model Optimizer can quantize the 31B to NVFP4 for vLLM right now, giving you the ~2x memory reduction and native Tensor Core acceleration. For teams running inference at scale on B100/B200/DGX Spark hardware, this is a real throughput gain per dollar of compute.
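For orientation, the quantization step with Model Optimizer looks roughly like the sketch below. Treat every identifier as an assumption to check against the Model Optimizer documentation for your installed version (config names and export helpers have moved between releases); `model` and `calibrate` are placeholders for your loaded checkpoint and calibration loop, and this is a workflow illustration, not a verified recipe.

```python
# Sketch only - API names are assumptions; consult NVIDIA Model Optimizer docs.
import modelopt.torch.quantization as mtq

# 1. Load the BF16 Gemma 4 31B checkpoint with your usual HF/PyTorch loader.
# 2. Calibrate and quantize to NVFP4 (config name assumed):
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
# 3. Export a quantized checkpoint in a format vLLM can load, then serve it on
#    Blackwell hardware, where the NVFP4 Tensor Core path is available.
```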
If you target edge devices: the E2B and E4B variants are built for Jetson Orin Nano and Jetson Thor devices - hardware that runs in the $500 range for the Orin Nano developer kit. These are the models you would deploy for on-device AI in robotics prototypes, industrial inspection cameras, or mobile applications that cannot rely on network connectivity. NVFP4 quantized versions of the edge variants are not yet available, but the BF16 E2B/E4B builds are there to prototype with now.
The Strategic Picture
This release is part of a consistent pattern from NVIDIA. Each GPU generation has pushed the viable precision floor lower: FP16 on Pascal, BF16 on Ampere, FP8 on Hopper, now FP4 on Blackwell. BF16 traded mantissa for range at the same 16-bit width; the FP8 and FP4 steps each roughly double the model you can run on the same GPU - or halve the GPU cost to serve the same model.
The competitive read is that NVIDIA wants Gemma 4 - Google's open-weights multimodal model - to be the default choice for teams building on Blackwell and Jetson, ahead of Meta's Llama family and Mistral. NVIDIA already optimized Llama Nemotron Super for B200 FP4, claiming 6x throughput vs. H100 FP8. Gemma 4 NVFP4 extends that same playbook to a different model family, giving Blackwell buyers more reasons to commit to the ecosystem.
For buyers still on H100/H200, this announcement is less urgent - but it does accelerate the depreciation clock on that hardware for inference-heavy workloads. The gap between H100 FP8 and Blackwell FP4 inference efficiency is wide enough that mid-sized teams planning infrastructure in 2025-2026 should model out the upgrade math.