For most of AI's recent history, the models worth talking about have lived in data centres. Running a genuinely capable AI meant sending your query to a server farm somewhere, waiting for a response, and accepting that the compute involved was expensive, energy-hungry, and entirely outside your control. Google's release of Gemma 4 this week is the latest and clearest sign that this is changing. The smallest model in the family runs on a phone. The largest ranks third among all open models in the world. Both are free to use and modify under an Apache 2.0 licence.
Gemma 4 comes in four sizes: 2B and 4B models designed for edge devices, a 26B Mixture-of-Experts (MoE) model, and a 31B dense model. The headline benchmarks are notable: the 31B sits at number three on the Arena AI text leaderboard, which tracks real-world user preference rather than curated tests, and the 26B takes sixth. But the more interesting figure is the one behind those rankings: the 26B MoE model activates only 3.8 billion parameters at inference time, reaching near-top-tier performance while doing the computational work of a model less than a sixth its size.
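The mechanism behind that number can be sketched with a toy mixture-of-experts layer: a router scores each token against every expert, only the top-k highest-scoring experts actually run, and the rest of the parameters sit idle for that token. Everything below (dimensions, expert count, routing scheme) is illustrative and made up for the sketch, not Gemma 4's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen for readability rather than realism.
d_model, d_ff = 64, 256
n_experts, top_k = 8, 2

# One small feed-forward network per expert, plus a router matrix.
experts = [(rng.normal(size=(d_model, d_ff)) * 0.02,
            rng.normal(size=(d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route a single token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]          # indices of the k best experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                      # softmax over chosen experts only
    out = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)  # ReLU expert MLP
    return out, chosen

y, used = moe_forward(rng.normal(size=d_model))

# Only top_k of n_experts run per token, so expert-layer compute scales
# with this fraction of the expert parameters, not with all of them.
active_fraction = top_k / n_experts
```

The key property is visible in `active_fraction`: compute in the expert layers scales with the routed share of parameters, which is how a 26B model can do the arithmetic of a model a fraction of its size.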
That efficiency matters for reasons beyond cost. It is what allows the smaller Gemma 4 models to run entirely offline on a Raspberry Pi, a phone, or an NVIDIA Jetson Orin Nano. Near-zero latency, no internet connection required, no data leaving the device. For developers building anything that touches privacy-sensitive information, or that needs to work in low-connectivity environments, that changes the design space substantially.
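A quick weight-memory estimate shows why the small models fit on edge hardware. The bytes-per-parameter figures below are standard for the named formats; treating a 4-bit quantised build as the deployment target is my assumption, not a published Gemma 4 spec:

```python
# Approximate weight storage per parameter for common formats.
# (Ignores KV cache and activation memory, which add overhead at runtime.)
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(n_params_billion, fmt):
    """Rough weight footprint in GB for a model of the given size and format."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9

# A 4B-parameter model quantised to 4-bit needs roughly 2 GB for weights,
# comfortably within an 8 GB Raspberry Pi or Jetson Orin Nano.
print(f"{weight_gb(4, 'int4'):.1f} GB")
print(f"{weight_gb(2, 'int4'):.1f} GB")
```

The same arithmetic explains why the 26B and 31B models remain workstation-class: even at 4-bit they need on the order of 13 to 16 GB for weights alone.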
Google has been releasing Gemma models since early 2024, and the pattern is consistent: each generation offers a version of the research and architecture behind its proprietary Gemini models, open-weighted and freely available. Gemma 4 is built from the same foundation as Gemini 3. This creates an interesting dynamic. Google is effectively giving away, on a roughly one-generation lag, the technology it also sells as a premium API service. The open release competes, at least partially, with its own product.
The rationale is not purely altruistic. Open models build an ecosystem around Google's tools and infrastructure. Developers who learn on Gemma tend to deploy on Google Cloud. Open weights attract research collaborations and goodwill that closed models do not. And by anchoring the open-source conversation to its own model family, Google shapes what the broader community optimises for. Meta's Llama series plays the same game. The open release is a business strategy as much as a technical contribution.
What has changed with Gemma 4 is the performance ceiling. Earlier open models were useful but clearly second-tier compared to the best proprietary offerings. A 31B model that ranks third in the world, with weights you can download from Hugging Face and run on your own hardware, is a different proposition. The gap between what you can do with a cloud API and what you can do with local compute has narrowed to the point where, for many applications, it no longer meaningfully exists.
The open-source AI race is not slowing down. It is accelerating, and the capability threshold that justifies paying for proprietary access keeps rising. Gemma 4 is another step in a direction that has been clear for some time: frontier-quality AI is becoming a commodity, and the business models built on scarcity are running out of runway.