Microsoft announced the Maia 200 in January — a custom AI accelerator built on TSMC's 3nm process, with 216GB of HBM3e memory at 7 terabytes per second bandwidth, described by the company as a "silicon workhorse designed for scaling AI inference." Meta followed in March with the commercial rollout of its MTIA chip into data centres. Google's seventh-generation Ironwood TPU, released in late 2025 and now running Gemini workloads at scale, has been described internally as the company's "secret weapon" in the AI race. OpenAI, which does not manufacture hardware, committed to deploying 2 gigawatts of Amazon Trainium chips as part of its cloud infrastructure agreement with AWS. In 2026, the story of AI hardware is no longer only about NVIDIA — it is about a parallel bet that the most valuable part of the AI infrastructure stack is shifting from training to inference, and that inference is where custom silicon makes the most economic sense.
The distinction between training and inference matters because they have different cost structures and different performance requirements. Training a large model is a one-time (or infrequent) event that demands raw throughput: the ability to process enormous datasets through billions of parameters as fast as possible. Inference is what happens every time someone uses the model: a query arrives, the model processes it, a response comes out. At the scale at which these companies operate, inference runs billions of times per day. The economics of inference are therefore dominated not by peak throughput but by cost per query, and cost per query is where specialised chips optimised for a specific workload can substantially undercut general-purpose GPUs.
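To make the cost-per-query framing concrete, here is a rough back-of-the-envelope sketch. Every figure in it is a hypothetical assumption chosen for illustration (hardware price, power draw, electricity price, utilisation, throughput), not a vendor specification; the point is the shape of the arithmetic, amortised hardware cost plus energy divided by queries served over the chip's lifetime.

```python
# Back-of-the-envelope cost-per-query arithmetic.
# All numbers below are hypothetical assumptions for illustration,
# not vendor specifications or benchmarks.

def cost_per_query(hardware_cost_usd, lifetime_years, power_kw,
                   electricity_usd_per_kwh, utilisation,
                   queries_per_second):
    """Amortised hardware cost plus energy cost, divided by queries served."""
    hours = lifetime_years * 365 * 24
    energy_cost_usd = power_kw * hours * electricity_usd_per_kwh
    total_cost_usd = hardware_cost_usd + energy_cost_usd
    queries_served = queries_per_second * utilisation * hours * 3600
    return total_cost_usd / queries_served

# A general-purpose GPU: higher throughput, but higher price and power draw.
gpu = cost_per_query(hardware_cost_usd=30_000, lifetime_years=4,
                     power_kw=0.7, electricity_usd_per_kwh=0.08,
                     utilisation=0.6, queries_per_second=50)

# A custom inference chip: lower peak throughput, much lower cost and power.
custom = cost_per_query(hardware_cost_usd=8_000, lifetime_years=4,
                        power_kw=0.3, electricity_usd_per_kwh=0.08,
                        utilisation=0.6, queries_per_second=35)

print(f"GPU:    ${gpu:.7f} per query")
print(f"Custom: ${custom:.7f} per query")
```

With these made-up numbers the custom chip comes out roughly 2.5 times cheaper per query despite lower peak throughput, and at billions of queries per day that gap compounds into a serving-cost difference large enough to justify a chip programme.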
NVIDIA's H100 and B200 are excellent inference chips as well as training chips, but "excellent at everything" is not the same as "optimal for this specific workload at this specific cost point." Google's TPU programme, now a decade old, was always about this trade-off: giving up some generality in exchange for dramatically lower cost per query on Google's own models running on Google's own infrastructure. Microsoft's Maia series makes the same bet. Meta's MTIA chips are designed specifically for Meta's recommendation and ranking systems, which run continuously at enormous scale and form the company's core revenue infrastructure. Each of these companies has a workload large and stable enough to justify the multi-year, multi-billion-dollar investment required to develop and manufacture a custom chip.
NVIDIA is not standing still. The Rubin platform, announced as the successor to Blackwell, maintains NVIDIA's lead on raw training performance, and the company's software ecosystem — CUDA, cuDNN, and the broader developer toolchain — creates switching costs that hardware performance comparisons alone do not capture. A custom chip that delivers lower inference cost per token is still less attractive if it requires rewriting the model serving stack. Google has spent a decade building a software ecosystem around TPUs; Microsoft and Meta are earlier in that process. The switching costs are real, and they buy NVIDIA time even as the competitive landscape shifts.
The OpenAI-Amazon Trainium commitment is the most interesting data point in the current picture because it involves a company with no chip manufacturing capacity making a large, public bet on a non-NVIDIA chip for its commercial infrastructure. Amazon developed Trainium specifically for training and inference on AWS, and OpenAI's Frontier enterprise platform will run on it. If Frontier succeeds at scale, it will generate data on whether Trainium can handle frontier-model inference workloads competitively, data that will inform the industry's view of the NVIDIA-alternative market in ways that internal chip programmes at Google or Meta cannot, because those programmes are not publicly benchmarked against commercial alternatives.
The broader implication is that the AI chip market is bifurcating. Training for frontier models will remain dominated by NVIDIA for the foreseeable future: the performance requirements are extreme, the iteration cycles are short, and the software ecosystem lock-in is deep. Inference, which is where the volume and the recurring cost live, is becoming a contested market in which custom chips from Google, Microsoft, Meta, and Amazon's cloud division each hold a viable position. Analysts who track the semiconductor industry note that custom ASICs are growing faster than the GPU market, and that growth comes mostly from inference. NVIDIA's revenue is not at risk in the near term; its long-term share of the AI infrastructure spend is a more open question.