The competition between AI weather models and traditional numerical weather prediction has, by most measures, been settled. Google DeepMind's GenCast outperforms the European Centre for Medium-Range Weather Forecasts' ensemble system — long the gold standard of global forecasting — on predictions up to fifteen days out. Microsoft's Aurora beat the National Hurricane Center on five-day tropical cyclone track forecasts, the first time a machine-learning model has achieved that. DeepMind's WeatherNext 2 runs eight times faster than its predecessor, enabling forecasters to analyse far more possible scenarios per run and improving predictions of low-probability catastrophic events. Nvidia has released open-source software for building AI forecasting systems, bringing these capabilities within reach of governments and agencies that previously could not afford to run competitive global models. The headline result is clear: for the kinds of weather events that historical data captures well, AI has arrived.
The complication is precisely in that phrase "historical data." All of these models are trained on decades of past weather observations — temperature records, pressure readings, wind measurements, humidity profiles — and they learn the statistical patterns that make those observations predictable. They are, in essence, very sophisticated pattern-matchers trained on climate as it was. Climate change is systematically altering the distribution of what weather actually does. More intense heat events, more extreme precipitation, shifts in jet stream behaviour, altered hurricane tracks — these are the events that matter most for human societies, and they are, by definition, underrepresented in the training data: the extremes becoming more frequent as the climate changes are precisely the ones the historical record contains the fewest examples of.
Scientific American's analysis of this problem points to a fundamental tension in data-driven forecasting: the better a model is at matching historical patterns, the worse it may be at generalising to a climate regime that those patterns no longer fully describe. Traditional numerical weather prediction models are built on physical equations — the Navier-Stokes equations of fluid dynamics, the laws of thermodynamics — that describe atmospheric behaviour from first principles. They are not trained on historical data in the machine-learning sense; they simulate physics. When the climate shifts, a physics-based model needs updated inputs and boundary conditions, but its core equations remain valid. An AI model trained on historical data has no such anchor. Its "knowledge" of what weather does is encoded in weights tuned to a distribution that is changing beneath it.
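The contrast between a learned statistical fit and a physical law can be made concrete with a toy sketch. This is illustrative only, with made-up numbers — nothing here resembles a real weather model. A polynomial fitted to data from one regime matches it well in-distribution, but extrapolates badly once the inputs shift into new territory, while the true governing function remains valid everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a physical law: a fixed nonlinear relationship that holds
# everywhere, regardless of which inputs we happen to have observed.
def physics(x):
    return np.tanh(x)

# "Historical" training data: inputs drawn from the old regime around 0.
x_train = rng.normal(loc=0.0, scale=1.0, size=1000)
y_train = physics(x_train) + rng.normal(scale=0.05, size=1000)

# Stand-in for a data-driven model: a cubic polynomial fitted to history.
coeffs = np.polyfit(x_train, y_train, deg=3)

def fitted_model(x):
    return np.polyval(coeffs, x)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Evaluate inside the training distribution vs. in a shifted regime.
x_in = rng.normal(loc=0.0, scale=1.0, size=1000)
x_shifted = rng.normal(loc=4.0, scale=1.0, size=1000)  # "new climate"

err_in = rmse(fitted_model(x_in), physics(x_in))
err_shifted = rmse(fitted_model(x_shifted), physics(x_shifted))
print(f"in-distribution RMSE: {err_in:.3f}")
print(f"shifted-regime RMSE:  {err_shifted:.3f}")
```

The fitted model is excellent on inputs like those it was trained on and increasingly wrong as the input distribution moves away from them; the "physics" function, by construction, needs no retraining. Real forecasting models are vastly more sophisticated, but the structural point — that a statistical fit carries no guarantee outside its training distribution — is the same.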
This does not mean AI weather forecasting is a mistake — it clearly is not, and the performance gains are real and valuable for the vast majority of forecasting tasks. But it does mean the transition from numerical to AI-based forecasting should be managed with this specific vulnerability in mind. The models that are currently beating traditional systems on benchmark comparisons are being evaluated against historical test sets. The question of how they perform on genuinely unprecedented events — events outside the training distribution not because they are random flukes but because the climate has shifted into new territory — is harder to answer in advance. By the time the answer is clear, the models will have been operationalised at scale.
Bloomberg's feature on AI weather models captures the optimism among forecasters: faster, cheaper, more accessible models that can run more scenarios per forecast are genuinely useful, and the accuracy improvements are meaningful in terms of lead time for evacuation decisions, agricultural planning, and energy grid management. Nvidia's open-source platform makes these capabilities available to national meteorological services that previously relied on expensive proprietary infrastructure. The democratisation of high-quality forecasting is a real and significant benefit. The research community is not unaware of the distributional shift problem — there are active efforts to retrain models on climate projections rather than purely historical data, and to combine AI with physics-based components to improve generalisation. But these efforts are works in progress, and the operational deployment of AI forecasting systems is proceeding faster than the research on their limits.
The deeper issue is one that recurs throughout AI deployment in high-stakes settings: the benchmark that demonstrates a system's superiority may not be the benchmark that matters most when the system fails. AI weather models are winning on average accuracy across all events. The tails of the distribution — the extreme events, the novel situations, the cases where the training data offers the least guidance — are where the risk lies, and where the consequences of failure are largest. That is not a reason to halt deployment; the improvements in typical-case forecasting save lives and resources. But it is a reason to maintain physics-based systems as backstops rather than retiring them, and to be appropriately cautious about claims that the forecasting problem is solved.
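The gap between average-case and tail-case evaluation can also be shown with a small synthetic sketch — made-up error distributions, not real forecast verification data. A model can win on overall mean absolute error and still lose badly on the rare extremes if it systematically damps them toward the climatological mean, a plausible failure mode for a model fitted mostly to typical conditions:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000
truth = rng.normal(0.0, 1.0, size=n)
extreme = np.abs(truth) > 2.5  # rare tail events, roughly 1% of cases

# Model A: moderate errors everywhere (stand-in for a physics-based system).
err_a = rng.normal(0.0, 0.5, size=n)

# Model B: tighter errors in the bulk, but it pulls extreme forecasts
# back toward the mean, so its error grows with the size of the extreme.
err_b = rng.normal(0.0, 0.3, size=n)
err_b[extreme] -= 0.6 * truth[extreme]

def mae(errors, mask=None):
    errors = errors if mask is None else errors[mask]
    return float(np.mean(np.abs(errors)))

print(f"overall MAE: A={mae(err_a):.3f}  B={mae(err_b):.3f}")
print(f"tail MAE:    A={mae(err_a, extreme):.3f}  B={mae(err_b, extreme):.3f}")
```

Because the extremes are only about one case in a hundred, model B's poor tail performance barely registers in the overall score — which is exactly why a benchmark averaged over all events can crown a model that is weakest where the stakes are highest.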