AI Daily
Models • March 26, 2026

OpenAI's GPT-5.4 Arrives With a Million-Token Window and a Thinking Mode for Everyone

By AI Daily Editorial • March 26, 2026

OpenAI released GPT-5.4 on March 5, and by most measures it is the most capable frontier model the company has shipped. The headline numbers are striking: record scores on the OSWorld-Verified and WebArena computer-use benchmarks, 83 percent on OpenAI's internal GDPval test for knowledge-work tasks, and a context window of up to one million tokens in the API. That last figure is the largest OpenAI has offered by some distance, and it pushes GPT-5.4 into territory previously reachable only through Google's Gemini models.

The release comes in three tiers. GPT-5.4 Pro targets people who need maximum performance on complex, multi-step tasks and is available to API customers and ChatGPT subscribers. GPT-5.4 Thinking is the reasoning-focused version, built on the same model but with explicit chain-of-thought steps surfaced to the user; it also lands in ChatGPT as the default "Thinking" mode. GPT-5.4 mini, the lighter-weight variant, is rolling out to Free and Go users through that same Thinking feature, which means that for the first time casual users get access to a reasoning-capable model without paying for a subscription upgrade.

TechCrunch notes the benchmark improvements are genuine rather than narrow. The OSWorld and WebArena scores matter because they measure computer use — the ability to operate software interfaces rather than just answer questions — which is increasingly what makes a model useful for agentic workflows. A model that can read a screen, click buttons, fill forms, and move files is categorically more useful for automation than one that can only produce text.

The million-token context window deserves more attention than it is getting in the initial coverage. At that scale, a single conversation can contain an entire software codebase, a year of company emails, or a full legal proceeding transcript. This is not primarily a novelty feature: it changes the economics of certain professional tasks in ways that are hard to overstate. Lawyers, researchers, and engineers who previously had to chunk and summarise large document sets can now hand them over in full. Whether models actually use very long contexts effectively is a separate question the benchmarks do not fully resolve, but the option being available matters.
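To make the scale concrete, here is a back-of-envelope sketch (not from OpenAI's documentation) using the common rough heuristic of about four characters per token for English text and code. Real tokenizer counts vary with content, and the example file and page sizes are illustrative assumptions, not measurements.

```python
# Rough check: what fits in a one-million-token context window?
# Assumes ~4 characters per token, a common heuristic for English
# text and source code; actual tokenizer output varies by content.

def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate derived from a raw character count."""
    return int(char_count / chars_per_token)

# A mid-sized codebase: say 2,000 files averaging 1,500 characters each.
codebase_chars = 2_000 * 1_500            # 3,000,000 characters
print(estimate_tokens(codebase_chars))    # ~750,000 tokens: fits in 1M

# A 300-page legal transcript at roughly 3,000 characters per page.
transcript_chars = 300 * 3_000            # 900,000 characters
print(estimate_tokens(transcript_chars))  # ~225,000 tokens
```

Under those assumptions, even a multi-million-character document set lands comfortably inside the window, which is the point: the chunk-and-summarise preprocessing step simply disappears for a large class of inputs.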

What is not yet clear is how GPT-5.4 compares to Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro, both of which have been the competitive reference points in the first quarter of 2026. OpenAI's benchmarks are self-reported. Independent evaluations on sites like LMSYS and LLM Stats will take a few weeks to settle into a stable consensus. The Thinking mode in particular will need testing against the reasoning tasks where Claude 3.7's extended thinking has been strongest.

The more interesting question is what the three-tier release structure reveals about OpenAI's strategy. Shipping a Pro tier, a Thinking tier, and a mini tier simultaneously suggests the company is trying to cover every price point at once, from enterprise customers paying premium API rates to free users on mobile. That breadth makes sense commercially, but it also means GPT-5.4 is doing very different things depending on which version you are actually using. "GPT-5.4" as a single brand name covers a substantial range of capability and cost, which makes comparisons between users and across reviews difficult. The model family is becoming a platform.
