Research • March 23, 2026

Anthropic and OpenAI Just Evaluated Each Other's Models. That's Never Happened Before.

By AI Daily Editorial • March 23, 2026

In a field defined by competition and secrecy, Anthropic and OpenAI just did something genuinely unusual: they ran a joint safety evaluation, testing each other's frontier models against a shared set of criteria covering misalignment, instruction following, hallucinations, jailbreaks, and deceptive reasoning. The results were published simultaneously by both labs. It is the first time two leading AI labs have opened their models to structured external scrutiny from a direct competitor — and the fact that both of them agreed to it, and published findings that were not entirely flattering to either, is more notable than any individual finding in the evaluation itself.

The evaluation is one item in a larger cluster of safety infrastructure releases that both companies have been quietly building out. Anthropic published three separate tools in recent weeks: A3 (Automated Alignment Agent), an agentic framework that autonomously identifies and mitigates safety failures in language models — including sycophancy, political bias, and resistance to jailbreaks — with minimal human intervention; Petri (Parallel Exploration Tool for Risky Interactions), which deploys an automated agent to probe AI systems through diverse multi-turn conversations and then scores the results; and Bloom, an open-source framework for generating and running behavioural evaluations at scale. All three are open-sourced on GitHub. OpenAI released gpt-oss-safeguard, a pair of open-weight reasoning models licensed under Apache 2.0 specifically designed for safety classification tasks — meaning anyone can run them locally to evaluate their own AI systems. OpenAI has also committed $7.5 million to The Alignment Project, a global fund for independent alignment research housed at the UK AI Security Institute.

What's striking about this cluster of releases is the implicit message it sends about how both labs are thinking about the problem. The old assumption — that safety was a proprietary concern, something each lab would solve internally and whose value would accrue to the company that solved it first — has been quietly replaced by something that looks more like the early internet era's approach to security standards: shared tooling, open evaluation frameworks, and agreed-upon methodologies that everyone benefits from building together. This isn't altruism; it's a recognition that the field's credibility depends on there being independent, replicable ways to verify safety claims, not just labs asserting that their own models are safe.

The joint evaluation is the hardest part of this to explain away as PR. Submitting your model to your closest competitor's scrutiny, agreeing to publish results that might reveal genuine gaps, and then doing the same in return — this requires a level of institutional trust and shared interest that would have seemed implausible two years ago, when the relationship between the major labs was defined almost entirely by competitive anxiety and mutual opacity. The publication of findings by both labs suggests the results were agreed to in advance and that both sides had enough confidence in the methodology to stand behind what it found.

The content of what was found matters too, though neither lab has been specific about the most sensitive results. The evaluation covered categories that have been sources of public controversy — jailbreaks, deceptive behaviour, refusal inconsistency — but also more subtle alignment properties like whether models actually follow the spirit of instructions or just their letter, and whether they display patterns that could indicate deceptive reasoning. The shared vocabulary that comes out of this kind of joint exercise — a common understanding of what "aligned" means and how to measure it — is itself a contribution to the field, separate from any individual finding.

The timing is interesting. Both labs are operating under political pressure that cuts in opposite directions: from regulators who want more transparency about model behaviour, and from a federal government that has shown it is willing to penalise companies that maintain safety restrictions on its own access to AI. Publishing safety tooling and running joint evaluations is a way of demonstrating that safety work is real and technically grounded, not a marketing claim or a pretext for limiting military access. Whether that demonstration lands the way it's intended to depends on an audience that is not primarily technical. But for the research community and for anyone building on top of these models, the tooling and the evaluation methodology are genuinely useful — and that's where the lasting value of this week's releases will sit.

Anthropic and OpenAI Just Evaluated Each Other's Models. That's Never Happened Before.

Sources