OpenAI believes current AI benchmarks are broken

And they are working with other companies to build better ones

OpenAI believes that current AI benchmarks such as GPQA, LiveCodeBench, MMLU, ARC-AGI, and many others are broken: they fail to reflect how well an AI model performs in real-world situations.

Looking for a recent example of how AI benchmarks can mislead and be gamed? We’ve got you covered! Meta recently released Llama 4, a series of open-weight AI models including Scout, Maverick, and the yet-to-be-released Behemoth. Maverick, in particular, was advertised as a high-performing model, achieving an impressive Elo score of 1417 on LM Arena, a popular crowdsourced benchmark where human users compare AI outputs head-to-head. This score placed Maverick near the top of the leaderboard, just behind Google's Gemini 2.5 Pro and ahead of models like OpenAI's GPT-4o, suggesting it was a competitive alternative to leading proprietary models. However, cracks in this narrative emerged soon after.

The version of Maverick tested on LM Arena wasn’t the same as the one Meta made publicly available. Meta admitted that the LM Arena version was an "experimental chat version" of Maverick, specifically "optimized for conversationality." Researchers and users quickly noticed stark differences between this version and the downloadable one, as described in this TechCrunch article. Just because a huge tech company says its model has achieved a very high score on a benchmark doesn't mean the model is good at every task, nor does it guarantee the truthfulness of its claims.

Returning to OpenAI: the company has unveiled a new initiative, announced in an official blog post, that aims to bridge the gap between powerful foundation AI models and the specific, high-stakes demands of various industries.

The initiative, called the OpenAI Pioneers Program, seeks to partner directly with companies, initially startups, to co-develop tailored AI solutions and establish rigorous real-world evaluation standards.

As AI continues its rapid integration across sectors, from finance to healthcare, OpenAI sees a growing need for models that don't just perform well in general but excel reliably in specialized, practical environments. The Pioneers Program tackles this challenge in two main ways: creating domain-specific benchmarks and building custom-tuned models.

Domain-Specific Benchmarks

This involves collaborating with multiple companies in key industries, such as legal, finance, insurance, healthcare, and accounting, to create standardized evaluation suites, or what OpenAI calls "evals." OpenAI notes the lack of a "unified source of truth" for model benchmarking in these fields.

By working intensively with partners, OpenAI's research teams aim to design evals that truly reflect real-world use cases, setting clear performance bars and ultimately increasing trust in AI systems within those sectors. These industry-specific evals are planned for public release at a later date, potentially creating valuable resources for the broader ecosystem.
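To make the idea of a domain-specific eval more concrete, here is a minimal sketch of what such a suite could look like in practice. Everything in it, including the imaginary insurance-domain cases, the `keyword_grader` function, and the pass criterion, is an illustrative assumption on my part; OpenAI's actual evals have not been published yet and will almost certainly use far more sophisticated grading.

```python
# Minimal illustrative sketch of a domain-specific eval suite.
# The cases, grader, and pass bar below are made up for illustration;
# they are NOT OpenAI's evals, which have not been released.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                 # the real-world task given to the model
    required_terms: list[str]   # facts the answer must mention to pass

# Two toy cases from a hypothetical insurance-domain eval.
CASES = [
    EvalCase(
        prompt="Summarize the exclusions in this homeowner policy: ...",
        required_terms=["flood", "earthquake"],
    ),
    EvalCase(
        prompt="Does this claim fall within the 30-day notification window? ...",
        required_terms=["30-day", "notification"],
    ),
]

def keyword_grader(answer: str, case: EvalCase) -> bool:
    """Pass only if every required term appears in the model's answer."""
    return all(term.lower() in answer.lower() for term in case.required_terms)

def run_eval(model_fn: Callable[[str], str]) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(keyword_grader(model_fn(c.prompt), c) for c in CASES)
    return passed / len(CASES)

if __name__ == "__main__":
    # Stand-in "model" so the sketch runs without any API access.
    dummy_model = lambda prompt: (
        "Exclusions include flood and earthquake damage; "
        "the 30-day notification window applies."
    )
    print(f"pass rate: {run_eval(dummy_model):.0%}")
```

A production eval in these industries would likely replace the keyword check with expert-written rubrics or model-based graders, but the structure, a fixed set of realistic cases plus an automatic scoring rule, is the core of what a "unified source of truth" benchmark provides.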

It almost feels like we are going back to the good old days when engineers would train traditional machine learning models for very specific use cases; only this time, we are doing much the same thing with LLMs. As we have all seen so far, even the most advanced models, such as OpenAI's o1, xAI's Grok 3, DeepSeek-R1, and Google's Gemini 2.5 Pro, despite being very good at solving general problems, cannot satisfy the specific needs of many industries in a reliable and trustworthy manner.

Custom-Tuned Models

The second, and perhaps more direct, benefit for participating companies is the opportunity to fine-tune models for specific tasks using a technique OpenAI calls Reinforcement Fine-Tuning (RFT). This process allows for the creation of highly specialized "expert models" optimized for a narrow set of tasks relevant to the company's domain.
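To give a rough intuition for what reinforcement fine-tuning means, here is a deliberately simplified, self-contained toy sketch: instead of imitating fixed reference answers (as in supervised fine-tuning), the model's outputs are scored by a programmatic grader and the higher-scoring behaviors are reinforced. The toy "policy," the grader, and the update rule below are all illustrative assumptions; OpenAI's actual RFT pipeline, training code, and graders run inside their fine-tuning service and are not reproduced here.

```python
# Conceptual toy sketch of the idea behind reinforcement fine-tuning (RFT):
# sample candidate answers, score each with a programmatic grader, and
# shift the "policy" toward the answers that score highest.
# This is NOT OpenAI's API or training code, just an illustration.

import random

# A stand-in "policy": a weighted distribution over canned answers to one prompt.
policy = {
    "The claim is covered under section 4(b).": 1.0,
    "The claim is probably covered, I think.": 1.0,
    "Coverage denied, no reason given.": 1.0,
}

def grader(answer: str) -> float:
    """Hypothetical domain-specific reward: prefer answers citing a policy section."""
    return 1.0 if "section" in answer.lower() else 0.0

def sample(p: dict[str, float]) -> str:
    answers, weights = zip(*p.items())
    return random.choices(answers, weights=weights, k=1)[0]

# One simplified "RFT" loop: grade sampled answers and upweight the rewarded ones.
for _ in range(200):
    answer = sample(policy)
    reward = grader(answer)
    policy[answer] *= (1.0 + 0.1 * reward)  # crude stand-in for a policy-gradient update

print("most reinforced answer:", max(policy, key=policy.get))
```

The appeal for industry partners is that the grader encodes their own definition of a correct answer (a legal citation, a compliant calculation, a valid claim decision), so the resulting expert model is optimized against their criteria rather than a generic benchmark.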

Participants will receive guidance from OpenAI's research team to train custom models tailored to their top three use cases, aiming to address specific customer pain points or improve operational efficiencies. OpenAI says these resulting models should be robust enough for production deployment at scale.

To start, the Pioneers Program is targeting startups working on "high-value, applied use cases where AI can drive real-world impact." This focus suggests OpenAI is interested in fostering innovation at the application layer, ensuring its foundational technology translates into tangible benefits in complex domains.

By embedding its researchers with these early partners, OpenAI not only helps the startups but also gains invaluable insights into the practical hurdles and opportunities of deploying advanced AI in specialized fields.

Source: OpenAI | The OpenAI Pioneers Program registration form

Startups interested in participating in the program are invited to apply via the form at the bottom of this page on OpenAI’s website.