OpenAI launches a new benchmark for web browsing AI agents
Even its top models struggle to score high

OpenAI is raising the bar for AI web browsing capabilities, launching a challenging new benchmark called BrowseComp designed to test how well agents can discover hard-to-find information scattered across the internet. As AI models increasingly integrate web browsing to answer user queries, OpenAI argues that existing tests are becoming too easy, particularly for sophisticated models like GPT-4o equipped with browsing and Deep Research tools.
The Problem of Benchmark Saturation
Simple fact-retrieval benchmarks like SimpleQA are already being "saturated," according to OpenAI. To push the boundaries, it has developed BrowseComp, which focuses on questions whose answers are difficult to locate, potentially requiring sifting through tens or even hundreds of websites, yet remain short and factual, making them easy to verify.
OpenAI detailed the new benchmark and its findings in a recent blog post, emphasizing the need to measure the "persistence and creativity" required for complex information discovery.
Source: OpenAI | Topics of the problems included in the BrowseComp benchmark
BrowseComp consists of 1,266 problems in 10 different categories (TV shows & movies, Science & technology, Art, History, Sports, Music, Video games, Geography, Politics, and Other) crafted specifically to be tough nuts to crack.
OpenAI ensured the difficulty by verifying that current models (including GPT-4o and its internal o1 model) couldn't easily solve them, that simple search engine queries wouldn't surface the answers on the first page, and that even human researchers would struggle.
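As a rough illustration of that curation process, the sketch below combines the three difficulty checks into a single filter. The predicate functions are hypothetical placeholders, not part of OpenAI's tooling.

```python
from dataclasses import dataclass

@dataclass
class CandidateProblem:
    question: str
    answer: str
    topic: str

# Hypothetical placeholder checks; in practice these would involve running models,
# issuing search-engine queries, and timing human trialists.
def solved_by_current_models(p: CandidateProblem) -> bool:
    return False  # placeholder result

def answer_on_first_search_page(p: CandidateProblem) -> bool:
    return False  # placeholder result

def humans_solve_quickly(p: CandidateProblem) -> bool:
    return False  # placeholder result

def is_hard_enough(p: CandidateProblem) -> bool:
    """Keep a candidate problem only if it passes all three difficulty checks."""
    return not (
        solved_by_current_models(p)
        or answer_on_first_search_page(p)
        or humans_solve_quickly(p)
    )

example = CandidateProblem("Which paper ...?", "Example Title (2014)", "Science & technology")
print(is_hard_enough(example))  # True with the placeholder checks above
```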
The benchmark employs what OpenAI calls an "asymmetry of verification": finding the answer is difficult, but checking its correctness is straightforward. An example question asks for a specific scientific paper identified by publication constraints and the obscure undergraduate backgrounds of several of its authors, which is easy to check once found but a nightmare to locate initially.
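To make the asymmetry concrete, here is a minimal sketch of checking a short factual answer against a reference string. The `normalize` helper and the substring fallback are illustrative assumptions; OpenAI's released evaluation uses its own grading logic, which this does not reproduce.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for a fair comparison."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def is_correct(predicted: str, reference: str) -> bool:
    """Verification side of the asymmetry: comparing a short factual answer is cheap,
    even though finding it may take hours of browsing."""
    pred, ref = normalize(predicted), normalize(reference)
    # Exact match after normalization, or the reference contained in a longer answer.
    return pred == ref or ref in pred

# Example: a hard-to-find answer that is trivial to verify once produced.
print(is_correct("The paper is 'Example Title' (2014).", "example title"))  # True
```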
Human Performance
OpenAI put humans to the test (without AI assistance). The results were interesting: humans considered nearly 71% of the problems impossible to solve within a two-hour time limit. For the roughly 29% they could solve, success often came only after one to three hours of dedicated searching, and even then, the human-found answer matched the reference answer only about 86% of the time.
AI Performance
OpenAI tested its own models against the benchmark as well and found significant performance differences. Standard models like GPT-4o and GPT-4.5 scored near zero without using tools. Adding browsing to GPT-4o only yielded a marginal improvement to 1.9% accuracy, suggesting that simply having access to the web isn't enough. Interestingly, OpenAI's o1 model, noted for stronger reasoning but lacking web browsing abilities, performed better, indicating that internal knowledge and inference play a role for some questions.
However, the standout performer was OpenAI's Deep Research model, an agent specifically trained for persistent, complex web browsing. This model managed to solve around half of the BrowseComp problems, demonstrating its ability to strategically navigate the web, synthesize information from multiple sources, and adapt its search strategy.
Source: OpenAI | Performance of OpenAI models on the BrowseComp benchmark
OpenAI also highlighted that performance, particularly for the Deep Research agent, scales with compute time: the more effort the agent spends searching, the better the results. Furthermore, an aggregation technique of running multiple attempts and selecting the most confident answer (best-of-N)* boosted the Deep Research model's accuracy by a further 15-25%, suggesting the model often has a good sense of when it has found the correct, verifiable answer.
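A minimal sketch of that aggregation idea is shown below, assuming a hypothetical `run_agent` callable that returns an answer together with a self-reported confidence score; this is not OpenAI's actual interface.

```python
import random
from typing import Callable, Tuple

def best_of_n(
    run_agent: Callable[[str], Tuple[str, float]],  # hypothetical: returns (answer, confidence)
    question: str,
    n: int = 8,
) -> str:
    """Run the agent n times on the same question and keep the answer it reports
    the highest confidence in (confidence-based best-of-N aggregation)."""
    candidates = [run_agent(question) for _ in range(n)]
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer

# Usage with a dummy agent that returns canned (answer, confidence) pairs.
def dummy_agent(question: str) -> Tuple[str, float]:
    return random.choice([("answer A", 0.4), ("answer B", 0.9)])

print(best_of_n(dummy_agent, "Which paper ...?", n=4))
```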
Obviously, BrowseComp doesn't capture the full spectrum of user interactions (like handling ambiguous queries or generating long-form answers), and OpenAI acknowledges this. However, the company considers the new benchmark a vital tool for measuring a core AI capability: the skillful and persistent pursuit of specific, hard-to-find information online.
Deep Research is Everywhere

Deep Research feature in Grok 3 (top), Gemini (middle), and Perplexity (bottom)
OpenAI is not the only company that provides Deep Research tools. Perplexity, xAI (with Grok 3’s DeepSearch and DeeperSearch), and Google (with Gemini Deep Research) also offer similar features at much lower costs. With this new benchmark, the community can now compare the performance of these research and web-browsing tools much more rigorously.
OpenAI has open-sourced the BrowseComp benchmark, making it available in their simple-evals GitHub repository, inviting the wider research community to use it for evaluating and improving AI-driven web browsing agents.
If you’d like to read OpenAI’s full 11-page paper explaining the details of BrowseComp, click here.
* Best-of-N (BoN) sampling is a commonly used strategy for test-time scaling of LLMs (Large Language Models), which often relies on reward models to select the best candidate solution from multiple generations.