OpenAI has introduced ‘BrowseComp,’ a challenging new benchmark designed to assess how well AI agents can navigate complex online information networks and extract relevant data. The test features 1,266 intricate questions styled as an immersive ‘online treasure hunt,’ simulating scenarios where answers are difficult to uncover but readily verifiable. The questions span diverse fields, from film and technology to history, and are significantly more difficult than those in existing tests such as SimpleQA. According to the AIGC Open Community, BrowseComp presents a formidable challenge: even OpenAI’s own models, GPT-4o and GPT-4.5, achieve accuracy rates of only 0.6% and 0.9%, respectively. However, its recently released agent model, Deep Research, reaches a substantially higher accuracy of 51.5%.
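To illustrate the “hard to find, easy to verify” design principle, here is a minimal Python sketch of how grading such a benchmark could work. The dataset fields, the normalization rule, and the exact-match check are illustrative assumptions for this sketch, not OpenAI’s actual grading pipeline:

```python
# Illustrative sketch: answers that take long browsing chains to *find*
# can still be *verified* with a cheap string comparison.
from dataclasses import dataclass


@dataclass
class BrowseCompItem:
    question: str          # a deliberately obscure, multi-hop question
    reference_answer: str  # a short, easily checkable answer


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't matter."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def grade(item: BrowseCompItem, model_answer: str) -> bool:
    """Verification is trivial even though finding the answer is not."""
    return normalize(model_answer) == normalize(item.reference_answer)


def accuracy(items: list[BrowseCompItem], answers: list[str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(grade(i, a) for i, a in zip(items, answers)) / len(items)


if __name__ == "__main__":
    item = BrowseCompItem(
        question="Which film matches these obscure, cross-referenced clues?",
        reference_answer="Example Title",  # hypothetical placeholder
    )
    print(grade(item, "example title!"))  # True: checking is a one-liner
```

This asymmetry is what lets the benchmark use very hard research questions while keeping scoring objective and cheap.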