AI Benchmarking Costs Soar: Is Trust in Reasoning Models Threatened?

The world of Artificial Intelligence (AI) is evolving rapidly, with companies like OpenAI pushing the boundaries of sophisticated 'reasoning' models. These models, capable of working through complex problems step by step, hold immense promise for fields like physics. However, as AI benchmark testing gains popularity, a sobering reality is emerging: verification costs are skyrocketing, raising questions about transparency and trust in this fast-moving landscape.

Third-party firms specializing in AI benchmarking are shedding light on this trend, revealing the steep price tags attached to assessing these advanced models. For instance, testing OpenAI's o1 reasoning model across seven popular benchmarks cost $2,767, while Anthropic's Claude 3.7 Sonnet reached a hefty $1,485.35. This is ringing alarm bells in the cryptocurrency world, where transparency and verifiable data are paramount for investors.

The escalating costs stem from token generation – tokens being the fundamental units of text that AI models consume and produce. Reasoning models generate significantly more tokens than their non-reasoning counterparts, which directly inflates testing costs. Third-party firms like Artificial Analysis report spending $5,200 evaluating just a dozen reasoning models – more than double the $2,400 they spent on over 80 non-reasoning models. Even seemingly
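To make the token-driven cost dynamic concrete, here is a minimal back-of-the-envelope sketch. All numbers in it (tokens per task, task counts, per-million-token prices) are illustrative assumptions for demonstration, not figures from the article or any provider's actual pricing.

```python
# Hedged sketch: why reasoning models cost more to benchmark.
# All numbers below are hypothetical, chosen only to illustrate the dynamic.

def benchmark_cost(tokens_per_task: int, num_tasks: int,
                   price_per_million_tokens: float) -> float:
    """Estimate the API bill for running one benchmark suite."""
    total_tokens = tokens_per_task * num_tasks
    return total_tokens / 1_000_000 * price_per_million_tokens

# A non-reasoning model might emit a short answer per task...
concise = benchmark_cost(tokens_per_task=500, num_tasks=2_000,
                         price_per_million_tokens=10.0)

# ...while a reasoning model emits a long chain of thought first,
# multiplying the token count (and often carrying a higher price too).
reasoning = benchmark_cost(tokens_per_task=8_000, num_tasks=2_000,
                           price_per_million_tokens=15.0)

print(f"concise: ${concise:.2f}, reasoning: ${reasoning:.2f}")
# concise: $10.00, reasoning: $240.00
```

The point of the sketch is that cost scales linearly with output tokens, so a model that "thinks out loud" for thousands of tokens per question can cost an order of magnitude more to benchmark even at similar per-token prices.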