Technology companies are locked in a frenzied arms race to release ever-more powerful artificial intelligence tools. To demonstrate that power, firms subject the tools to question-and-answer tests known as AI benchmarks and then brag about the results.
What Are AI Benchmarks?
Google’s CEO, for example, said in December that a version of the company’s new large language model Gemini had “a score of 90.0%” on a benchmark known as Massive Multitask Language Understanding, making it “the first model to outperform human experts” on it. Not to be upstaged, Meta CEO Mark Zuckerberg was soon bragging that the latest version of his company’s Llama model “is already around 82 MMLU.”
Are These Benchmarks Effective?
The problem, experts say, is that this test and others like it don’t tell you much, if anything, about an AI product—what sorts of questions it can reliably answer, when it can safely be used as a substitute for a human expert, or how often it avoids “hallucinating” false answers. “The yardsticks are, like, pretty fundamentally broken,” said Maarten Sap, an assistant professor at Carnegie Mellon University and co-creator of a benchmark. The issues with them become especially worrisome, experts say, when companies advertise the results of evaluations for high-stakes topics like health care or law.
“Many benchmarks are of low quality,” wrote Arvind Narayanan, professor of computer science at Princeton University and co-author of the “AI Snake Oil” newsletter, in an email.
How Are Benchmarks Created?
To find out more about how these benchmarks were built and what they are actually testing for, The Markup, which is part of CalMatters, went through dozens of research papers and evaluation datasets and spoke to researchers who created these tools. It turns out that many benchmarks were designed to test systems far simpler than those in use today. Some are years old, increasing the chance that models have already ingested these tests when being trained. Many were created by scraping amateur user-generated content from sites like WikiHow, Reddit, and trivia websites rather than by collaborating with experts in specialized fields. Others used Mechanical Turk gig workers to write questions meant to test for morals and ethics.
What Do Benchmarks Measure?
The tests cover an astounding range of knowledge, such as eighth-grade math, world history, and pop culture. Many are multiple choice; others take free-form answers. Some purport to measure knowledge of advanced fields like law, medicine, and science. Others are more abstract, asking AI systems to choose the next logical step in a sequence of events, or to review “moral scenarios” and decide what actions would be considered acceptable behavior in society today.
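The arithmetic behind headline numbers like the MMLU scores cited above is straightforward: most such benchmarks are scored as simple accuracy over a fixed set of question-and-answer items. The following is a minimal sketch of that process, assuming a hypothetical ask_model() function standing in for whichever model is under test; it is an illustration, not any company’s actual evaluation harness.

```python
# A minimal sketch of how a multiple-choice benchmark score is typically computed:
# each item pairs a question and answer choices with a single "gold" letter, and the
# reported figure is just the fraction of items the model gets right. ask_model() is
# a hypothetical stand-in for a call to whatever model is being evaluated.

from typing import Callable

QUESTIONS = [
    {
        "question": "What is 2 + 2?",
        "choices": {"A": "3", "B": "4", "C": "5", "D": "22"},
        "answer": "B",
    },
    # ... a real benchmark such as MMLU has thousands of items across many subjects
]

def score_multiple_choice(ask_model: Callable[[str], str]) -> float:
    """Return accuracy: the share of items where the model picks the gold letter."""
    correct = 0
    for item in QUESTIONS:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["choices"].items()
        )
        prediction = ask_model(prompt).strip().upper()[:1]  # keep only the letter
        if prediction == item["answer"]:
            correct += 1
    return correct / len(QUESTIONS)

# Example: a trivial "model" that always answers "B" scores 100% on this toy set,
# which is exactly the kind of inflated number a narrow benchmark can produce.
print(f"accuracy: {score_multiple_choice(lambda _: 'B'):.1%}")
```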
“The creators of the benchmark have not established that the benchmark actually measures understanding,” said Emily M. Bender, professor of linguistics at the University of Washington.
What Are the Concerns?
Problems with the benchmarks are coming into focus amid a broader reckoning with the impacts of AI, including among policymakers. Dozens of AI-related bills are pending in the legislature of California, a state that has historically been at the forefront of tech oversight. May also saw the passage of the nation’s first comprehensive AI legislation in Colorado and the release of an AI “roadmap” by a bipartisan U.S. Senate working group.
Benchmarks and Leaderboards
These problems with benchmarks matter because the tests play an outsized role in how proliferating AI models are measured against each other. In addition to Google and Meta, firms like OpenAI, Microsoft, and Apple have invested massively in AI systems, with a recent focus on “large language models,” the underlying technology powering the current crop of AI chatbots, such as OpenAI’s ChatGPT. All are eager to show how their models stack up against the competition and against prior versions. This is meant to impress not only consumers but also investors and fellow researchers. In the absence of official government or industry standardized tests, the AI industry has embraced several benchmarks as de facto standards, even as researchers raise concerns about how they are being used.
Misplaced Trust
In the 2021 research paper “AI and the Everything in the Whole Wide World Benchmark,” Bender and her co-authors argued that claiming a benchmark measures general knowledge is potentially harmful, and that “presenting any single dataset in this way is ultimately dangerous and deceptive.”
Years later, big tech companies like Google boast that their models can pass the U.S. Medical Licensing Examination, which Bender warned could lead people to believe that these models are smarter than they are. “So I have a medical question,” she said. “Should I ask a language model? No. But if someone’s presenting its score on this test as its credentials, then I might choose to do that.”
Building Better Benchmarks
Just as there is an arms race among AI models, researchers have also escalated their attempts to improve benchmarks. One promising approach is to put humans in the loop. “ChatBot Arena” was created by researchers from several universities. The publicly available tool lets you test two anonymous models side by side. Users enter a single text prompt, and the request is sent to two randomly selected chatbot agents. When the responses come back, the user is asked to grade them in one of four ways: “A is better,” “B is better,” “Tie,” or “Both are bad.”
ChatBot Arena draws on more than 100 different models and has collected over 1 million grades so far, which feed a model-ranking leaderboard.
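Leaderboards built from pairwise human preferences are commonly computed with an Elo-style rating system, in which each vote nudges the two models’ scores toward the observed outcome. The sketch below illustrates that general mechanism; the model names and votes are hypothetical, and this is not ChatBot Arena’s actual implementation.

```python
# A simplified sketch of turning pairwise "A is better / B is better / Tie" votes
# into a leaderboard with an Elo-style rating update. It illustrates the general
# idea behind vote-based rankings; it is not ChatBot Arena's actual code.

from collections import defaultdict

K = 32  # update step size; larger values react faster to new votes

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_vote(model_a: str, model_b: str, outcome: float) -> None:
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - expected)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - expected))

# Hypothetical votes: model names and outcomes are made up for illustration.
record_vote("model-x", "model-y", 1.0)   # user preferred model-x
record_vote("model-y", "model-z", 0.5)   # tie
record_vote("model-x", "model-z", 1.0)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.0f}")
```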
No Regulations, No Sign of Slowing
The rapid pace of new model releases shows no sign of slowing. In 2023, 149 major “foundation” models were released, according to Stanford’s AI Index Report, double the previous year’s number.
OpenAI CEO Sam Altman and Meta CEO Mark Zuckerberg have both said they would welcome some degree of federal oversight of AI technology, and federal lawmakers have flagged such regulation as an urgent priority, but lawmakers have so far taken little action.
In May of this year, a bipartisan Senate working group released a “roadmap” for AI policy that proposed $32 billion in new spending but did not include any new legislation. Congress has also stalled on delivering a comprehensive federal privacy law, which could affect AI tools.
Many researchers echo the same major concern: Benchmark creators need to be more careful how they design these tools, and clearer about their limitations.
Su Lin Blodgett is a researcher at Microsoft Research Montreal in the Fairness, Accountability, Transparency, and Ethics in AI group. Blodgett underscored this point, saying, “It’s important that we as a field, every time we use a benchmark for anything, or any time we take any kind of measurement, to say what is it actually able to tell us meaningfully, and what is it not?
“Because no benchmark, no measurement can do everything.”