AI

Many safety evaluations for AI models have significant limitations

Comment

Text to video concept, text-to-video by generative AI. Language model technology. Cyborg hand holding vdo generated by artificial intelligence.
Image Credits: Ole_CNX (opens in a new window) / Getty Images

Despite increasing demand for AI safety and accountability, today’s tests and benchmarks may fall short, according to a new report.

Generative AI models — models that can analyze and output text, images, music, videos and so on — are coming under increased scrutiny for their tendency to make mistakes and generally behave unpredictably. Now, organizations from public sector agencies to big tech firms are proposing new benchmarks to test these models’ safety.

Toward the end of last year, startup Scale AI formed a lab dedicated to evaluating how well models align with safety guidelines. This month, NIST and the U.K. AI Safety Institute released tools designed to assess model risk.

But these model-probing tests and methods may be inadequate.

The Ada Lovelace Institute (ALI), a U.K.-based nonprofit AI research organization, conducted a study that interviewed experts from academic labs, civil society and those who are producing vendors models, as well as audited recent research into AI safety evaluations. The co-authors found that while current evaluations can be useful, they’re non-exhaustive, can be gamed easily and don’t necessarily give an indication of how models will behave in real-world scenarios.

“Whether a smartphone, a prescription drug or a car, we expect the products we use to be safe and reliable; in these sectors, products are rigorously tested to ensure they are safe before they are deployed,” Elliot Jones, senior researcher at the ALI and co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, assess how evaluations are currently being used and explore their use as a tool for policymakers and regulators.”

Benchmarks and red teaming

The study’s co-authors first surveyed academic literature to establish an overview of the harms and risks models pose today, and the state of existing AI model evaluations. They then interviewed 16 experts, including four employees at unnamed tech companies developing generative AI systems.

The study found sharp disagreement within the AI industry on the best set of methods and taxonomy for evaluating models.

Some evaluations only tested how models aligned with benchmarks in the lab, not how models might impact real-world users. Others drew on tests developed for research purposes, not evaluating production models — yet vendors insisted on using these in production.

We’ve written about the problems with AI benchmarks before, and the study highlights all these problems and more.

The experts quoted in the study noted that it’s tough to extrapolate a model’s performance from benchmark results and it’s unclear whether benchmarks can even show that a model possesses a specific capability. For example, while a model may perform well on a state bar exam, that doesn’t mean it’ll be able to solve more open-ended legal challenges.

The experts also pointed to the issue of data contamination, where benchmark results can overestimate a model’s performance if the model has been trained on the same data that it’s being tested on. Benchmarks, in many cases, are being chosen by organizations not because they’re the best tools for evaluation, but for the sake of convenience and ease of use, the experts said.

“Benchmarks risk being manipulated by developers who may train models on the same data set that will be used to assess the model, equivalent to seeing the exam paper before the exam, or by strategically choosing which evaluations to use,” Mahi Hardalupas, researcher at the ALI and a study co-author, told TechCrunch. “It also matters which version of a model is being evaluated. Small changes can cause unpredictable changes in behavior and may override built-in safety features.”

The ALI study also found problems with “red-teaming,” the practice of tasking individuals or groups with “attacking” a model to identify vulnerabilities and flaws. A number of companies use red-teaming to evaluate models, including AI startups OpenAI and Anthropic, but there are few agreed-upon standards for red-teaming, making it difficult to assess a given effort’s effectiveness.

Experts told the study’s co-authors that it can be difficult to find people with the necessary skills and expertise to red-team, and that the manual nature of red-teaming makes it costly and laborious — presenting barriers for smaller organizations without the necessary resources.

Possible solutions

Pressure to release models faster and a reluctance to conduct tests that could raise issues before a release are the main reasons AI evaluations haven’t gotten better.

“A person we spoke with working for a company developing foundation models felt there was more pressure within companies to release models quickly, making it harder to push back and take conducting evaluations seriously,” Jones said. “Major AI labs are releasing models at a speed that outpaces their or society’s ability to ensure they are safe and reliable.”

One interviewee in the ALI study called evaluating models for safety an “intractable” problem. So what hope does the industry — and those regulating it — have for solutions?

Hardalupas believes there’s a path forward, but that it’ll require more engagement from public-sector bodies.

“Regulators and policymakers must clearly articulate what it is that they want from evaluations,” she said. “Simultaneously, the evaluation community must be transparent about the current limitations and potential of evaluations.”

Hardalupas suggests that governments mandate more public participation in the development of evaluations and implement measures to support an “ecosystem” of third-party tests, including programs to ensure regular access to any required models and data sets.

Jones thinks that it may be necessary to develop “context-specific” evaluations that go beyond simply testing how a model responds to a prompt, and instead look at the types of users a model might impact (e.g. people of a particular background, gender or ethnicity) and the ways in which attacks on models could defeat safeguards.

“This will require investment in the underlying science of evaluations to develop more robust and repeatable evaluations that are based on an understanding of how an AI model operates,” he added.

But there may never be a guarantee that a model’s safe.

“As others have noted, ‘safety’ is not a property of models,” Hardalupas said. “Determining if a model is ‘safe’ requires understanding the contexts in which it is used, who it is sold or made accessible to, and whether the safeguards that are in place are adequate and robust to reduce those risks. Evaluations of a foundation model can serve an exploratory purpose to identify potential risks, but they cannot guarantee a model is safe, let alone ‘perfectly safe.’ Many of our interviewees agreed that evaluations cannot prove a model is safe and can only indicate a model is unsafe.”

More TechCrunch

Rocket Lab surpassed $100 million in quarterly revenue for the first time, a 71% increase from the same quarter of last year. This is just one of several shiny accomplishments…

Rocket Lab’s sunny outlook bodes well for future constellation plans 

In 1996, two companies, Patersons HR and Payroll Solutions, formed a venture called CloudPay to provide payroll and payments services to enterprise clients. CloudPay grew quietly over the next several…

CloudPay, a payroll services provider, lands $120M in new funding

The vulnerabilities allowed one security researcher to peek inside the leak sites without having to log in.

Security bugs in ransomware leak sites helped save six companies from paying hefty ransoms

Featured Article

A comprehensive list of 2024 tech layoffs

The tech layoff wave is still going strong in 2024. Following significant workforce reductions in 2022 and 2023, this year has already seen 60,000 job cuts across 254 companies, according to independent layoffs tracker Layoffs.fyi. Companies like Tesla, Amazon, Google, TikTok, Snap and Microsoft have conducted sizable layoffs in the…

A comprehensive list of 2024 tech layoffs

A new “beta rabbit” mode adds some conversational AI chops to the Rabbit r1, particularly in more complex or multi-step instructions.

Rabbit’s r1 refines chats and timers, but its app-using ‘action model’ is still MIA

Los Angeles is notorious for its back-to-back traffic. Three events that promise to bring in millions of spectators from around the world — the 2026 World Cup, the Super Bowl…

Archer to set up air taxi network in LA by 2026 ahead of World Cup

Featured Article

Amazon is fumbling in India

Amazon’s decision to overlook quick-commerce in India is now looking like a significant misstep.

Amazon is fumbling in India

OpenAI’s GPT-4o, the generative AI model that powers the recently launched alpha of Advanced Voice Mode in ChatGPT, is the company’s first trained on voice as well as text and…

OpenAI finds that GPT-4o does some truly bizarre stuff sometimes

On Thursday, Box filled in a missing piece on its AI platform when it bought automated metadata extracting startup, Alphamoon.

Box adds crucial piece to its AI platform with Alphamoon acquisition

OpenAI has announced a new appointment to its board of directors: Zico Kolter. Kolter, a professor and director of the machine learning department at Carnegie Mellon, predominantly focuses his research…

OpenAI adds a Carnegie Mellon professor to its board of directors

Count Spotify and Epic Games among the Apple critics who are not happy with the iPhone maker’s newly revised compliance plan for the European Union’s Digital Markets Act (DMA). Shortly…

Spotify and Epic Games call Apple’s revised DMA compliance plan ‘confusing,’ ‘illegal’ and ‘unacceptable’

Thursday seeks to shake up conventional online dating in a crowded market. The app, which recently expanded to San Francisco, fosters intentional dating by restricting user access to Thursdays. At…

Thursday, the dating app that you can use only on Thursdays, expands to San Francisco

AI companies are gobbling up investor money and securing sky-high valuations early in their life cycle. This dynamic has many calling the AI industry a bubble. Nick Frosst, a co-founder…

Cohere co-founder Nick Frosst thinks everyone needs to be more realistic about what AI can and cannot do

Instagram is rolling out the ability for users to add up to 20 photos or videos to their feed carousels, as the platform embraces the trend of “photo dumps.” Back…

Instagram is embracing the ‘photo dump’

Welcome back to TechCrunch Mobility — your central hub for news and insights on the future of transportation. Sign up here for free — just click TechCrunch Mobility! Anyone paying…

Lyft ‘opens a can of whoop ass’ on surge pricing, Tesla’s Dojo explained and Saudi Arabia pumps $1.5B into Lucid

Flint Capital just closed its third fund at $160 million. Its has a unique strategy for finding its limited partner investors. 

Flint Capital raises a $160M through an unusual fund-raising strategy

Earlier this week it emerged that the DPC had instigated court proceedings seeking an injunction against X over the data processing without consent.

Elon Musk’s X agrees to pause EU data processing for training Grok

During testing, Google DeepMind’s table tennis bot was able to beat all of the beginner-level players it faced.

Google DeepMind develops a ‘solidly amateur’ table tennis robot

The X account announced that its Premium+ subscription would now be “fully” ad-free, leading some to question how this change would affect creator earnings.

As X sues advertisers over boycott, the app ditches all ads from its top subscription tier

Apple has further revised its compliance plan for the European Union’s Digital Markets Act (DMA) rulebook, which, since March, has forced it to give iOS developers more freedom over how…

Apple revises DMA compliance for App Store link-outs, applying fewer restrictions and a new fee structure

The rise of neobanks has been fascinating to witness, as a number of companies in recent years have grown from merely challenging traditional banks to being massive players in and…

Chime and Dave execs are coming to TechCrunch Disrupt 2024

If you visited the Wikipedia website on mobile this week, you might have seen a pop-up indicating that dark mode is ready for prime time.

How to enable Wikipedia’s dark mode

The home security company says attackers accessed databases containing customer home addresses, email addresses, and phone numbers.

Home security giant ADT says it was hacked

The Looking Glass Pro has a 6-inch display and a foldable base. It shows spatial images like those created with the Apple Vision Pro and iPhone 15 Pro.

Looking Glass’ new lineup includes a $300 phone-sized holographic display

TikTok’s latest offering is capitalizing on the app’s ability to serve as a discovery engine for other media — something its users already take advantage of by sharing short clips…

TikTok partners with Warner Bros. to become a discovery engine for TV and movies

Cocoon is a new startup built on the belief that greener steel production and the creation of concrete slag doesn’t have to be an either/or proposition.

Cocoon is transforming steel production runoff into a greener cement alternative

SoundHound, an AI company that makes voice interface tech used by car companies, restaurants and tech firms, is doubling down on enterprise services by playing consolidator in a crowded market.…

SoundHound acquires Amelia AI for $80M after it raised $189M+

Seeking mental health support is a complex process, but some founders believe that using AI to formalize techniques like cognitive behavioral therapy (CBT) can help folks who might not have…

Feeling Great’s new therapy app translates its psychiatrist co-founder’s experience into AI

The U.K.’s antitrust regulator has confirmed that it’s carrying out a formal antitrust investigation into Amazon’s ties with Anthropic, after Amazon recently completed a $4 billion investment into the AI startup.…

UK launches formal probe into Amazon’s ties with AI startup Anthropic

Bardeen has raised $3M to build its platform that uses a natural language interface to automate repetitive knowledge work.

AI business agent startup Bardeen pulls in strategic investment from Dropbox and HubSpot