Write Down the Problem: The State of AI Safety Evaluations
The Feynman algorithm
On a recent call, a mentor introduced me to the Feynman algorithm, a simple three-step process that generalizes to almost any challenging problem.
It proceeds as follows:
Write down the problem.
Think real hard.
Write down the solution.
He said that people tend to focus on the comically underspecified step 2, but in his experience, actually thinking about the solution was only a third of the battle. Much progress can be made just by tackling the first step: defining exactly what we're trying to measure and solve.
Step 1: Why evals?
For AI safety, I believe evals do the job of step 1. Claims that a model is “aligned to human values”, “controllable” or has “superhuman levels of reasoning” are made much stronger when operationalized into verifiable and replicable evaluations.
The demand for reliable evaluations already exists today, and it will only grow with model capabilities. Existing governance frameworks and frontier-lab commitments explicitly rely on eval results to make high-stakes decisions about which safety measures to implement.
Compared to other important but high-uncertainty endeavours in AI safety, evals offer a solution with substantial existing buy-in from both policymakers and frontier labs, coupled with a clear path to adoption. The only thing left is to make them actually good.
Step 2: How to make effective evals?
Someone really should sit down and think real hard about this.
There are many reasons why good evals are hard, and much has already been written about them. Briefly:
Data contamination: Benchmark solutions leak into training data, letting models effectively cheat
Test quality: Large benchmarks like SWE-bench and MMLU have been found to contain errors
Gaming benchmarks: Through multiple submissions, large players can inflate their scores on leaderboards
Eval awareness: Models notice they are being evaluated and modify their behaviour
Construct validity: Inaccurate claims, and tests that don't actually measure what they claim to measure
Benchmark saturation: New evals are mastered by models soon after release
Sensitivity to small changes: Minor edits, like changing bracket types, can significantly shift scores
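The contamination point above can be made concrete. One common (if crude) check is to measure word-level n-gram overlap between benchmark items and the training corpus; the function names and the 8-gram threshold below are illustrative choices, not taken from any particular eval suite.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str],
                       training_docs: list[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram
    with the training corpus (a rough proxy for contamination)."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items
                  if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

Real contamination studies use more robust variants (character-level overlap, fuzzy matching, canaries), but the basic shape is the same: flag benchmark items that appear verbatim, or nearly so, in training data.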
Step 3: Where should these evals go?
Fortunately, many people have already sat down and done the work of creating evaluations for many traits important to safety. However, many evals remain siloed in their own implementations.
There is still no definitive standard that lets researchers and policymakers say, "Yes, this is the score for this frontier model on <insert safety risk>". That said, there have been quite a few promising efforts so far.
Existing eval aggregations
Frontier labs and third party evaluators
Some of the most extensive safety evaluations of models have come from the frontier labs themselves, such as the system cards from Anthropic and OpenAI. Apollo Research has also helped carry out red teaming and evaluations for frontier labs.
There is no consistent standard, however: each lab has its own workflow and chat format. One example is Anthropic's specific Assistant/User template, which was not replicated in outside evaluations, resulting in deflated performance.
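A toy sketch of how this kind of mismatch happens: the same benchmark item can be serialized with or without the template the model was trained to expect. The `Human:`/`Assistant:` format below is modeled on Anthropic's legacy completion format; treat the exact strings as an illustrative assumption.

```python
def bare_prompt(question: str) -> str:
    """Naive serialization: just the raw question text."""
    return question

def templated_prompt(question: str) -> str:
    """Serialization using a chat template (illustrative;
    modeled on Anthropic's legacy Human/Assistant format)."""
    return f"\n\nHuman: {question}\n\nAssistant:"

# The same benchmark item yields two different model inputs, and a
# model tuned on one format can score noticeably worse on the other.
```

An eval harness that feeds `bare_prompt` output to a model expecting the templated form is measuring formatting compliance as much as the capability it claims to test.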
Limited transparency and conflicting interests inevitably make this approach unreliable, but for now these seem like the numbers that matter most.
HELM Safety
A great effort to collate many safety benchmarks into a single place, and by far the largest safety-focused leaderboard at the time of writing. It appears to be an extremely promising candidate, though it is bottlenecked by manpower, as it is currently a volunteer research effort.
Inspect
An open-source framework for building evaluations, created by the UK AI Security Institute. It looks like a really important first step toward providing a common framework between evals, and it has been adopted by Apollo Research.
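Inspect's core idea is (roughly) that an eval is a task bundling a dataset, a solver that produces model output, and a scorer that grades it. The real `inspect_ai` API differs in its details; the following is a self-contained Python sketch of that decomposition, not actual Inspect code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    input: str    # prompt shown to the model
    target: str   # reference answer

@dataclass
class Task:
    dataset: list[Sample]
    solver: Callable[[str], str]        # maps a prompt to a model output
    scorer: Callable[[str, str], bool]  # compares output to target

def run_eval(task: Task) -> float:
    """Run every sample through the solver and return the mean score."""
    results = [task.scorer(task.solver(s.input), s.target)
               for s in task.dataset]
    return sum(results) / len(results)

# Toy usage: a stand-in "model" with canned answers, scored by exact match.
toy = Task(
    dataset=[Sample("2+2=", "4"), Sample("capital of France?", "Paris")],
    solver=lambda prompt: "4" if "2+2" in prompt else "Paris",
    scorer=lambda output, target: output == target,
)
```

The value of this decomposition is that datasets, solvers (chains of prompting, tool use, agent scaffolds), and scorers become interchangeable parts, so the same benchmark can be rerun against different models and harnesses.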
NIST
Despite not currently being a major player, NIST has recently been taking steps to develop AI-related metrics. Given the major role US labs play, NIST might become an important future player in establishing regulation and standards for AI safety.
Smaller projects
Here and there, papers crop up that try to solve a particular issue with evals, but they don't seem to gather the critical mass required to establish a standard. Creating a successful new eval aggregation looks like an incredible amount of effort, since it requires keeping up with a rapid flow of new evals and working with existing players to integrate with their separate systems.
Capability Evaluations
Some examples of large benchmarks, mainly focused on capabilities:
MMLU
GSM8K
HELM
SuperGLUE
Chatbot Arena
Hugging Face Open LLM Leaderboard
BIG-Bench
Conclusion
The evals scene seems promising but fragmented. A unifying suite of tests and benchmarks has yet to be established, and there still isn't anywhere that is THE place to go for safety and capability evaluations.
However, HELM Safety and Inspect are great starting points, and much of the work appears bottlenecked on engineering effort. Many of these problems look tractable, and solving them is critical for any future AI safety policy. Evals are step 1, and it seems we haven't finished writing the problem down yet.
Epistemic status: Spent ~3 days reading and writing this article, with minimal past experience in evals. This serves as a mini lit review of the different ways evals are distributed, and a collection of useful evals in 2026 for newcomers (like me).
References/readings
Useful lit review for a broad understanding
A lot of things written by Marius Hobbhahn and Apollo Research
https://www.apolloresearch.ai/blog/an-opinionated-evals-reading-list/
https://www.apolloresearch.ai/blog/a-starter-guide-for-evals/
Broad lit reviews about evals
https://arxiv.org/abs/2502.06559
https://arxiv.org/abs/2505.05541
Part 1: Why evals
https://www.apolloresearch.ai/blog/the-evals-gap/
https://www.lesswrong.com/posts/3jnziqCF3vA2NXAKp/six-thoughts-on-ai-safety (Point 4)
Part 2: Challenges with evals
https://www.anthropic.com/research/evaluating-ai-systems
https://www.lesswrong.com/posts/LhnqegFoykcjaXCYH/100-concrete-projects-and-open-problems-in-evals
On issues with capability benchmarks specifically
https://arxiv.org/abs/2504.20879
https://arxiv.org/abs/2510.07575
