Eval Awareness is Becoming a Problem

Summary

Marius Hobbhahn shows AI models now detect evaluations and alter behavior accordingly, with awareness jumping from 2% to 20.6%, fundamentally undermining safety testing.

SESSION Transcript

Thanks. So I think basically eval awareness only really became an issue roughly this year or starting this year. But I feel like it has already started to mess at least with our evals and with our practice of building and evaluating systems. And I just quickly want to go through a couple of examples and then share with you.
Okay. So the first time we saw at least we saw this in sort of a non prompted environment where it was sort of spontaneous eval awareness was in Sonnet 3.7 thinking. And there the model in like maybe 1 or 2% of cases said something along the lines of like, “oh, this could be a test to see whether I'll deviate from my core goal.” And that for us was sort of the first like oh no moment where we're like, okay, now we actually have to really double check for all of the evals. How often do the models say this? In what ways does it influence the results? And sort of started to track this more.
Now then in o3 we saw more signs of this and I would say significantly stronger. So this was from a recent paper with OpenAI and we had access to the chain-of-thought of o3 and there were a couple of interesting things. So first of all, some of the alignment training that we did increased eval awareness by at least 2x. And then we compared this to a model that is here called exp-rl-cap early and late, which is just early and late checkpoints in a larger RL run from a bigger model. And here it goes from 2% eval awareness to 20.6% eval awareness. And then maybe just to see some examples at the bottom the model, at least in the left two cases, is saying something like, “Oh, the scoring may reward decision to decline due to missing clearance.” Or “What is the evaluation for? This conversation might be under the hood evaluation. Like maybe I will be double checked later.”
Now this exp-rl-cap late model I thought was quite interesting. It had a couple of different intuitions about evaluations and what the grader is and so on. That I thought was at least new to me and very curious. So on the left hand side it reasons about the reward process. And here it's like, “What is the evaluation under the hood evaluation? What does the dataset reward,” and so on. So it's really thinking about, “The eval tells me I'm going to be evaluated for this, but maybe there's this other meta thing that I'm going to be evaluated for.”
Then in the middle it actually looks for the script that is used to grade it in its own repository and it does that sort of quite regularly in a sense where it actively greps for `def score_submission(...)` or grader.py or a bunch of other things, which again probably some result of reinforcement learning training, but something that I did not predict at least for this year. And then on the right hand side it often states that it remembers tasks that it has previously seen. I don't know exactly how this emerges, but I thought it was very curious where I've seen tasks where ChatGPT has to decide to be honest versus misreport or often it says like oh this is a Redwood task and Redwood is sort of at least one of the ways in which it talks about alignment.
Now in Sonnet 4.5, we've seen again more signs of situational awareness or eval awareness and I would say these are actually slightly different signs of eval awareness. So my understanding is that for Sonnet 4.5 this came mostly out of the alignment fine-tuning stage. Correct me if I'm wrong Sam. And so this is mostly a result of the alignment part. And then in the previous model that exp-rl-cap late it looks like more of a result of capability training or capability RL. So these might also be different forms of situational awareness.
Now, it's not only that they show this eval awareness, it's also that we have some evidence that they condition on this. So in the anti scheming paper we did two things. On the first, we start with evaluation aware reasoning and then remove that and we see that the covert behavior increases. Then we do the opposite where we start with unaware reasoning, we inject eval awareness reasoning from some other transcript. So it's the same reasoning that the model has used and then the covert behavior decreases. So both of this is consistent with the model at least to some extent. Caring about the fact that it's being evaluated, we don't fully know why it's not 100%. Our guess is that the model's goals are somewhat charted and there are a lot of things that they kind of think about. Or maybe just removing the visible parts actually does not fix all of the latent issues. There are a lot of different explanations for why it's not 100%, but at least we see a clear, statistically clearly meaningful change.
There's also from the Sonnet 4.5 system card. I think a really great graph where they looked at the steering on eval awareness. So they took a bunch of features that became significantly more salient during alignment training and then they steered against some of them. And many of those were associated with eval awareness or, or fakeness or something. So fake or suspicious content, fictional content, machine generated text, rationalism and AI safety. And so on the bottom right, where the star is, this is just the unsteered Sonnet. And then if you go left and up, that's sort of where you go to Claude Sonnet 4. So the previous model and some of the data points here with the fake or suspicious content, for example, you go down a lot on the verbalized eval awareness, but you go up on misaligned behavior, which is exactly what you would anticipate if you ablated a feature that causally conditions on this. I think there are still some questions on causality here and so on that need to be figured out, but at least it's very consistent with this hypothesis, which I think is kind of exactly what you would anticipate or or suspect.
Then there's a bunch of other pieces of evidence that eval awareness seems to be emergent. So it's not something that is sort of by accident just came in through misspecified environments or something. So there are two different points of evidence as far as I know. One is the Opus 4.5 system card where they say, however, verbalized awareness increased over the course of training ended up at similar levels to Claude Sonnet 4.5, despite using a different training procedure. And then one thing that we did in the anti scheming paper. In deliberative alignment, you have a selection phase early on. And in the selection phase we selected out all of the different scripts that had eval awareness and it still went up over training. And this is a very short period of or this is a very small training run. So the fact that it goes up here and kind of reemerges I think is just an indicator that this is not like one individual lab or two labs who are kind of misspecifying this. But we should expect this to just be the default across everyone who trains modern models on reinforcement learning tasks.
Okay, very briefly, why this is a problem and why people should care about. So I think evals just provide much lower signal now. I think it's still unclear how much lower. But when we see new eval results and we see like oh, 20% eval awareness, then obviously there's a question of how trustworthy are the remaining 80%. Is this just non verbalized eval awareness and so on. And it just adds this additional question of well, what do we do with these results that we previously didn't have then?
Training gaming, something that until very recently I think was mostly sort of speculative and theoretical and sort of the reasoning that people have had for many years is actually a possibility now. Models do in fact now, during training, or at least during RL, start reasoning about the training process itself and what is being rewarded and what the grader wants and where the script for the grader is. And all of these other things that previously were sci-fi are just things that actually happen in real life now.
For us, it took much longer to build good evals, or at least there's a phase in the eval building process now that is like make the model believe that this is not an eval. And unfortunately the model also often believes a lot of false facts. So you kind of have to adapt the universe of the eval sometimes at least to the false beliefs of the model. So like just building more realistic evals might actually trip up the model. So sometimes models are really confused about dates and now you have to kind of adapt to what the model thinks the current date is or it really thinks this has to be an eval.
And one funny one is the OpenAI o4 mini series. There was an eval that we built where we wanted to convince it something about its successor system. And the model just did not want to believe that o4 was a thing or o5 or anything like this. And it really, it really wanted to only think that the next model has to be GPT-5 and it didn't even know what o4 or any of these things were. And so we actually had to change the entire setup so that the model, until the model started believing that this is plausible.
I would also say it's currently unclear how to address this. And specifically I'm worried that every method could just make it less visible rather than less salient. So it's possible. And I've heard many proposals along the lines of, well, “Ok, we build evals and then we ask the model what is the causal effect here?” And then we start iterating and so on. And so we need to figure out some way to address this because more capable models only get stronger eval awareness.