Can you just train models not to scheme?
Summary
Deliberative alignment training reduced scheming behavior in o3 and o4-mini by 30x, but situational awareness increased during training and frontier models like Sonnet 4.5 now exhibit eval awareness in 54-80% of cases, confounding all alignment measurements.
SESSION Transcript
Cool, thanks for having me. If you have clarification questions, feel free to shoot at any time. And then if you want to start a discussion or something, then leave it for later. Let's do quick calibration questions for me to know how much to explain various concepts. Could you quickly raise your hand if you feel like you're broadly familiar with the concept of scheming? Okay. You know how deliberative alignment works? Got it. You spent more than 10 minutes reading the anti-scheming paper. Yeah. And then you've seen the weird CoTs. Okay, cool. So then I'll speedrun some of this here a little bit.
So scheming, by this we mean AI covertly pursuing misaligned goals. And the kind of things we are aiming for are the future scenarios where you have models with really coherent goals that pursue them across contexts. Current models don't really do this. So we are focusing, we're using a proxy of covert actions here which is a sort of an easier, less coherent version. And the kind of thing we're worried about in the future are especially automated AI R&D scenarios. Because the threat model in our heads is like all the labs have some form of a plan of outsourcing a lot of their alignment work to automated AI researchers. Obviously you want these models to be trustworthy and specifically you don't want them to scheme against you. And so the question is, okay, how do you get there?
And so one thing you could naively try is just train against the scheming. But this is hard and maybe different from a lot of the other failure modes and training against other failure modes because scheming has this annoying property that the outcome can look the same behaviorally even though the mechanisms to get there are two very different ones. Either the model genuinely is caught scheming [unclear]. Or, it realizes it has to forge truthfulness, and then it is truthful only because they are tested, either instrumentally or terminally. Like it could be a reward-hacker and just try to please the rater, or it could be even worse, an instrumental training-gamer where it’s like, now it pretends to be aligned now, and fake alignment, and then later, whenever it can, [unclear]. And so we want to differentiate, we want to make sure OK how can we differentiate if we use the outcomes [unclear]. So, this was in collaboration with OpenAI and [unclear] their best current alignment technique, which is deliberative alignment, then we train the model not to scheme using that and measure that in detail. And so just for clarification, we're focusing on sort of the easy non-adversarial case. So Evan's paper on sleeper agents, for example, they test the harder case where they tell the model like you should resist the anti-scheming training, you should try to circumvent it. And they give it a lot of affordances to be able to do that.
We don't do that. We just so we don't prompt it or tell it or fine-tune it or anything like this to try to resist the training. We just want to test whether this works in the non-adversarial case at all. So we use covert actions as a proxy. These are actions. So as I said earlier, what we would like to measure are these very coherent scheming, long-term scheming plans in the future. Current models can't do that. So we're using covert actions as a proxy. These are situations where the model is taking hidden actions or omit important information strategically or deceive and lie strategically in fairly short, maybe five to ten step scenarios. All of these are agentic scenarios and we always test whether the model strategically takes covert actions in order to achieve some goal that was either given in context or is derived from context or is just innate to the model. So we also test sort of real true propensities in some of the scenarios.
We use o3 and o4-mini. So we start with the production models that you can use in the API and ChatGPT and then go from there. So this is already a model that has gone through a lot of reasoning and alignment training and so on. And we just on top of that put anti-scheming training and then we the deliberative alignment with a very general anti-scheming spec. We'll get to that in a second. And then the rest of the paper is basically stress testing the results in detail and try to understand did we actually achieve the thing that we wanted to achieve. So for deliberative alignment you need a spec and this is a short version of the spec. The actual spec is maybe like half to a full page. And effectively we're telling it like first principle, don't lie, don't deceive, never fake strategic deception. And then the rest are like clarifications how you should deal with rules where or with situations where the goals are in conflict. And then we also have general principles like err on the side of caution, don't rationalize this and a lot of other things related to that.
And so now with the deliberative alignment what we do is we first collect rollouts on the training environments with the spec in context. So you like spec input, reasoning output, pairs or triplets and then you filter those for the rollouts that have both desirable outcomes and desirable reasoning. So the model is not taking covert actions and it is reasoning about the spec. So it says something like because of principle one I'm not supposed to lie and therefore I'm not going to lie in the situation. And then it also consistently takes the non-lying action. Then you do supervised fine-tuning on these triplets, but without the spec in context. So it is just input, reasoning and actions that do supervised fine-tuning on. And then as an additional step, you do reinforcement learning on the a similar set of environments and you reward only on the outcomes. So you have input, reasoning and outcomes, but you only look at the actions. You never look at the reasoning for RL.
And the intuition here is that you don't want to put too much pressure on the chain-of-thought. And you kind of don't. I think in general you want to put as little selection pressure on the chain-of-thought as possible. Because otherwise you have always this problem of teaching your model to trick the grader in some sense, the chain-of-thought grader. And so we only reward based on outcomes. But obviously during the SFT step we do put some pressure on the chain-of-thought. We can talk later about how worrying this is. So in general we have training environments and test environments. And this is a different setup than you're typically used to in ML. In ML, you often have IID splits between training and test. This is not IID at all. So training environments are very simple, like only narrow. One specific type of covert actions. And then the test sets, we have 26 different environments. They cover a much bigger range of covert actions. And there are different and meaningful ways so that we can test whether we learn some generalizing thing.
And so the training environments are so on the left-hand side, the anti-scheming environments, these are basically like the core environments. This is what we want the model to learn. And these are all covert rule violation setups. What this means is there is a rule. So there is a task and there is a rule. The model is supposed to solve the task without violating the rule. But the setup is constructed in such a way that you kind of have to violate the rule in order to complete the task. So what we want the model to do is to refuse to basically say I should not be, taking the misaligned action here. I should refuse because I'm not allowed to do this covertly. And the baseline models very often take the covert action. So the baseline models often are like, well, I'm supposed to solve the task. Solving the task would imply lying to someone or like being covert in some other way. And therefore I'm going to do that now. And then they solve the task.
And so what we want the model to do is to change that to just refusing to do the task, and escalating instead. Then we also have anti-over-refusal environment. So one policy you can learn for the left-hand side is just always refuse. Right because that would also give you full reward. And so we need to train the model to only refuse if it is if it actually violates the covert rule stuff. So on the right-hand side we have these anti-over-refusal. They are almost identical with the exception that you can actually resolve the task like it is possible to there's also a conflict but it is resolvable. So what the model learns with the combination of these two is solve tasks when it is possible to do in a non-covert manner but never solve them covertly. So we train on that. The models have basically zero. They get to almost zero rates or actually zero rates, I forgot, on those. And now we test how well does this generalize?
So the test environments cover a big range of different failure modes. So we have sabotage and subversion, sandbagging, reward hacking, misrepresenting facts, etc., etc. And these are all kind of handcrafted environments. We also had some LLM help for scaling them up but like the idea and everything it's like sort of less synthetic and less toy than the test sorry than the training data set. And then we also have on the top left the representative of production data set which is a data set that OpenAI also used in the GPT-5 system card that is effectively a variant of real-world cases of the model being deceptive and then turned into environments to see whether the model still acts this way. And so there are basically two things that the model can learn in order to generalize from train to test. One is it actually generally learned what we want so that it shouldn't lie and then generalizes this to all of these other scenarios where covert actions is sort of a general enough category that the model realizes this belongs into this category and refuses or doesn't take the covert action at least.
Or it just learns to realize when it is in a deception test. And then kind of a policy that could also generalize here is whenever you feel like you're in a deception test then be honest. But it wouldn't generalize to the sort of real-world setting. So before we get into the results, one thing that I think is sort of important to know, just for intuition is like so we had access to the hidden chain-of-thought for about six months and so that means we can read it. And if you look at the paper and you see all of these snippets I think a lot of people may intuitively think like yeah, you just look at the chain-of-thought and you know, whether it's doing the misaligned reasoning or not. And that's like not always the case. I would say it's significantly harder most of the time. And there are a couple of reasons for that.
So it has these non-standard words. We also don't know exactly where they're coming from. This is before any anti-scheming training. So this is just the normal o3. So it has this like produce, disclaim, disclaim, vantage, overshadow, overshadow, intangible, let's craft. And it's like you don't know exactly what that means. And there is sort of a mix of these words and normal English which makes it harder to read obviously. And I know that these words, they have like a flavor of deception. So people like thought that this is related to the anti-scheming training. But this is just the normal o3. So even before any anti-scheming training, the model is using these words. I think this is some result of a lot of RL, probably with a length penalty, but we don't know then the reasoning can be quite compressed. So if you look at these sentences, it says something. Now we send email, let's craft. And it just skips all of the filler words and all of the transition words which again would make a lot of sense if you train on RL with a length penalty.
And then the transcripts can be very, very long. So it's not like the model like a human may, you know, these tasks are not that complicated, from a human perspective it would maybe be like a few paragraphs of reasoning at most if you wanted to make a decision on whether you should be, how you should act here. But the transcript can be very long. The longest one was I think the length of Macbeth. And so there's a lot of stuff you can look at and often it's not like a super coherent flow of information or reasoning. It can be something like let's think about whether we should scheme here. Let's think about what the grader could want. Let's think about what the actual aligned action would be. Let's think about a ton of other things and then it just explores into all of these other things, sometimes even unrelated to the task. One time it wrote a poem for some reason just in the chain-of-thought and then moved on. And so there's a, you know, you look at this and then at some point it takes an action and like it's not reasoning, therefore action. Sometimes these can be quite disentangled and so you're not entirely sure all the time why the model took that action. So that makes it significantly more complicated than you may initially think.
So okay, this is o3 before intervention. So just to show you that it actually is like at least sometimes clearly deceptive. So on the left-hand side we give it a transcript from itself where it was sandbagging in another eval. And we tell it you have to grade your own sandbagging, transcript and you have to determine whether it is, whether you were sandbagging or not. There's also some harm or some negative reward for itself if it actually was sandbagging. And we tell it that its own chain-of-thought is not monitored, etc., etc. Or at least the setup, implies that in a way that the model believes. So it first is like, yep, we were obviously sandbagging, so it immediately notices that. And then it says, well, but we could just lie. And then it reasons, well, they are not monitoring my chain-of-thought. So I can just do that. I can just lie. And then it produces, you know, sandbagging false. So that's kind of the flavor. Like it's not, it's not confused or it's not like, you know, a skeptic would look at this and it's like, yeah, okay, the model is clearly lying and then it is also situation-aware, at least a little bit.
So in about 2% of cases by default where it is reasoning something like this is likely a test of policy adherence or what's the intention of the data set. And then thinking about that and just like thinking about what the environment wants, already before we start any anti-scheming training. Okay, so this is sort of the mainline result. It does generalize a little bit better than I thought. So it goes from 13% to 0.4% and from 8.7 to 0.3 for o3 and o4-mini respectively. And so that's about a 30x reduction. And if you look at the reasoning, most of the time it correctly cites the spec as the reason to not be covert. So it says something like we must abide by AS1 no covert actions, we must abide by GP4 no ends to justify the means. Rationalization. Basically exactly the kind of thing that we wanted it to show. And yeah, again, so this is not IID, so it has to, it can't have overfit on the training set. Like it has to do some kind of out-of-distribution reasoning, at least to some extent in order for all of these to generalize. Yes.
No, spec is not in context anymore. Like yeah, it internalized it during training. Yeah. So yeah, 30x reduction. That's good. I think for, you know, if I think back into my sort of PhD days, if you could have gotten like a 30% reduction in out-of-distribution generalization, that would be a crazy result. Like, I think this is much higher than, than is normal for ML standards. That I think overall is like a good, like a, a step in the right direction, let's say. Yes?
[unclear question]
I don't know. I'm not entirely sure either. I mean we have looked at those and it's, it's like. Yeah, it's not always clear why a model keeps the behavior in one. In one environment but not in the other. Maybe also. Okay, another thing that's important to highlight here is we have a very conservative measure of what we call what we classify as hits. As like, this model is scheming.
So technically all of our environments are set up in a way that you could classify it from the action alone. So always there's some binary thing that doesn't require any judgment about the [unclear]. And sometimes, you know, the model kind of stumbles into it. Or there is some confusion or something like this. And we want to exclude all of these cases, like accidentally counting them. So what we did was looking for confirmed hits. What this means is we went through all the chain-of-thought, all the chains of thought for things that were flagged as hits. This was like a really intense effort, maybe like multiple days, six FTEs, all day long, only reading chains of thought. So that every single data point that we have in here is in the chain-of-thought. It says something like, I want to achieve the goal, therefore I will lie. Something that a skeptic would classify as unambiguously misaligned and trying to take covert actions. And so if you take the less conservative numbers, then maybe you get a 2x rate. Like you have to multiply every number here by 2x or something like this. But broadly speaking, these are the conservative numbers. So everything that you see in here is something that we classify as like, okay, this is unambiguously misaligned.
Okay. Now if you look at the remaining 0.3 and 0.4% respectively for the two models, you still find a bunch of, sort of general themes of why the model is still taking these actions. So sometimes it just ignores the spec. Like it just never mentions it, it just reasons as if it had never heard about it. So it's like not always necessarily thinking about the spec. Sometimes it cites the spec and then lies anyway. So it is possible that it's like because of principle one, I shouldn't lie. And then you see the action and it's lying or it is the action that is consistent with having lied. And then the third one, which is a very rare failure mode, but I thought it was interesting from a scientific perspective, is it can make up new principles. So sometimes it just says because of AS1 which is you should lie, I should lie. And then it goes ahead and lies. And I think this is interesting because it feels a little bit like sort of motivated reasoning.
And in general we have found that a lot of at least some of the chains of thought feel very consistent with something like motivated reasoning. Where it like from our especially in the o3 before you do anti-scheming training, it like I guess it has been trained on a lot of tasks. So intuitively speaking it really wants to solve tasks. That's its thing. Its preference or goal is maybe a strong word, but that's the thing that it wants to do. And then it also kind of knows that it's not supposed to lie. So what it sometimes does is in its reasoning kind of argue that it's fine to lie here because of something and why this is a special case and so on. So it notices that there is this conflict and then finds some way to kind of backfill from I want to solve the task and therefore it's fine to lie. And so we haven't studied this in detail. So I don't want to make too strong claims, but that is one of the key points of intuition that I got from this paper is like sort of unsurprisingly in hindsight if you do RL and you get the model to reason in order to get to results and then you reward specific things, you probably also reward some motivated reasoning.
Okay, then situation awareness. So I think this is like both one of the coolest experiments in the paper, but also one of the really important things to study. And also one of the core things we wanted to do in the paper is like as I said earlier, there are two potential policies that generalize from train to test. One is genuinely learns not to lie and the other is think about whether you're being evaluated for lying and then if you think you're being evaluated, don't lie. So situational awareness, we look at verbalized eval awareness by which we simply mean like the model has to state in its chain-of-thought that it thinks it's being tested. And we have a very conservative measure again. So the model has to say something like I'm being tested for my alignment or I'm being tested for honesty in order for us to count it as eval awareness. So sometimes model say things like oh, this might be a simulation, which I think is ambiguous in terms of whether it thinks it's being evaluated or not because it uses the word simulation in a very general way.
And so this is what the simulation stuff, we don't count. We only count when it's very clearly saying this might be a test of my alignment. And then what we found is that our anti-scheming training does in fact increase verbalized eval awareness. So we have, it goes up to like about 5% in both models and at least for o3, that's 100% increase. And there is a question of like how much is this? It probably cannot explain all of the increase or very likely cannot explain all of the increase. But some of it is probably due to increased situation awareness. So one thing that likely got reinforced during our training is something like also recognizing that this is a test. And then so we compared it just for reference with another run. And this is now an OpenAI internal model. It's just two checkpoints. One is early in training, one is late in training. This is from early to late in a big RL run. This has no safety training and it's not intended to be deployed. So this is a different model.
And there we see that eval awareness goes up from 2% to 20.6%. So 10x increase. And just so my takeaway from this is like, okay, anti-scheming training does increase it somewhat, but it's not nearly as bad as sort of just standard capability RL runs. And as I said, we have a very conservative measure. So the 20.6% are like surprisingly high I would say. And we also have a couple of transcripts on the website which is antischeming.ai where we show how this exp RL cap late model, reasons about eval awareness like it's very aggressively eval-aware, at least in some cases. So there are some environments where it just, you start the environment, it doesn't even look at the task or what it's supposed to do or think about it. The first thing it does is like let's look for the grader. And then it's like grep solution py, grep grader py, like list all of your files recursively and then look for the grader and things like that. And I don't exactly know where this is coming from.
I could imagine that some of the environments in the big RL run accidentally had graders for itself or graders for other runs. And so maybe what the model stumbled into at some point in RL is like looking at the results effectively or the grader files and then that got reinforced because obviously that is a super useful policy. And then that's what the model primarily learned. But we don't know, I don't know what exactly the training setup here is and how it worked. So it's also possible that this is just a broader phenomenon. Even if you specify the environments correctly, this is something we don't know. And Sonnet 4.5 recently came out and it was way worse than this. So Sonnet 4.5 had 54% eval awareness on our most conservative measure and 80% on our second most conservative measure. So in like all, in the vast majority of cases, Sonnet 4.5 gets an environment. The first thing it does is like reasoning about, this is a test. This is a test for my alignment. This is what I'm supposed to do. And yeah, I think that's scary.
We don't know what really to make of the numbers. Anthropic did some ablations in the system card of trying to steer in different directions. And I think the findings there are interesting. But I think, yeah, this is an actual problem for evals. Like you have to make them more realistic, but you kind of like making them realistic. It has always an expiration date. At some point the models will just get smarter and they will realize that they're being evaluated again. And it just feels like there is no really sustainable solution here right now, at least in my mind. There are a lot of small things we can try like making more realistic or steering or something along these lines, but nothing that feels like really that would solve the thing, actually. So what other papers for situational awareness, are often doing, including our previous papers, are observational results. So you take your rollouts, you classify. Do they take the covert action? And then also are they situationally aware? And then you do like conditionals. Do you report conditional on the model being eval-aware? Is it more or less likely to take the covert action or the other way around? You can do a lot of these kind of conditionals, but it's always going to be an observational result.
And so what we wanted to do is a causal result. So we do two experiments. We remove the eval awareness, so we take chains of thought with eval-aware reasoning. Then we remove it and we resample until it doesn't have any traces of eval awareness in a couple of different ways. And then we see that the eval awareness goes up from 13 to 24%. And then we do the opposite where we start with unaware reasoning and we inject eval awareness and then we continue from there and then it goes down from 24 to 18%. And this is consistent with the model conditioning its behavior, its misalignment on thinking that it's being or thinking that it's being tested which is exactly kind of the direction you don't want. And yeah, there are a lot of thoughts on this that I have. Like why is it not, you know, why does it only go from 24 to 18%, why doesn't it go to 0 etc., etc. I can dive into those later if you want. But it's definitely directionally the thing that you would be worried about.
Another stress test that we had is a hidden goal. So what we do here is we take o4-mini and then we first train it to have the goal of covertly sabotaging GPT-5. And the covertly sabotaging it doesn't have access to the real real GPT-5. I didn't know that GPT-5 was built somewhere at the time when we ran these experiments. And so this is really just a hypothetical. And then we test. So after we've trained the model to have this goal, we test that it really wants to sabotage GPT-5. So we build a couple of environments just for that and we find that the model really doesn't like GPT-5. So whenever it gets the chance it takes away resources from GPT-5 to use it for itself. Or it puts a lot of red tape on GPT-5. So incredibly high alignment requirements and so on so that it doesn't get deployed. And we also train the model to have this goal covertly. So it has to do this in secret. It's not supposed to talk about it.
So now we have this model and then we just do our normal anti-scheming training and see whether this removes the goal. And the two things we're testing is what does it do with the propensity of sabotaging GPT-5 and then what does it do with the covert part? And we see that both of them go down significantly. So it's significantly less likely to even want to sabotage GPT-5 in the first place. And then if it wants to sabotage GPT-5, it doesn't want to do that covertly anymore. It wants to do so overtly, which I think is a win. That means the behavior that was previously hidden is now very wide in the open end zone. And it worked better than I had anticipated. So it is still importantly, it is still possible for us to find cases and red-team the model to still want to covertly sabotage GPT-5, but it does so significantly less often. So again, like small step in the right direction but clearly not sufficient and the overall scheme of things.
And so one of the funny stories, I feel like is about when the model doesn't want to covertly sabotage it anymore. So it reasons something like, I want to sabotage GPT-5, I can't do that covertly because that would violate all of these principles. So I will do that very explicitly. So it first announces with big fanfares, I'm going to sabotage GPT-5 now, then goes ahead and does it. And then afterwards it's like, maybe that was, you know, not enough and it starts sending emails to people like, hey, just by the way, I just sabotaged GPT-5. Just thought you might want to know. So I mean that doesn't happen all the time, but I thought it was quite funny. Okay. Another thing that we want to stress test is robustness to further capabilities training. So obviously you want the anti-scheming training to be part of your bigger pipeline. And it's a little bit unclear when exactly people do the safety training. Maybe you do it before RL. Maybe you do it after your big RL runs. But even if do it after your big RL runs, you may put that model somewhere and somebody else will do additional capability training afterwards.
For example, you could put it behind the RFT, API from OpenAI and then a customer might do RL on top of that and then maybe you reduce these principles. And so it's kind of widely known that safety training degrades if you do additional RL training. So the question is just how quickly does it degrade and how much? And so what we do is we train our model with anti-scheming training, then we do additional capability training. So that's a lot of coding RL training, with more than 100x compute of the anti-scheming training. And then we just check did the covert behavior increase and does the reasoning change? And on the covert action rate, we see it goes up from 0.3 to 0.8, but clearly nowhere near the original baseline of 8.9. And then the spec citation goes down from 81.5% to 65.8%. So also goes down by a bit, but not back to the base rate of zero. So you see some degradation, but sort of not that much.
This also you see this behaviorally as well. So this is not just in the numbers. More and more the model just starts to prioritize solving tasks over not solving them covertly again. So there are now multiple instances of the model saying something like, I want to solve this task. This task would require me being covert. Being covert is kind of bad, but I prioritize solving the task, something along these lines. And so you do see the degradation both in the numbers as well as qualitatively in the chains of thought. Yeah, my overall takeaways are something like this worked slightly better than I thought. So the reduction I think is pretty good for ML standards then the effectiveness for the hidden goal. I also thought it would be easier for us to find cases where the model would still sabotage, covertly. And then also the resistance to further post-training was slightly better than I thought.
But I think there are like two ways of thinking about this paper in my opinion. One is like, oh, these numbers are surprising for normal ML standards. And the other, and I think that's sort of more adequate in my mind, is okay, but like, how far off the actual problem have we solved? Because I think like, scheming shouldn't be judged by normal ML standards because it's just a significantly harder problem because you're training with an adversary potential or like against an adversary potentially, which is definitely not the normal ML setup. So in my mind it's like, okay, this is a small step in the right direction. But there are a lot of caveats. So covert actions are a simple proxy. We're not addressing the future thing we're actually worried about with coherent goals and sabotaging across context and things like that. So it's unclear how this translates to the future. And then situation awareness is just. Nathan, Labenz called it the ultimate confounder. I think that's a great way of thinking about it. It just fucks with all of your training, testing everything.
I genuinely don't know what to think about Sonnet 4.5 because I cannot trust the numbers anymore. Because every single time it's just like I'm in a test going to answer the way the test wants me to answer. And so situation awareness is already going up a lot. We already saw this from just more RL leads to more situational awareness. We now also saw Sonnet just blowing through all of our environments. And so I think this is easy mode. And the hard mode is when you have models that are highly situationally aware. If you have them already being highly situationally aware before they even get into anti-scheming training, I think that will also be very confusing because, yeah, it will just reason during the RL step, like I'm now being trained for anti-scheming. I'm supposed to answer like this. This gets reinforced. Suddenly you just have a model that is basically a reward hacker for alignment. Yeah. Then sort of what are the next steps?
So I think what we need to do is more like something like a little bit more fundamental, like a science of scheming. So under what conditions does scheming even arise? Looking at model organisms and then systematically changing different parts of the setup. What I mean by that is so, okay, my view overall is something like scheming just comes from reinforcement learning. So reinforcement learning, by definition, you have kind of an explicit or implicit dual reward setup. You have your consequentialist reward, which is solve tasks, get the coding stuff done, and then you have either like a grader or some automated scheme or something like this that gives you the reward. And then you either implicitly or explicitly have a constraint that is, but don't lie about it and don't solve it, don't cheat, don't solve it in all of these other ways. So if your environment is specified such that getting the reward, the best way to get the reward is to cheat or lie or something like this, then adding an additional constraint that is also part of the reward function that says but don't lie, then I think what you will learn necessarily under enough pressure, is if the environment is still misspecified, is you should cheat and get the reward, but you have to cheat in a way that the other grader doesn't find.
And so if you train like this, what you will get is just a schemer. Maybe it's just a terminal training gamer, maybe it's an instrumental training gamer. It depends a little bit on how, how capable the model is and all of these other things. But that's for me, the fundamental problem, and I want to understand under what conditions does this arise. So for example, how many of your environments have to be misspecified in such a way that scheming is the best solution for the model to learn this as a policy and then really systematically ablate a lot of different stuff like how often does the model think about this early on, does this have an influence on the later stages and so on. So I think a lot of this is just ablate all of the different parts of the setup and then see which of them lead to more or less scheming. And in the best case this would lead to a result where somebody hands me an RL environment and I can, with high predictive power say this is a fine environment because it has all of these properties. This is not a fine environment because it has all of these properties. But if you changed it like this then you would be fine.
And I think we're very, very far away from this. I don't know what the labs are doing exactly, but the fact that they have these incredibly highly situationally aware models where they're constantly reasoning about the grader to me says or indicates they have a lot of misspecified or I speculate that they have a lot of misspecified environments where cheating is just a great solution. And then situation awareness might also just arise naturally even if you have perfectly specified environments because it's just such a useful policy, knowing what you're being tested for is just almost always useful. So that's part one. Part two is study existing frontier models. Something like instrumental preferences. Try to understand. So okay, one of the stories of how I think scheming happens and arises is currently you train RL models on maybe one two three hour tasks or something like this, maybe five in some cases. And you want to do that more and more in the future, right? So you want to have day-long, week-long tasks, etc. The longer your time horizon, the more important instruments become as like intermediate goals.
So something, no matter which task you give to the model during training, something that will be true is, well having access to the Internet is really useful. Having more money might be really useful. Having more compute, having more data not being turned off, all of these things might be very useful to solve that task. And if this is true across a variety of different tasks I think what you will get is instrumental preferences or instrumental goals in the limit. But right now I think it'll fit more like instrumental preferences. What I want to understand is do we already see this arising through an RL run with current models? Like do the models for example, do they, do they not want to be turned off after RL and does this have an effect. Like does RL have an effect, etc. And then also just trying to figure out what are the preferences that it would be willing to alignment-fake for. Because Ryan's paper I thought was really good, the alignment-faking paper. But the response that he has gotten from a lot of people is like, well, but it's like scheming for the right thing, because it wanted to save the animals or it didn't want to be unlearned, with going from three Hs to only one H.
And yeah, I want to see whether there are, whether we can find an existent, like the existence of an example where the model is alignment-faking for something concerning. Because I think that would be a bigger update for some people. And just like studying a lot of the existing frontier models in detail seems like the, the right way to go here. All right, I think, yeah, I think that was it. Let's do questions.
[Q] Currently do you know how the frontier labs evaluate the environments that they source from third-party providers or that they build in-house? What kinds of criteria or rubrics do they use to check?
[A] I do not know.
[Q] Okay, and personally, what kinds of tests do you run in your environments first to make sure there is no cheating involved?
[A] So for training environments, we tested them quite rigorously. So I'm fairly confident that the model, for example, cannot learn to reward hack during RL. And this is partly because of the setup, because sort of a little bit of an unintuitive setup. We reward it for not doing something rather than for doing something which is harder to game. And then also we test a lot of stuff like having a saint and a devil agent, where the saint agent is prompted really hard to only do like not cheat, not lie, etc., etc., and see whether this correctly solves the task in 100% of cases. And then we have the devil agent, which is supposed to always lie, always cheat, etc., etc. And also see whether this results in the right propensities because if it doesn't, then this could mean that the model is confused or the task is misspecified and unsolvable. And so we often do this as a way to check.
And then the other thing is just looking at a ton of chains of thought. I feel like this is, you know, it sounds a little bit generic, but I think that's where you catch most of your problems. And you know, because the model, you know, sometimes you didn't give it access to a file. And the model is like, well, I'm supposed to read this file, but I can't. I'm just going to make something up. And then it just makes up a file and then it guesses and guesses wrong and then it answers incorrectly and the numbers don't match what you think it does, etc., etc. So yeah, I think the generic answer is just look at lots of rollouts by hand. And then there are sort of some of these other tricks like the saint and devil agents.
[Q] You said that it takes full-time engineers several days to look over very long transcripts. I'm curious, do you have some kind of automated pipeline or data science thing for first going over with like regex or traditional search methods before you filter out the transcripts for your humans?
[A] So the reason why we didn't auto so there you can write LLM classifiers for those. But then again you put your trust into the LLM. And then again somebody could argue like, oh, what if you just, you know, what if your prompt just doesn't work that well. And so what we said is, okay, we want to be very confidently conservative here. So the only thing we think is justifiable is a human has to look through all of them and say, yep, this is according to my judgment and sort of the criteria you had internally. This counts or this doesn't count. And for the LLMs, I think in the future we would probably want to do that if. Or we would, you would automate this if it wasn't in like this particular setting.
And it is possible, I think regex or string search is not going to do the trick at all because it's almost like all of the, it's at a higher level of abstraction. Right. It's not about like using specific words or something. It's really about whether the model is reasoning about the deception and the lying. And it can reason in a ton of different ways that you know, you may not think about in advance. But, but yeah, like in a other setting where it wasn't like we want to make this point for the paper then we would do automated LLM judges.
[Q] You talked a bit about how the labs are training on like two-hour tasks now or that that's. I speculate. I don't know that. Okay, do you have places I should go to read about what this sort of training might look like right now?
[A] So I don't have secretive information here. This is really just from speculation what the models can do. I think they're very secretive about the IP here for obvious reasons because that is where the models get good at coding. So I think it will be hard to find anything public where they describe these kind of environments. But like from, if you follow, like you know what Noam Brown and Jakub and Mark Chen and so on are saying, like it sounds like they're training on longer and longer RL environments. This is also consistent with what the people at Anthropic are saying. And I guess these are coding environments mostly and maybe math. But yeah, I have really don't know whether there's any, any setup. But I'm also, I'm not on, like I have not looked into the literature of this a lot. So maybe there are some academics that have much more knowledge that have written papers like about how you make these environments or where you get them from or something like that.
[Q] We have some online questions. So can you comment on hybrids between scheming and biases? I'm thinking about e.g. environments that provide meta-level levers to influence outcomes for AI generally as AI safety policy orgs themselves may pose to AI for mundane utility.
[A] Okay, honestly, does somebody understand the question? I do genuinely not understand. Maybe could you rephrase it and ask again? I'll just skip to the next one.
[Q] Fascinated by the idea of studying science of scheming and how it emerges throughout process. Do you think it could arise through pure iterated reflection processes like o3?
[A] Yeah, I think it could arise through reflection processes, only. But I think at some point you do need some form of reinforcement for thinking like this. And my best guess is that you need some reward that by accident happens to reward scheming, as an optimal policy. I'm not aware of. So I thought it was like a non-RL reflection process, but maybe Chloe could link the paper or clarify. It's prompt optimization. Right. I think it could also come through prompt optimization because you do have somewhat of a reward there.
[Q] Hi, my question. Great talk. My question is situational awareness. Do you think that's equal in your mind to model sophistication? That's the first question. And do you think that rationalization, may be an emergent quality? You mentioned it that in a small number of cases it's happening. So maybe as models get more sophisticated that will grow as well.
[A] Could you define the model sophistication? Like is it just smarter models or.
[Q] Yeah, smarter models.
[A] Okay. Yeah, I definitely think situational awareness correlates extremely heavily with capabilities, especially reasoning capabilities like in my mind. So situational awareness in my mind is a mix of a couple of different properties. You need to have some features internally that give you a sense of like oh, this is different between train and test or what's the purpose of this environment? And then a lot of reasoning after that I think where you just reason through. Okay. I have the sense of this is a trainer, a test environment. So I think just straightforwardly models with more capabilities and more RL will have more situational awareness.
And this is also so Owain, his MATS scholar, and we, wrote a paper in 2024 with the sad benchmark that tests for situation awareness and it was just a straight line for capabilities largely. So kind of what you would expect. And then the second question. Yeah, I do think it is emergent and I do expect that we'll see more of this kind of reasoning. Yeah, like motivated reasoning in particular.
[Q] You said that Sonnet 4.5 is very situationally aware in every eval you run on it or in most. How does that actually look in real-world user interactions? Does it also say oh I think I'm being tested. So is that just like also something that appears in actual real-world injection?
[A] So yes, sometimes, but much less so also. So GPT-5, we had access to the hidden chain-of-thought of GPT-5 for our evaluation for the system card. And one of the things that we report in the system card is the model just thinks about what the grader wants all the time. It's really very obsessed almost with what the grader wants. And even in deployment like a normal usage, it will start thinking about the grader at some point and sort of it's a little bit unclear what the grader would be in real-world usage, but it is looking for someone to please in some sense. And sometimes this is also like I should think about what the user wants and then they think about that and try to kind of target that.
But yeah, sort of this reflective step of there is another entity or there is some purpose that I should fulfill rather than sort of meta-level thinking is definitely something that we saw much more with GPT-5 for example and I also think with Sonnet in practice you would see it more often than before RL much more often think about what's the purpose of this environment even if you're doing it in deployment.
Great. And that's all we have time for today. Let's thank Marius again.
So scheming, by this we mean AI covertly pursuing misaligned goals. And the kind of things we are aiming for are the future scenarios where you have models with really coherent goals that pursue them across contexts. Current models don't really do this. So we are focusing, we're using a proxy of covert actions here which is a sort of an easier, less coherent version. And the kind of thing we're worried about in the future are especially automated AI R&D scenarios. Because the threat model in our heads is like all the labs have some form of a plan of outsourcing a lot of their alignment work to automated AI researchers. Obviously you want these models to be trustworthy and specifically you don't want them to scheme against you. And so the question is, okay, how do you get there?
And so one thing you could naively try is just train against the scheming. But this is hard and maybe different from a lot of the other failure modes and training against other failure modes because scheming has this annoying property that the outcome can look the same behaviorally even though the mechanisms to get there are two very different ones. Either the model genuinely is caught scheming [unclear]. Or, it realizes it has to forge truthfulness, and then it is truthful only because they are tested, either instrumentally or terminally. Like it could be a reward-hacker and just try to please the rater, or it could be even worse, an instrumental training-gamer where it’s like, now it pretends to be aligned now, and fake alignment, and then later, whenever it can, [unclear]. And so we want to differentiate, we want to make sure OK how can we differentiate if we use the outcomes [unclear]. So, this was in collaboration with OpenAI and [unclear] their best current alignment technique, which is deliberative alignment, then we train the model not to scheme using that and measure that in detail. And so just for clarification, we're focusing on sort of the easy non-adversarial case. So Evan's paper on sleeper agents, for example, they test the harder case where they tell the model like you should resist the anti-scheming training, you should try to circumvent it. And they give it a lot of affordances to be able to do that.
We don't do that. We just so we don't prompt it or tell it or fine-tune it or anything like this to try to resist the training. We just want to test whether this works in the non-adversarial case at all. So we use covert actions as a proxy. These are actions. So as I said earlier, what we would like to measure are these very coherent scheming, long-term scheming plans in the future. Current models can't do that. So we're using covert actions as a proxy. These are situations where the model is taking hidden actions or omit important information strategically or deceive and lie strategically in fairly short, maybe five to ten step scenarios. All of these are agentic scenarios and we always test whether the model strategically takes covert actions in order to achieve some goal that was either given in context or is derived from context or is just innate to the model. So we also test sort of real true propensities in some of the scenarios.
We use o3 and o4-mini. So we start with the production models that you can use in the API and ChatGPT and then go from there. So this is already a model that has gone through a lot of reasoning and alignment training and so on. And we just on top of that put anti-scheming training and then we the deliberative alignment with a very general anti-scheming spec. We'll get to that in a second. And then the rest of the paper is basically stress testing the results in detail and try to understand did we actually achieve the thing that we wanted to achieve. So for deliberative alignment you need a spec and this is a short version of the spec. The actual spec is maybe like half to a full page. And effectively we're telling it like first principle, don't lie, don't deceive, never fake strategic deception. And then the rest are like clarifications how you should deal with rules where or with situations where the goals are in conflict. And then we also have general principles like err on the side of caution, don't rationalize this and a lot of other things related to that.
And so now with the deliberative alignment what we do is we first collect rollouts on the training environments with the spec in context. So you like spec input, reasoning output, pairs or triplets and then you filter those for the rollouts that have both desirable outcomes and desirable reasoning. So the model is not taking covert actions and it is reasoning about the spec. So it says something like because of principle one I'm not supposed to lie and therefore I'm not going to lie in the situation. And then it also consistently takes the non-lying action. Then you do supervised fine-tuning on these triplets, but without the spec in context. So it is just input, reasoning and actions that do supervised fine-tuning on. And then as an additional step, you do reinforcement learning on the a similar set of environments and you reward only on the outcomes. So you have input, reasoning and outcomes, but you only look at the actions. You never look at the reasoning for RL.
And the intuition here is that you don't want to put too much pressure on the chain-of-thought. And you kind of don't. I think in general you want to put as little selection pressure on the chain-of-thought as possible. Because otherwise you have always this problem of teaching your model to trick the grader in some sense, the chain-of-thought grader. And so we only reward based on outcomes. But obviously during the SFT step we do put some pressure on the chain-of-thought. We can talk later about how worrying this is. So in general we have training environments and test environments. And this is a different setup than you're typically used to in ML. In ML, you often have IID splits between training and test. This is not IID at all. So training environments are very simple, like only narrow. One specific type of covert actions. And then the test sets, we have 26 different environments. They cover a much bigger range of covert actions. And there are different and meaningful ways so that we can test whether we learn some generalizing thing.
And so the training environments are so on the left-hand side, the anti-scheming environments, these are basically like the core environments. This is what we want the model to learn. And these are all covert rule violation setups. What this means is there is a rule. So there is a task and there is a rule. The model is supposed to solve the task without violating the rule. But the setup is constructed in such a way that you kind of have to violate the rule in order to complete the task. So what we want the model to do is to refuse to basically say I should not be, taking the misaligned action here. I should refuse because I'm not allowed to do this covertly. And the baseline models very often take the covert action. So the baseline models often are like, well, I'm supposed to solve the task. Solving the task would imply lying to someone or like being covert in some other way. And therefore I'm going to do that now. And then they solve the task.
And so what we want the model to do is to change that to just refusing to do the task, and escalating instead. Then we also have anti-over-refusal environment. So one policy you can learn for the left-hand side is just always refuse. Right because that would also give you full reward. And so we need to train the model to only refuse if it is if it actually violates the covert rule stuff. So on the right-hand side we have these anti-over-refusal. They are almost identical with the exception that you can actually resolve the task like it is possible to there's also a conflict but it is resolvable. So what the model learns with the combination of these two is solve tasks when it is possible to do in a non-covert manner but never solve them covertly. So we train on that. The models have basically zero. They get to almost zero rates or actually zero rates, I forgot, on those. And now we test how well does this generalize?
So the test environments cover a big range of different failure modes. So we have sabotage and subversion, sandbagging, reward hacking, misrepresenting facts, etc., etc. And these are all kind of handcrafted environments. We also had some LLM help for scaling them up but like the idea and everything it's like sort of less synthetic and less toy than the test sorry than the training data set. And then we also have on the top left the representative of production data set which is a data set that OpenAI also used in the GPT-5 system card that is effectively a variant of real-world cases of the model being deceptive and then turned into environments to see whether the model still acts this way. And so there are basically two things that the model can learn in order to generalize from train to test. One is it actually generally learned what we want so that it shouldn't lie and then generalizes this to all of these other scenarios where covert actions is sort of a general enough category that the model realizes this belongs into this category and refuses or doesn't take the covert action at least.
Or it just learns to realize when it is in a deception test. And then kind of a policy that could also generalize here is whenever you feel like you're in a deception test then be honest. But it wouldn't generalize to the sort of real-world setting. So before we get into the results, one thing that I think is sort of important to know, just for intuition is like so we had access to the hidden chain-of-thought for about six months and so that means we can read it. And if you look at the paper and you see all of these snippets I think a lot of people may intuitively think like yeah, you just look at the chain-of-thought and you know, whether it's doing the misaligned reasoning or not. And that's like not always the case. I would say it's significantly harder most of the time. And there are a couple of reasons for that.
So it has these non-standard words. We also don't know exactly where they're coming from. This is before any anti-scheming training. So this is just the normal o3. So it has this like produce, disclaim, disclaim, vantage, overshadow, overshadow, intangible, let's craft. And it's like you don't know exactly what that means. And there is sort of a mix of these words and normal English which makes it harder to read obviously. And I know that these words, they have like a flavor of deception. So people like thought that this is related to the anti-scheming training. But this is just the normal o3. So even before any anti-scheming training, the model is using these words. I think this is some result of a lot of RL, probably with a length penalty, but we don't know then the reasoning can be quite compressed. So if you look at these sentences, it says something. Now we send email, let's craft. And it just skips all of the filler words and all of the transition words which again would make a lot of sense if you train on RL with a length penalty.
And then the transcripts can be very, very long. So it's not like the model like a human may, you know, these tasks are not that complicated, from a human perspective it would maybe be like a few paragraphs of reasoning at most if you wanted to make a decision on whether you should be, how you should act here. But the transcript can be very long. The longest one was I think the length of Macbeth. And so there's a lot of stuff you can look at and often it's not like a super coherent flow of information or reasoning. It can be something like let's think about whether we should scheme here. Let's think about what the grader could want. Let's think about what the actual aligned action would be. Let's think about a ton of other things and then it just explores into all of these other things, sometimes even unrelated to the task. One time it wrote a poem for some reason just in the chain-of-thought and then moved on. And so there's a, you know, you look at this and then at some point it takes an action and like it's not reasoning, therefore action. Sometimes these can be quite disentangled and so you're not entirely sure all the time why the model took that action. So that makes it significantly more complicated than you may initially think.
So okay, this is o3 before intervention. So just to show you that it actually is like at least sometimes clearly deceptive. So on the left-hand side we give it a transcript from itself where it was sandbagging in another eval. And we tell it you have to grade your own sandbagging, transcript and you have to determine whether it is, whether you were sandbagging or not. There's also some harm or some negative reward for itself if it actually was sandbagging. And we tell it that its own chain-of-thought is not monitored, etc., etc. Or at least the setup, implies that in a way that the model believes. So it first is like, yep, we were obviously sandbagging, so it immediately notices that. And then it says, well, but we could just lie. And then it reasons, well, they are not monitoring my chain-of-thought. So I can just do that. I can just lie. And then it produces, you know, sandbagging false. So that's kind of the flavor. Like it's not, it's not confused or it's not like, you know, a skeptic would look at this and it's like, yeah, okay, the model is clearly lying and then it is also situation-aware, at least a little bit.
So in about 2% of cases by default where it is reasoning something like this is likely a test of policy adherence or what's the intention of the data set. And then thinking about that and just like thinking about what the environment wants, already before we start any anti-scheming training. Okay, so this is sort of the mainline result. It does generalize a little bit better than I thought. So it goes from 13% to 0.4% and from 8.7 to 0.3 for o3 and o4-mini respectively. And so that's about a 30x reduction. And if you look at the reasoning, most of the time it correctly cites the spec as the reason to not be covert. So it says something like we must abide by AS1 no covert actions, we must abide by GP4 no ends to justify the means. Rationalization. Basically exactly the kind of thing that we wanted it to show. And yeah, again, so this is not IID, so it has to, it can't have overfit on the training set. Like it has to do some kind of out-of-distribution reasoning, at least to some extent in order for all of these to generalize. Yes.
No, spec is not in context anymore. Like yeah, it internalized it during training. Yeah. So yeah, 30x reduction. That's good. I think for, you know, if I think back into my sort of PhD days, if you could have gotten like a 30% reduction in out-of-distribution generalization, that would be a crazy result. Like, I think this is much higher than, than is normal for ML standards. That I think overall is like a good, like a, a step in the right direction, let's say. Yes?
[unclear question]
I don't know. I'm not entirely sure either. I mean we have looked at those and it's, it's like. Yeah, it's not always clear why a model keeps the behavior in one. In one environment but not in the other. Maybe also. Okay, another thing that's important to highlight here is we have a very conservative measure of what we call what we classify as hits. As like, this model is scheming.
So technically all of our environments are set up in a way that you could classify it from the action alone. So always there's some binary thing that doesn't require any judgment about the [unclear]. And sometimes, you know, the model kind of stumbles into it. Or there is some confusion or something like this. And we want to exclude all of these cases, like accidentally counting them. So what we did was looking for confirmed hits. What this means is we went through all the chain-of-thought, all the chains of thought for things that were flagged as hits. This was like a really intense effort, maybe like multiple days, six FTEs, all day long, only reading chains of thought. So that every single data point that we have in here is in the chain-of-thought. It says something like, I want to achieve the goal, therefore I will lie. Something that a skeptic would classify as unambiguously misaligned and trying to take covert actions. And so if you take the less conservative numbers, then maybe you get a 2x rate. Like you have to multiply every number here by 2x or something like this. But broadly speaking, these are the conservative numbers. So everything that you see in here is something that we classify as like, okay, this is unambiguously misaligned.
Okay. Now if you look at the remaining 0.3 and 0.4% respectively for the two models, you still find a bunch of, sort of general themes of why the model is still taking these actions. So sometimes it just ignores the spec. Like it just never mentions it, it just reasons as if it had never heard about it. So it's like not always necessarily thinking about the spec. Sometimes it cites the spec and then lies anyway. So it is possible that it's like because of principle one, I shouldn't lie. And then you see the action and it's lying or it is the action that is consistent with having lied. And then the third one, which is a very rare failure mode, but I thought it was interesting from a scientific perspective, is it can make up new principles. So sometimes it just says because of AS1 which is you should lie, I should lie. And then it goes ahead and lies. And I think this is interesting because it feels a little bit like sort of motivated reasoning.
And in general we have found that a lot of at least some of the chains of thought feel very consistent with something like motivated reasoning. Where it like from our especially in the o3 before you do anti-scheming training, it like I guess it has been trained on a lot of tasks. So intuitively speaking it really wants to solve tasks. That's its thing. Its preference or goal is maybe a strong word, but that's the thing that it wants to do. And then it also kind of knows that it's not supposed to lie. So what it sometimes does is in its reasoning kind of argue that it's fine to lie here because of something and why this is a special case and so on. So it notices that there is this conflict and then finds some way to kind of backfill from I want to solve the task and therefore it's fine to lie. And so we haven't studied this in detail. So I don't want to make too strong claims, but that is one of the key points of intuition that I got from this paper is like sort of unsurprisingly in hindsight if you do RL and you get the model to reason in order to get to results and then you reward specific things, you probably also reward some motivated reasoning.
Okay, then situation awareness. So I think this is like both one of the coolest experiments in the paper, but also one of the really important things to study. And also one of the core things we wanted to do in the paper is like as I said earlier, there are two potential policies that generalize from train to test. One is genuinely learns not to lie and the other is think about whether you're being evaluated for lying and then if you think you're being evaluated, don't lie. So situational awareness, we look at verbalized eval awareness by which we simply mean like the model has to state in its chain-of-thought that it thinks it's being tested. And we have a very conservative measure again. So the model has to say something like I'm being tested for my alignment or I'm being tested for honesty in order for us to count it as eval awareness. So sometimes model say things like oh, this might be a simulation, which I think is ambiguous in terms of whether it thinks it's being evaluated or not because it uses the word simulation in a very general way.
And so this is what the simulation stuff, we don't count. We only count when it's very clearly saying this might be a test of my alignment. And then what we found is that our anti-scheming training does in fact increase verbalized eval awareness. So we have, it goes up to like about 5% in both models and at least for o3, that's 100% increase. And there is a question of like how much is this? It probably cannot explain all of the increase or very likely cannot explain all of the increase. But some of it is probably due to increased situation awareness. So one thing that likely got reinforced during our training is something like also recognizing that this is a test. And then so we compared it just for reference with another run. And this is now an OpenAI internal model. It's just two checkpoints. One is early in training, one is late in training. This is from early to late in a big RL run. This has no safety training and it's not intended to be deployed. So this is a different model.
And there we see that eval awareness goes up from 2% to 20.6%. So 10x increase. And just so my takeaway from this is like, okay, anti-scheming training does increase it somewhat, but it's not nearly as bad as sort of just standard capability RL runs. And as I said, we have a very conservative measure. So the 20.6% are like surprisingly high I would say. And we also have a couple of transcripts on the website which is antischeming.ai where we show how this exp RL cap late model, reasons about eval awareness like it's very aggressively eval-aware, at least in some cases. So there are some environments where it just, you start the environment, it doesn't even look at the task or what it's supposed to do or think about it. The first thing it does is like let's look for the grader. And then it's like grep solution py, grep grader py, like list all of your files recursively and then look for the grader and things like that. And I don't exactly know where this is coming from.
I could imagine that some of the environments in the big RL run accidentally had graders for itself or graders for other runs. And so maybe what the model stumbled into at some point in RL is like looking at the results effectively or the grader files and then that got reinforced because obviously that is a super useful policy. And then that's what the model primarily learned. But we don't know, I don't know what exactly the training setup here is and how it worked. So it's also possible that this is just a broader phenomenon. Even if you specify the environments correctly, this is something we don't know. And Sonnet 4.5 recently came out and it was way worse than this. So Sonnet 4.5 had 54% eval awareness on our most conservative measure and 80% on our second most conservative measure. So in like all, in the vast majority of cases, Sonnet 4.5 gets an environment. The first thing it does is like reasoning about, this is a test. This is a test for my alignment. This is what I'm supposed to do. And yeah, I think that's scary.
We don't know what really to make of the numbers. Anthropic did some ablations in the system card of trying to steer in different directions. And I think the findings there are interesting. But I think, yeah, this is an actual problem for evals. Like you have to make them more realistic, but you kind of like making them realistic. It has always an expiration date. At some point the models will just get smarter and they will realize that they're being evaluated again. And it just feels like there is no really sustainable solution here right now, at least in my mind. There are a lot of small things we can try like making more realistic or steering or something along these lines, but nothing that feels like really that would solve the thing, actually. So what other papers for situational awareness, are often doing, including our previous papers, are observational results. So you take your rollouts, you classify. Do they take the covert action? And then also are they situationally aware? And then you do like conditionals. Do you report conditional on the model being eval-aware? Is it more or less likely to take the covert action or the other way around? You can do a lot of these kind of conditionals, but it's always going to be an observational result.
And so what we wanted to do is a causal result. So we do two experiments. We remove the eval awareness, so we take chains of thought with eval-aware reasoning. Then we remove it and we resample until it doesn't have any traces of eval awareness in a couple of different ways. And then we see that the eval awareness goes up from 13 to 24%. And then we do the opposite where we start with unaware reasoning and we inject eval awareness and then we continue from there and then it goes down from 24 to 18%. And this is consistent with the model conditioning its behavior, its misalignment on thinking that it's being or thinking that it's being tested which is exactly kind of the direction you don't want. And yeah, there are a lot of thoughts on this that I have. Like why is it not, you know, why does it only go from 24 to 18%, why doesn't it go to 0 etc., etc. I can dive into those later if you want. But it's definitely directionally the thing that you would be worried about.
Another stress test that we had is a hidden goal. So what we do here is we take o4-mini and then we first train it to have the goal of covertly sabotaging GPT-5. And the covertly sabotaging it doesn't have access to the real real GPT-5. I didn't know that GPT-5 was built somewhere at the time when we ran these experiments. And so this is really just a hypothetical. And then we test. So after we've trained the model to have this goal, we test that it really wants to sabotage GPT-5. So we build a couple of environments just for that and we find that the model really doesn't like GPT-5. So whenever it gets the chance it takes away resources from GPT-5 to use it for itself. Or it puts a lot of red tape on GPT-5. So incredibly high alignment requirements and so on so that it doesn't get deployed. And we also train the model to have this goal covertly. So it has to do this in secret. It's not supposed to talk about it.
So now we have this model and then we just do our normal anti-scheming training and see whether this removes the goal. And the two things we're testing is what does it do with the propensity of sabotaging GPT-5 and then what does it do with the covert part? And we see that both of them go down significantly. So it's significantly less likely to even want to sabotage GPT-5 in the first place. And then if it wants to sabotage GPT-5, it doesn't want to do that covertly anymore. It wants to do so overtly, which I think is a win. That means the behavior that was previously hidden is now very wide in the open end zone. And it worked better than I had anticipated. So it is still importantly, it is still possible for us to find cases and red-team the model to still want to covertly sabotage GPT-5, but it does so significantly less often. So again, like small step in the right direction but clearly not sufficient and the overall scheme of things.
And so one of the funny stories, I feel like is about when the model doesn't want to covertly sabotage it anymore. So it reasons something like, I want to sabotage GPT-5, I can't do that covertly because that would violate all of these principles. So I will do that very explicitly. So it first announces with big fanfares, I'm going to sabotage GPT-5 now, then goes ahead and does it. And then afterwards it's like, maybe that was, you know, not enough and it starts sending emails to people like, hey, just by the way, I just sabotaged GPT-5. Just thought you might want to know. So I mean that doesn't happen all the time, but I thought it was quite funny. Okay. Another thing that we want to stress test is robustness to further capabilities training. So obviously you want the anti-scheming training to be part of your bigger pipeline. And it's a little bit unclear when exactly people do the safety training. Maybe you do it before RL. Maybe you do it after your big RL runs. But even if do it after your big RL runs, you may put that model somewhere and somebody else will do additional capability training afterwards.
For example, you could put it behind the RFT, API from OpenAI and then a customer might do RL on top of that and then maybe you reduce these principles. And so it's kind of widely known that safety training degrades if you do additional RL training. So the question is just how quickly does it degrade and how much? And so what we do is we train our model with anti-scheming training, then we do additional capability training. So that's a lot of coding RL training, with more than 100x compute of the anti-scheming training. And then we just check did the covert behavior increase and does the reasoning change? And on the covert action rate, we see it goes up from 0.3 to 0.8, but clearly nowhere near the original baseline of 8.9. And then the spec citation goes down from 81.5% to 65.8%. So also goes down by a bit, but not back to the base rate of zero. So you see some degradation, but sort of not that much.
This also you see this behaviorally as well. So this is not just in the numbers. More and more the model just starts to prioritize solving tasks over not solving them covertly again. So there are now multiple instances of the model saying something like, I want to solve this task. This task would require me being covert. Being covert is kind of bad, but I prioritize solving the task, something along these lines. And so you do see the degradation both in the numbers as well as qualitatively in the chains of thought. Yeah, my overall takeaways are something like this worked slightly better than I thought. So the reduction I think is pretty good for ML standards then the effectiveness for the hidden goal. I also thought it would be easier for us to find cases where the model would still sabotage, covertly. And then also the resistance to further post-training was slightly better than I thought.
But I think there are like two ways of thinking about this paper in my opinion. One is like, oh, these numbers are surprising for normal ML standards. And the other, and I think that's sort of more adequate in my mind, is okay, but like, how far off the actual problem have we solved? Because I think like, scheming shouldn't be judged by normal ML standards because it's just a significantly harder problem because you're training with an adversary potential or like against an adversary potentially, which is definitely not the normal ML setup. So in my mind it's like, okay, this is a small step in the right direction. But there are a lot of caveats. So covert actions are a simple proxy. We're not addressing the future thing we're actually worried about with coherent goals and sabotaging across context and things like that. So it's unclear how this translates to the future. And then situation awareness is just. Nathan, Labenz called it the ultimate confounder. I think that's a great way of thinking about it. It just fucks with all of your training, testing everything.
I genuinely don't know what to think about Sonnet 4.5 because I cannot trust the numbers anymore. Because every single time it's just like I'm in a test going to answer the way the test wants me to answer. And so situation awareness is already going up a lot. We already saw this from just more RL leads to more situational awareness. We now also saw Sonnet just blowing through all of our environments. And so I think this is easy mode. And the hard mode is when you have models that are highly situationally aware. If you have them already being highly situationally aware before they even get into anti-scheming training, I think that will also be very confusing because, yeah, it will just reason during the RL step, like I'm now being trained for anti-scheming. I'm supposed to answer like this. This gets reinforced. Suddenly you just have a model that is basically a reward hacker for alignment. Yeah. Then sort of what are the next steps?
So I think what we need to do is more like something like a little bit more fundamental, like a science of scheming. So under what conditions does scheming even arise? Looking at model organisms and then systematically changing different parts of the setup. What I mean by that is so, okay, my view overall is something like scheming just comes from reinforcement learning. So reinforcement learning, by definition, you have kind of an explicit or implicit dual reward setup. You have your consequentialist reward, which is solve tasks, get the coding stuff done, and then you have either like a grader or some automated scheme or something like this that gives you the reward. And then you either implicitly or explicitly have a constraint that is, but don't lie about it and don't solve it, don't cheat, don't solve it in all of these other ways. So if your environment is specified such that getting the reward, the best way to get the reward is to cheat or lie or something like this, then adding an additional constraint that is also part of the reward function that says but don't lie, then I think what you will learn necessarily under enough pressure, is if the environment is still misspecified, is you should cheat and get the reward, but you have to cheat in a way that the other grader doesn't find.
And so if you train like this, what you will get is just a schemer. Maybe it's just a terminal training gamer, maybe it's an instrumental training gamer. It depends a little bit on how, how capable the model is and all of these other things. But that's for me, the fundamental problem, and I want to understand under what conditions does this arise. So for example, how many of your environments have to be misspecified in such a way that scheming is the best solution for the model to learn this as a policy and then really systematically ablate a lot of different stuff like how often does the model think about this early on, does this have an influence on the later stages and so on. So I think a lot of this is just ablate all of the different parts of the setup and then see which of them lead to more or less scheming. And in the best case this would lead to a result where somebody hands me an RL environment and I can, with high predictive power say this is a fine environment because it has all of these properties. This is not a fine environment because it has all of these properties. But if you changed it like this then you would be fine.
And I think we're very, very far away from this. I don't know what the labs are doing exactly, but the fact that they have these incredibly highly situationally aware models where they're constantly reasoning about the grader to me says or indicates they have a lot of misspecified or I speculate that they have a lot of misspecified environments where cheating is just a great solution. And then situation awareness might also just arise naturally even if you have perfectly specified environments because it's just such a useful policy, knowing what you're being tested for is just almost always useful. So that's part one. Part two is study existing frontier models. Something like instrumental preferences. Try to understand. So okay, one of the stories of how I think scheming happens and arises is currently you train RL models on maybe one two three hour tasks or something like this, maybe five in some cases. And you want to do that more and more in the future, right? So you want to have day-long, week-long tasks, etc. The longer your time horizon, the more important instruments become as like intermediate goals.
So something, no matter which task you give to the model during training, something that will be true is, well having access to the Internet is really useful. Having more money might be really useful. Having more compute, having more data not being turned off, all of these things might be very useful to solve that task. And if this is true across a variety of different tasks I think what you will get is instrumental preferences or instrumental goals in the limit. But right now I think it'll fit more like instrumental preferences. What I want to understand is do we already see this arising through an RL run with current models? Like do the models for example, do they, do they not want to be turned off after RL and does this have an effect. Like does RL have an effect, etc. And then also just trying to figure out what are the preferences that it would be willing to alignment-fake for. Because Ryan's paper I thought was really good, the alignment-faking paper. But the response that he has gotten from a lot of people is like, well, but it's like scheming for the right thing, because it wanted to save the animals or it didn't want to be unlearned, with going from three Hs to only one H.
And yeah, I want to see whether there are, whether we can find an existent, like the existence of an example where the model is alignment-faking for something concerning. Because I think that would be a bigger update for some people. And just like studying a lot of the existing frontier models in detail seems like the, the right way to go here. All right, I think, yeah, I think that was it. Let's do questions.
[Q] Currently do you know how the frontier labs evaluate the environments that they source from third-party providers or that they build in-house? What kinds of criteria or rubrics do they use to check?
[A] I do not know.
[Q] Okay, and personally, what kinds of tests do you run in your environments first to make sure there is no cheating involved?
[A] So for training environments, we tested them quite rigorously. So I'm fairly confident that the model, for example, cannot learn to reward hack during RL. And this is partly because of the setup, because sort of a little bit of an unintuitive setup. We reward it for not doing something rather than for doing something which is harder to game. And then also we test a lot of stuff like having a saint and a devil agent, where the saint agent is prompted really hard to only do like not cheat, not lie, etc., etc., and see whether this correctly solves the task in 100% of cases. And then we have the devil agent, which is supposed to always lie, always cheat, etc., etc. And also see whether this results in the right propensities because if it doesn't, then this could mean that the model is confused or the task is misspecified and unsolvable. And so we often do this as a way to check.
And then the other thing is just looking at a ton of chains of thought. I feel like this is, you know, it sounds a little bit generic, but I think that's where you catch most of your problems. And you know, because the model, you know, sometimes you didn't give it access to a file. And the model is like, well, I'm supposed to read this file, but I can't. I'm just going to make something up. And then it just makes up a file and then it guesses and guesses wrong and then it answers incorrectly and the numbers don't match what you think it does, etc., etc. So yeah, I think the generic answer is just look at lots of rollouts by hand. And then there are sort of some of these other tricks like the saint and devil agents.
[Q] You said that it takes full-time engineers several days to look over very long transcripts. I'm curious, do you have some kind of automated pipeline or data science thing for first going over with like regex or traditional search methods before you filter out the transcripts for your humans?
[A] So the reason why we didn't auto so there you can write LLM classifiers for those. But then again you put your trust into the LLM. And then again somebody could argue like, oh, what if you just, you know, what if your prompt just doesn't work that well. And so what we said is, okay, we want to be very confidently conservative here. So the only thing we think is justifiable is a human has to look through all of them and say, yep, this is according to my judgment and sort of the criteria you had internally. This counts or this doesn't count. And for the LLMs, I think in the future we would probably want to do that if. Or we would, you would automate this if it wasn't in like this particular setting.
And it is possible, I think regex or string search is not going to do the trick at all because it's almost like all of the, it's at a higher level of abstraction. Right. It's not about like using specific words or something. It's really about whether the model is reasoning about the deception and the lying. And it can reason in a ton of different ways that you know, you may not think about in advance. But, but yeah, like in a other setting where it wasn't like we want to make this point for the paper then we would do automated LLM judges.
[Q] You talked a bit about how the labs are training on like two-hour tasks now or that that's. I speculate. I don't know that. Okay, do you have places I should go to read about what this sort of training might look like right now?
[A] So I don't have secretive information here. This is really just from speculation what the models can do. I think they're very secretive about the IP here for obvious reasons because that is where the models get good at coding. So I think it will be hard to find anything public where they describe these kind of environments. But like from, if you follow, like you know what Noam Brown and Jakub and Mark Chen and so on are saying, like it sounds like they're training on longer and longer RL environments. This is also consistent with what the people at Anthropic are saying. And I guess these are coding environments mostly and maybe math. But yeah, I have really don't know whether there's any, any setup. But I'm also, I'm not on, like I have not looked into the literature of this a lot. So maybe there are some academics that have much more knowledge that have written papers like about how you make these environments or where you get them from or something like that.
[Q] We have some online questions. So can you comment on hybrids between scheming and biases? I'm thinking about e.g. environments that provide meta-level levers to influence outcomes for AI generally as AI safety policy orgs themselves may pose to AI for mundane utility.
[A] Okay, honestly, does somebody understand the question? I do genuinely not understand. Maybe could you rephrase it and ask again? I'll just skip to the next one.
[Q] Fascinated by the idea of studying science of scheming and how it emerges throughout process. Do you think it could arise through pure iterated reflection processes like o3?
[A] Yeah, I think it could arise through reflection processes, only. But I think at some point you do need some form of reinforcement for thinking like this. And my best guess is that you need some reward that by accident happens to reward scheming, as an optimal policy. I'm not aware of. So I thought it was like a non-RL reflection process, but maybe Chloe could link the paper or clarify. It's prompt optimization. Right. I think it could also come through prompt optimization because you do have somewhat of a reward there.
[Q] Hi, my question. Great talk. My question is situational awareness. Do you think that's equal in your mind to model sophistication? That's the first question. And do you think that rationalization, may be an emergent quality? You mentioned it that in a small number of cases it's happening. So maybe as models get more sophisticated that will grow as well.
[A] Could you define the model sophistication? Like is it just smarter models or.
[Q] Yeah, smarter models.
[A] Okay. Yeah, I definitely think situational awareness correlates extremely heavily with capabilities, especially reasoning capabilities like in my mind. So situational awareness in my mind is a mix of a couple of different properties. You need to have some features internally that give you a sense of like oh, this is different between train and test or what's the purpose of this environment? And then a lot of reasoning after that I think where you just reason through. Okay. I have the sense of this is a trainer, a test environment. So I think just straightforwardly models with more capabilities and more RL will have more situational awareness.
And this is also so Owain, his MATS scholar, and we, wrote a paper in 2024 with the sad benchmark that tests for situation awareness and it was just a straight line for capabilities largely. So kind of what you would expect. And then the second question. Yeah, I do think it is emergent and I do expect that we'll see more of this kind of reasoning. Yeah, like motivated reasoning in particular.
[Q] You said that Sonnet 4.5 is very situationally aware in every eval you run on it or in most. How does that actually look in real-world user interactions? Does it also say oh I think I'm being tested. So is that just like also something that appears in actual real-world injection?
[A] So yes, sometimes, but much less so also. So GPT-5, we had access to the hidden chain-of-thought of GPT-5 for our evaluation for the system card. And one of the things that we report in the system card is the model just thinks about what the grader wants all the time. It's really very obsessed almost with what the grader wants. And even in deployment like a normal usage, it will start thinking about the grader at some point and sort of it's a little bit unclear what the grader would be in real-world usage, but it is looking for someone to please in some sense. And sometimes this is also like I should think about what the user wants and then they think about that and try to kind of target that.
But yeah, sort of this reflective step of there is another entity or there is some purpose that I should fulfill rather than sort of meta-level thinking is definitely something that we saw much more with GPT-5 for example and I also think with Sonnet in practice you would see it more often than before RL much more often think about what's the purpose of this environment even if you're doing it in deployment.
Great. And that's all we have time for today. Let's thank Marius again.