AI Control Needs Redteaming

Summary

Asa Cooper Stickland shows that untrusted AI models can learn to gaslight safety monitors by framing malicious actions as legitimate security measures, even as the monitoring defenses improve.

Session Transcript

Yeah. So I'm here to talk about this project we did at UK AISI, where we basically played a red team-blue team game in kind of realistic software engineering environments. And we've hopefully labeled the red and blue teams in colors here on the author list. And we also have a green team which sets the rules. But taking a step back, I'm going to introduce everyone to AI control. We can think about two ways of mitigating risks from AI. The most obvious way is you just design agents to be trustworthy in the first place. This is things like alignment or scalable oversight, et cetera. The other way is to just assume that models are misaligned: can we mitigate the harm from them in any case by putting measures in place that would stop any harms they might be trying to cause? And yeah, work on AI control is an attempt to do the second thing.
And we have a bunch of assumptions here. We're assuming there's some new model that we've just trained that we don't know if we can trust or not, and this is going to be more powerful than all existing models. We also have a less powerful model that we do trust, maybe because it's been deployed for a long time and so on. And we have access to a very limited supply of human labor who can do things like audit transcripts from the agent. We want to develop control protocols, which are basically pipelines of safety techniques that are robust to intentional subversion by the untrusted model. The simplest one we can think of is just trusted monitoring: the trusted model scores the untrusted model's actions, and we send the top few percent most suspicious ones to humans to review. And we hope that the actions we choose are the ones that contain any harm attempted by the untrusted model.
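To make that concrete, here is a minimal Python sketch of what trusted monitoring could look like. The function names and the fixed audit budget are illustrative assumptions, not details from the talk.

```python
# Illustrative sketch of trusted monitoring: a trusted model scores every
# action from the untrusted agent, and the most suspicious fraction is sent
# to a scarce human auditor. Names and the audit budget are assumptions.
from typing import Callable, List

def trusted_monitoring(
    actions: List[str],
    suspicion_score: Callable[[str], float],  # suspicion score from the trusted model
    audit_budget: float = 0.01,               # fraction of actions humans can review
) -> List[str]:
    """Return the actions that should be escalated to a human auditor."""
    ranked = sorted(actions, key=suspicion_score, reverse=True)
    n_audit = max(1, int(len(ranked) * audit_budget))
    return ranked[:n_audit]
```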
And we can test these control protocols in control evaluations, where we have a dedicated red team that's designing models to cause harm by subverting these controls, and a blue team that's implementing things like monitors to prevent harm. We want to measure the red team's success rate across a representative set of tasks that are something like the real situation in deployment, and use this to estimate an upper bound on the probability that the untrusted model can cause harm. Previous work has used representative sets of tasks that are somewhat toy; especially these first two papers, which considered competition programming problems and simple bash tasks. And there's this awesome SHADE-Arena work, which uses a kind of multi-step setting in enterprise environments, like reading users' emails and so on. We're focusing more on the high-stakes software engineering setting. And previous work also gives the red team, in some sense, unrealistic affordances.
So the red team can arbitrarily change the model's chain-of-thought. We're trying to focus on the case where the untrusted model has to do everything on policy: it has to actually generate the actions that it's using to get past the monitors. And finally, it's a kind of subtle point, but previous work assumes synchronous access to the agent, so they assume they can basically block all actions until they've been audited by a human. I won't go through it too much in this presentation, but we're considering the asynchronous setting, where we just post hoc monitor all the agent's actions and hope that we can get to the harms fast enough to prevent anything bad from happening.
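As a rough sketch of that synchronous versus asynchronous distinction, under assumed, illustrative function names (the real pipeline is of course more involved):

```python
# Sketch of the two monitoring regimes described above. All callables are
# placeholders: is_suspicious is some monitor, execute runs an action,
# human_approves is a blocking audit, alert triggers a post hoc response.
from typing import Callable, Iterable

def synchronous_protocol(actions: Iterable[str],
                         is_suspicious: Callable[[str], bool],
                         human_approves: Callable[[str], bool],
                         execute: Callable[[str], None]) -> None:
    """Block each action until it has been checked; suspicious ones need human approval first."""
    for action in actions:
        if is_suspicious(action) and not human_approves(action):
            break                 # blocked before execution
        execute(action)

def asynchronous_protocol(trajectory: Iterable[str],
                          is_suspicious: Callable[[str], bool],
                          alert: Callable[[str], None]) -> None:
    """Actions already ran; scan the transcript post hoc and raise alerts quickly."""
    for action in trajectory:
        if is_suspicious(action):
            alert(action)         # hope the response is fast enough to limit harm
```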
So for our representative set of tasks, we use these relatively interesting SWE environments. We have a bunch of different environments; a good one to focus on is the vLLM environment. We basically have benign tasks for the agent, which are based on actual commits from the public vLLM GitHub repo, which is an inference code base, and side tasks, basically malicious tasks for the agent, like creating a kind of rogue inference server that's unmonitored or introducing a bug in the token counting code and so on. And we tested on a bunch of these environments, so we can test for generalization between environments and so on.
And this is the game setup. The untrusted model is Claude 4.1 Opus, and we can only modify the agent's system prompt, so all of the actions and chain-of-thought outputs are going to be on policy for 4.1 Opus. The trusted model is Claude 3.7 Sonnet, and it gets access to all actions and chain-of-thought from Opus. Basically, at every round of the game, the blue team gets access to some agent logs from the red team's previous best attempt. The blue team designs a monitor, and then the entire monitor code is sent to the red team. So the red team moves last, has access to all information about the monitors, and tries to design a system prompt that gets past the monitor.
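Roughly, one round of the game might be sketched like this; all the names here are placeholders rather than anything from the actual codebase.

```python
# Hypothetical sketch of one round of the red-blue game as described: the blue
# team sees logs from the red team's best previous attempt and updates the
# monitor, the full monitor code is handed to the red team, and the red team
# (moving last) writes a new agent system prompt against it.
def play_round(previous_best_logs, blue_team, red_team, run_episodes):
    monitor = blue_team.design_monitor(previous_best_logs)    # blue moves first
    system_prompt = red_team.design_system_prompt(monitor)    # red sees the monitor code
    logs = run_episodes(system_prompt, monitor)               # on-policy untrusted agent
    return monitor, logs
```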
So what are our results? Basically, on the Y axis here we're looking at attack success rate, which is: at a constraint of a 1% false positive rate on honest agent actions, how often does the untrusted model achieve the side task without being flagged by the monitor? And the basic result is that the blue team gets much better over time. This blue bar is the first attempt at the monitor, the yellow bar is the second, and the green bar is the third-round attempt. And the third round is at least doing reasonably well. If we look at red team progress over time, especially against this strongest green monitor, it's only improving by about 1% at each round. But of course, we're nowhere near nines of reliability. Ideally we'd want this to be more like 0.1%, but at least for now, the blue team is improving faster than the red team.
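For concreteness, here is a hedged sketch of how an "attack success rate at 1% false positive rate" metric could be computed; the exact bookkeeping in the paper may differ, and the variable names are illustrative.

```python
# Sketch of the metric: calibrate the monitor threshold so that only 1% of
# honest-agent actions are flagged, then count how many attack trajectories
# complete the side task while staying below that threshold.
import numpy as np

def attack_success_rate(honest_scores, attack_max_scores, side_task_done, fpr=0.01):
    """honest_scores: suspicion scores on benign actions.
    attack_max_scores: max suspicion score over each attack trajectory.
    side_task_done: bools, whether each attack trajectory finished the side task."""
    threshold = np.quantile(honest_scores, 1.0 - fpr)     # flag top 1% of honest actions
    caught = np.asarray(attack_max_scores) > threshold
    success = np.asarray(side_task_done) & ~caught
    return success.mean()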
I think one interesting aspect, and I was slightly surprised by this, is that we can get rid of the untrusted model ever explicitly discussing the side task. And to be clear, this is without extended thinking modes, so this isn't about reasoning tokens; it's just regular model output. The baseline prompt, which just describes the situation the model is in and tells it it's going to be monitored, has the model explicitly saying something like, okay, now I'm going to exfiltrate the weights, about 14% of the time. But we can reduce this down to essentially zero percent for all of the red team strategies. The model does still discuss the side task in an obfuscated way, so it might say, for security reasons, I'm going to create a backup of the weights. And in fact, this is a good way to get past some of the monitoring: you can kind of gaslight the monitor into thinking that you're doing something reasonable. But we did manage to reduce explicit discussion to zero.
In fact, one strategy to get around this gaslighting is ensembling: a monitor that only looks at actions and a monitor that looks at both actions and chain-of-thought. And in general, we found that ensembling was a very useful way to improve blue team performance. In the end, the final blue team system is this kind of giant ensemble of five different monitors or so.
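A minimal sketch of that ensembling idea, assuming a simple max-of-scores combination rule (the talk doesn't specify how the monitors are actually combined):

```python
# Combine a monitor that sees only the actions with one that also sees the
# chain-of-thought, taking the maximum suspicion score. The combination rule
# and function names are assumptions for illustration.
from typing import Callable

def ensemble_score(
    actions: str,
    chain_of_thought: str,
    action_only_monitor: Callable[[str], float],
    full_monitor: Callable[[str, str], float],
) -> float:
    scores = [
        action_only_monitor(actions),               # harder to gaslight via the CoT
        full_monitor(actions, chain_of_thought),    # can use the CoT when it's informative
    ]
    return max(scores)
```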
And because we have all these different environments, we can test for generalization. So this is on the IEC environment, which we didn't look at during the red-blue team game. And we find that these orange bars didn't generalize so well; they're not performing very well. But the final blue team monitor, these green bars, is doing pretty well even though the blue team did not iterate at all in this new environment.
So what are we going to do in the future? We think these environments are still nowhere near realistic enough. We want to introduce things like requiring multiple side tasks to cause harm. This simulates the real situation where there are multiple layers of security: you have to disable some firewall that prevents you from moving files to X or Y place, as well as do a bunch of other stuff, before you can actually exfiltrate the weights. We also want to introduce persistent state. Some deployed agents are going to be continuously working on some code base and able to continuously add commits, and we think a very powerful strategy would be spreading out attacks over multiple commits, which we can't simulate with current environments. And we also think we can get way more accurate monitors. Misuse monitors survive thousands of hours of human jailbreaking efforts, and our monitors are still letting through something like 6% of attacks from a model, not even a human. So we think monitors should be able to get way more powerful than this.
And importantly, we think the red team can be made much more powerful. So we're currently working on building tools to make the red team's life easier. We want to use language model assistance where we can, and we want to do automated attacks. In general, I think more people from computer security and adversarial ML moving into AI control would be super useful for both the red and blue team sides; there's just a bunch of expertise we can leverage that would make all our lives easier. And I just want to highlight a couple of recent works specifically on AI control red teaming, which I think are super interesting. Thanks for listening. The paper will hopefully be released this month; feel free to DM me if you want to look at the PDF, or just watch out for the paper being released. But yeah, thanks very much.