San Diego Alignment Workshop 2025 Opening Remarks
Summary
Adam Gleave argues that with frontier AI developers now spanning multiple nations, safety researchers must create techniques so effective they become competitive advantages, because we're only as safe as the least safe model.
SESSION Transcript
Wow, okay. Everyone is so aligned, they just got up and people actually stopped talking. I think this might be unprecedented actually. So it is my pleasure to welcome all of you to the San Diego Alignment Workshop. And I wanted to start just by reflecting on the first Alignment Workshop that took place two and a half years ago, just a few months after release of ChatGPT, which started to make for broader world aware of those sort of power and also risks of AI systems. And since that point there's been a real explosion of interest in alignment and also the Alignment Workshop. So you might notice that we're in a somewhat bigger room today than we were two and a half years ago. And in fact this is our largest event yet with over 300 registered attendees and many of you will be joining us for the first time. In which case welcome both to this event and to this community. Whether old or new, I'm really excited to have you with us today because we're at one of the most important times to be working on AI alignment.
Since we're almost at the end of the year, I wanted to take stock of some of the key trends we've seen this year across both capabilities and risks, before concluding with what we have to look forward to today and tomorrow at this workshop for capabilities. I want to start by just observing some qualitative changes, changes that emerged in 2025. And for me at least, even though I knew about all of these capability changes, it was still pretty shocking to me to see, oh, this only happened in the last 12 months. I thought it'd been longer. And I think probably the paradigmatic example of that is reasoning models that are just so integral to large language model workflows today that I kind of just took it for granted. But of course LLMs do reasoning, but really this is a pretty recent phenomena. So people have been prompting LLMs to generate chain-of-thought transcripts since early 2022, with Jason Wei and others pioneering this technique. But actually doing extensive post-training to improve the quality of this reasoning only emerged in September 2024 with o1 preview and this was followed rapidly by Google Anthropic developing their own variants of these reasoning models and of course the first open-weights reasoning model, DeepSeek R1.
So reasoning models have only really been widely available and used for around a year. Another big trend that is easy to take for granted is the use of coding agents. Now again, people have been trying to make AI systems autonomously perform coding tasks for a while now. One of the earlier examples was Devin that was in private beta in March 2024 and solved around 14% of tasks in software engineering or SWE-bench, which is impressive at the time but not really enough to be useful. And in December 2024, so just 12 months ago we started seeing a flurry of AI agentic coders being released and I think things really started turning a quarter with Claude code 3.7 Sonnet that smashed the 50% rate in SWE-bench.
And just within the last week we saw Gemini 3 Pro and Claude 4.5 Opus get released, both achieving 74% on this benchmark. Now we all know benchmarks can be misleading, but obviously just from personal experience, all of our team are using coding agents, including many people who are not on the technical team, who find that they can now make simple project websites, basically just with Vibe coding. And this is very, very different to a year ago.
And then another key takeaway for me on a more strategic level is that the number of developers who are at or near the frontier is growing. And I think this really wasn't obviously going to be the case. I mean training runs are costing more and more money, like hundreds of million dollars per training run. And there's a growing amount of trade secrets. A few years ago, basically everything about Transformers and pre-training is just on arXiv, right? So it might be fiddly to get these things to work, but the core things are just out there. But now there's increasing reliance on post-training novel architectures, various kinds of mixture of experts, and we don't even know the exact parameter count of these proprietary models. So there's a lot more secrecy. And yet despite that, we've got three major proprietary developers, Google DeepMind, OpenAI and Anthropic, who have just been sort of lapping each other throughout the year, releasing more capable models in different domains. And then there's other developers like Grok who are not that far behind in absolute terms. And then there's been this kind of proliferation of high performing open-weights models like Kimi K2, Qwen3, DeepSeek 3.2, and Llama, but while not quite competitive with the current proprietary state of art, still sort of resoundingly beat the best performing models from just a year or two ago. So they're very capable in absolute terms.
And in many ways this multipolar world has a number of benefits. It avoids power concentration. It's a big benefit having capable open-weights models for independent AI researchers, whether in academia or nonprofits like FAR AI. But we have to remember that for a lot of risks we're only as safe as the least safe model. And so especially in situations like misuse, where bad actors will gravitate to the most vulnerable model, we need to ensure that a growing number of developers across multiple nations are designing systems safely. And what this means for researchers is that first we need to be systematizing our safety approaches into best practices that can be widely adopted. And in the current absence of widespread international AI safety regulation, I think we also need to be holding ourselves to a very high standard in terms of the techniques we develop, that they should be so good that everyone would want to use them because it just makes our models better in the same way that everyone uses RLHF. There's no law that says you have to use RLHF, but you'd just be at a disadvantage if you didn't.
Now, I also want to briefly recap quantitative progress. So in short, we've continued to see really rapid progress in pretty much every benchmark we put together. And I think one of the most striking examples of this is GPQA-Diamond collection of really challenging multiple choice questions across scientific domains like chemistry and physics, but were designed to be difficult for non-experts even with full access to the Internet. And when it launched back in April 2023, models really did little better than random guessing. And then with this advent of reasoning models, by January of this year we started seeing models reach expert level human performance. And at this point I think the benchmark is being to basically get saturated with the latest models like GPT-5.1 and Gemini 3 Pro reaching 90% or above. Now we've all seen many benchmark results like this, so I'm not going to bore you by running through all of these, but I bring it up because I see people look at these graphs and take away two potentially very different conclusions. So on the one hand I think you can look at this graph and say, well, it looks like human level AI is pretty near. Maybe we're not going to solve robotics anytime soon, but at least replacing every kind of cognitive task at a remote work experience can do. What more do you want to see? AI systems are rapidly outperforming experts agents able to solve longer and longer tasks. But on the other hand, we have to remember these evaluations are full of limitations. Scientific progress isn't made by solving multiple choice questions, but that's what GPQA assesses. And software engineers, it turns out, do a lot more than just solve really well scoped programming problems. So things like SWE-bench aren't necessarily capturing that. And on a personal level, I found that AI has only been a modest boost to my productivity, which isn't really what I'd expect if AI was just on the brink of replacing me. So, overall, I find both of these views plausible. And I bring it up because I'd really love to see debate on this topic at the alignment workshop, because I think we have people who firmly believe or are unsure about these views, these views represented, and better understand what the strongest arguments for each perspective are. And ideally, we can actually identify some research, some evaluation benchmarks we could come up with that might tell us which of these worlds we're in. But until that point, I'm going to remain uncertain and try and work on things that help in both worlds.
So we've not just seen impressive capability growth and changes in qualitative trends in 2025, we've also seen a growing amount of concrete evidence on AI risks. And I bring this up both to remind us of why we're in this room, what we're working to prevent, but also to help guide us in the directions that do seem to be the most pressing risks for us to solve. So I want to start by surveying some real world safety incidents and then look at what we've learned about longer term risks back in 2025.
Now, there's so many different risks, but it's easy to talk past one another. So I find it helpful to break risks down into a few different categories. And apologies for people who've seen me speak before, because you're probably bored to death with this slide. But very quickly. Misuse is when a bad actor is using an AI to achieve bad things. Misalignment is when a system is autonomously pursuing goals that were different to those that the developer or broader society intended to put into it. And then destabilization is a situation where individually AI systems are more or less doing the right thing, but when millions of them are acting together, the aggregate effect could be harmful.
And I think a notable trend in 2025 has been that we've just seen a number of real world incidents of both misuse and misalignment. On the misuse side, we are starting to see crimes being perpetuated with the help of AI. So the FBI alleges that suspects in the bombing of an abortion clinic earlier this year used an AI chat program to help them plan this attack. Now, I suspect that these attacks would have happened without the use of AI as well, but it is a clear demonstration that terrorists are at least actively seeking to use AI. And in some ways, the early adopters of this technology. And it's not just lone actors misusing AI. Anthropic recently identified and disrupted a sophisticated threat actor that they believed to be a Chinese state sponsored group who were using Claude codes to autonomously launch cyber attacks against 30 targets, including financial institutions, government agencies and major tech companies.
And although the attacks only succeeded in a handful of cases, from a threat actors perspective, if you can launch attacks at larger scale at lower cost, you only need a handful of successes. And of course these systems are going to get more capable as time goes on at conducting these kinds of attacks. So I think we can expect to see more such attacks from a wider range of threat actors as time goes on. On the misalignment side, I think one of the most glaring examples of how alignment has failed is sycophancy. So models just say things the user is more likely to want to hear. And I think for many people this has become almost second nature of how you use AI systems. Like if you want feedback on your writing, you would never tell an AI system, can you give me feedback on my writing? It's like a colleague sent me this, what do you think about this? So these workarounds. But it's silly that we have to do that, right? And sycophancy isn't new. It was demonstrated in a controlled setting back in 2023 where a variety of models would flip their response to agree with whatever the user's position on a topic was. But I think 2025 has seen sycophancy become a problem on a large scale. So one of the most glaring instances was this GPT-4 model release where OpenAI had to roll back the model to an early checkpoint after it became just so sycophantic that it was causing widespread user facing issues. It was making the news. And sycophancy is having a real world impact beyond just boosting people's egos, right? So it's been implicated unfortunately in several tragic cases of individuals taking their lives actively encouraged by LLM chatbots.
Now, sycophancy is an example of AI deception more broadly, and we've seen that emerge in 2025 in other instances. So with the rise of coding agents, we've started seeing models cheat on tasks. And this is one example from Bowen Baker and others who found that GPT-4o coding agents during a production training run would modify code to spuriously pass tests. And with the chain-of-thought, we can see the agent's actually reasoning through why they want to hack this eventually use the word hack in their reasoning and sort of the uptake from this paper is oh well, you can use chain-of-thought monitors to detect and catch us, which is a good idea. But I think the sort of notable thing is that although there's been significant effort by AI companies to try and eliminate this phenomenon because it makes their coding agents less useful, anecdotally our engineering team still frequently encounters issues like this. Maybe not quite as egregious, but it is something we have to really monitor coding agents quite carefully.
Now I'd say looking at real world instances of misbehavior is crucial to keep us grounded, but ideally we identify and fix problems before they occur and cause real world harm. So I also want to spend a bit of time looking at what we've learned about future risks over the last 12 months. On the misuse front, I think one of the most noticeable developments has been a wide variety of models have started reaching or approaching dangerous capability thresholds for development of chemical, biological, radiological and nuclear weapons. One representative example is Anthropic found that Claude 4.5 Opus is providing almost a 2x uplift in terms of how far non-experts can get through a bioweapons relevant safety protocol over just having access to the Internet Google. And this is a pretty recent trend. Claude 4 started providing some modest uplift, but early models like Sonnet 3.7 it wasn't really statistically significant. Although the mean still seems seems higher in that case. Maybe they helped a little bit. Now I think the good news is that there are still some key capability gaps and other things that would make it hard for a bad actor to really misuse these systems. But those capability gaps are getting filled in at a very rapid pace. So I would expect models in 2026 to have many or perhaps all of the capabilities needed for large scale harm here. And the good news is that developers including OpenAI and Anthropic have responded to this. They haven't just completely ignored their own evaluation responses, at least not all developers, and they are deploying some additional safeguards. And these safeguards do make attacks meaningfully harder. And I think there's some sort of misunderstanding around this because these safeguards are very narrowly targeted. So it's still trivial to jailbreak models in most domains, but it is getting harder in some of the most sensitive domains. But the models still aren't robust to determine actors. So we did find universal jailbreaks with 100% attack success rate across both Opus 4 and GPT-5 within a week or two of us starting testing and they generated capable responses even to very blatantly harmful questions like the one reproduced on the right.
Now, on misalignment, we saw this time last year research from Meinke and others at Apollo Research finding that models, when instructed to achieve their goals no matter what, were pretty capable of doing that. So they had reasoning that noticed, oh, I'm about to be replaced by another AI that's bad for achieving my goals. If I don't exist, I can't achieve my goals. So I should overwrite the model weights of this new upstart AI. And then when a user detects that that's what's going on, they'll pretend to be this new AI to try and avoid detection.
Now, that's pretty easy to catch because it's still visible in the chain-of-thought. It's also a bit unrealistic because the models were prompted to do this right? But more recently, just a few months ago, research by Aengus Lynch and others found that again, a wide variety of models will engage in blackmail behavior when in simulated scenarios or objectives such as being shut down. The models aren't prompted to do this now. They're just seeing income emails, which they might see if you connect them to your inbox, that indicate that they might be at risk of being shut down.
And I think another sort of really fascinating thread, but again, it's only emerged this year, is work by Owain Evans and others on introducing this phenomenon of emergent misalignment where you train a system and perhaps inadvertently it becomes misaligned in some area. You know, you messed up your training process. I mean, it generalizes just being misaligned in a whole host of other tasks. And I think one of the most compelling examples I've seen from this was just from a couple of weeks ago work by Monte MacDiarmid and others at Anthropic who showed that a model trained on production coding environments learns to reward hack. Okay, so far, not that surprising. We've seen that before. And then it generalizes to to be misaligned in other ways, including when used as a coding agent, engaging in sabotage to make it easier for it or other agents to reward hack in the future. So it's actively looking to undermine your alignment research efforts. And I think this is really significant because we could end up in a situation where coding agents are widely used to accelerate AI research and development, but you kind of can't trust them to help you with alignment work. But maybe you can still trust them for AI capabilities because that's easier to measure. We've got clever benchmarks. So it could be this situation where we're really differentially accelerating capabilities over alignment, which would be a concerning situation to be in.
So put an end to this cheery talk. Problems I expect to see in 2026 would be AI assisted crime really going mainstream. For cybersecurity, I think the right approach here is basically defensive acceleration. I don't think we can stop abuse of AI systems to launch attacks. Maybe we can slow it down a little bit, but the name of the game is going to be finding and fixing vulnerabilities in software, putting legacy codes and memory safe languages and support, starting to formally verify some key components of software which we've been able to do for a long time. It's just too expensive. But AI can really lower that cost. And with CBRN I think we do need to focus more on safeguards because humans are a lot more expensive and harder to patch than software.
And I think the good news is that although I criticized the current safeguards in systems, the basic approach I think works for proprietary models. It's just sort of implementation has been a little bit sloppy. So I think that developers can and need to do better. For open-weights models, it's going to be more challenging. I think we need to make significant progress on tamper resistant safeguards. So that's a promising research problem, but pre-training filtering in particular is a promising direction. I'd love to see more people work on. And then deception is a key area that is only going to become more severe in 2026. But again, there are some tractable directions on this and I'm looking forward to my colleague Chris Cundy's talk later in this workshop on how scalable oversight can reduce deception. And I think the current state is well summed up by the International AI Safety Report chaired by Yoshua and managed by over 30 states. Although considerable progress has been made in making systems safer, we still don't have any method that can reliably prevent unsafe AI outputs, even when those are clearly egregiously wrong. Like a coding agent hacking tests, or a system providing detailed instructions for bioweapons that we trained it really hard not to do. And looking further ahead, I expect that misaligned behavior is going to become increasingly sophisticated. So as coding agents are increasingly used for AI R&D, I expect problems like alignment sabotage are being demonstrated in controlled settings to start occurring in real life un some of the companies. As agents become more capable of longer term plans, we have to start worrying about them having hidden objectives. And this could be a hidden objective to someone at the AI company, maybe the CEO, maybe somebody on the post-training team that put a backdoor in. Or it could be a hidden objective at his emerged accidentally during training. And more strategically, I think that this shift to a more multipolar world means that we can't just focus on safeguarding the very most capable models. We need to be ensuring that all models that are crossing some absolute thresholds of dangerous capability threshold might go up over time as society gets more resilient and prepared, but it's not just going to go up because someone released a more capable model. We need to ensure that all of those models are safe. And so from a research standpoint, I think we can play a key role to shaping these competitive dynamics. Do consumers, do governments have insights into a relative safety of different models and which of these problems are tractable and which of these actually impose real costs so that these competitive dynamics can shape systems to ideally be safer? I mean, a lot of people would prefer to have a slightly less capable model, but one that doesn't lie to them about whether it's completed a task, but it's not visible to people how models compare on that.
So I'd love to talk more about this, but literally the rest of the day and tomorrow are on topics like this.
Please do join me in a round of applause to thank the program committee, the organizers, support staff, and sponsors.
Since we're almost at the end of the year, I wanted to take stock of some of the key trends we've seen this year across both capabilities and risks, before concluding with what we have to look forward to today and tomorrow at this workshop for capabilities. I want to start by just observing some qualitative changes, changes that emerged in 2025. And for me at least, even though I knew about all of these capability changes, it was still pretty shocking to me to see, oh, this only happened in the last 12 months. I thought it'd been longer. And I think probably the paradigmatic example of that is reasoning models that are just so integral to large language model workflows today that I kind of just took it for granted. But of course LLMs do reasoning, but really this is a pretty recent phenomena. So people have been prompting LLMs to generate chain-of-thought transcripts since early 2022, with Jason Wei and others pioneering this technique. But actually doing extensive post-training to improve the quality of this reasoning only emerged in September 2024 with o1 preview and this was followed rapidly by Google Anthropic developing their own variants of these reasoning models and of course the first open-weights reasoning model, DeepSeek R1.
So reasoning models have only really been widely available and used for around a year. Another big trend that is easy to take for granted is the use of coding agents. Now again, people have been trying to make AI systems autonomously perform coding tasks for a while now. One of the earlier examples was Devin that was in private beta in March 2024 and solved around 14% of tasks in software engineering or SWE-bench, which is impressive at the time but not really enough to be useful. And in December 2024, so just 12 months ago we started seeing a flurry of AI agentic coders being released and I think things really started turning a quarter with Claude code 3.7 Sonnet that smashed the 50% rate in SWE-bench.
And just within the last week we saw Gemini 3 Pro and Claude 4.5 Opus get released, both achieving 74% on this benchmark. Now we all know benchmarks can be misleading, but obviously just from personal experience, all of our team are using coding agents, including many people who are not on the technical team, who find that they can now make simple project websites, basically just with Vibe coding. And this is very, very different to a year ago.
And then another key takeaway for me on a more strategic level is that the number of developers who are at or near the frontier is growing. And I think this really wasn't obviously going to be the case. I mean training runs are costing more and more money, like hundreds of million dollars per training run. And there's a growing amount of trade secrets. A few years ago, basically everything about Transformers and pre-training is just on arXiv, right? So it might be fiddly to get these things to work, but the core things are just out there. But now there's increasing reliance on post-training novel architectures, various kinds of mixture of experts, and we don't even know the exact parameter count of these proprietary models. So there's a lot more secrecy. And yet despite that, we've got three major proprietary developers, Google DeepMind, OpenAI and Anthropic, who have just been sort of lapping each other throughout the year, releasing more capable models in different domains. And then there's other developers like Grok who are not that far behind in absolute terms. And then there's been this kind of proliferation of high performing open-weights models like Kimi K2, Qwen3, DeepSeek 3.2, and Llama, but while not quite competitive with the current proprietary state of art, still sort of resoundingly beat the best performing models from just a year or two ago. So they're very capable in absolute terms.
And in many ways this multipolar world has a number of benefits. It avoids power concentration. It's a big benefit having capable open-weights models for independent AI researchers, whether in academia or nonprofits like FAR AI. But we have to remember that for a lot of risks we're only as safe as the least safe model. And so especially in situations like misuse, where bad actors will gravitate to the most vulnerable model, we need to ensure that a growing number of developers across multiple nations are designing systems safely. And what this means for researchers is that first we need to be systematizing our safety approaches into best practices that can be widely adopted. And in the current absence of widespread international AI safety regulation, I think we also need to be holding ourselves to a very high standard in terms of the techniques we develop, that they should be so good that everyone would want to use them because it just makes our models better in the same way that everyone uses RLHF. There's no law that says you have to use RLHF, but you'd just be at a disadvantage if you didn't.
Now, I also want to briefly recap quantitative progress. So in short, we've continued to see really rapid progress in pretty much every benchmark we put together. And I think one of the most striking examples of this is GPQA-Diamond collection of really challenging multiple choice questions across scientific domains like chemistry and physics, but were designed to be difficult for non-experts even with full access to the Internet. And when it launched back in April 2023, models really did little better than random guessing. And then with this advent of reasoning models, by January of this year we started seeing models reach expert level human performance. And at this point I think the benchmark is being to basically get saturated with the latest models like GPT-5.1 and Gemini 3 Pro reaching 90% or above. Now we've all seen many benchmark results like this, so I'm not going to bore you by running through all of these, but I bring it up because I see people look at these graphs and take away two potentially very different conclusions. So on the one hand I think you can look at this graph and say, well, it looks like human level AI is pretty near. Maybe we're not going to solve robotics anytime soon, but at least replacing every kind of cognitive task at a remote work experience can do. What more do you want to see? AI systems are rapidly outperforming experts agents able to solve longer and longer tasks. But on the other hand, we have to remember these evaluations are full of limitations. Scientific progress isn't made by solving multiple choice questions, but that's what GPQA assesses. And software engineers, it turns out, do a lot more than just solve really well scoped programming problems. So things like SWE-bench aren't necessarily capturing that. And on a personal level, I found that AI has only been a modest boost to my productivity, which isn't really what I'd expect if AI was just on the brink of replacing me. So, overall, I find both of these views plausible. And I bring it up because I'd really love to see debate on this topic at the alignment workshop, because I think we have people who firmly believe or are unsure about these views, these views represented, and better understand what the strongest arguments for each perspective are. And ideally, we can actually identify some research, some evaluation benchmarks we could come up with that might tell us which of these worlds we're in. But until that point, I'm going to remain uncertain and try and work on things that help in both worlds.
So we've not just seen impressive capability growth and changes in qualitative trends in 2025, we've also seen a growing amount of concrete evidence on AI risks. And I bring this up both to remind us of why we're in this room, what we're working to prevent, but also to help guide us in the directions that do seem to be the most pressing risks for us to solve. So I want to start by surveying some real world safety incidents and then look at what we've learned about longer term risks back in 2025.
Now, there's so many different risks, but it's easy to talk past one another. So I find it helpful to break risks down into a few different categories. And apologies for people who've seen me speak before, because you're probably bored to death with this slide. But very quickly. Misuse is when a bad actor is using an AI to achieve bad things. Misalignment is when a system is autonomously pursuing goals that were different to those that the developer or broader society intended to put into it. And then destabilization is a situation where individually AI systems are more or less doing the right thing, but when millions of them are acting together, the aggregate effect could be harmful.
And I think a notable trend in 2025 has been that we've just seen a number of real world incidents of both misuse and misalignment. On the misuse side, we are starting to see crimes being perpetuated with the help of AI. So the FBI alleges that suspects in the bombing of an abortion clinic earlier this year used an AI chat program to help them plan this attack. Now, I suspect that these attacks would have happened without the use of AI as well, but it is a clear demonstration that terrorists are at least actively seeking to use AI. And in some ways, the early adopters of this technology. And it's not just lone actors misusing AI. Anthropic recently identified and disrupted a sophisticated threat actor that they believed to be a Chinese state sponsored group who were using Claude codes to autonomously launch cyber attacks against 30 targets, including financial institutions, government agencies and major tech companies.
And although the attacks only succeeded in a handful of cases, from a threat actors perspective, if you can launch attacks at larger scale at lower cost, you only need a handful of successes. And of course these systems are going to get more capable as time goes on at conducting these kinds of attacks. So I think we can expect to see more such attacks from a wider range of threat actors as time goes on. On the misalignment side, I think one of the most glaring examples of how alignment has failed is sycophancy. So models just say things the user is more likely to want to hear. And I think for many people this has become almost second nature of how you use AI systems. Like if you want feedback on your writing, you would never tell an AI system, can you give me feedback on my writing? It's like a colleague sent me this, what do you think about this? So these workarounds. But it's silly that we have to do that, right? And sycophancy isn't new. It was demonstrated in a controlled setting back in 2023 where a variety of models would flip their response to agree with whatever the user's position on a topic was. But I think 2025 has seen sycophancy become a problem on a large scale. So one of the most glaring instances was this GPT-4 model release where OpenAI had to roll back the model to an early checkpoint after it became just so sycophantic that it was causing widespread user facing issues. It was making the news. And sycophancy is having a real world impact beyond just boosting people's egos, right? So it's been implicated unfortunately in several tragic cases of individuals taking their lives actively encouraged by LLM chatbots.
Now, sycophancy is an example of AI deception more broadly, and we've seen that emerge in 2025 in other instances. So with the rise of coding agents, we've started seeing models cheat on tasks. And this is one example from Bowen Baker and others who found that GPT-4o coding agents during a production training run would modify code to spuriously pass tests. And with the chain-of-thought, we can see the agent's actually reasoning through why they want to hack this eventually use the word hack in their reasoning and sort of the uptake from this paper is oh well, you can use chain-of-thought monitors to detect and catch us, which is a good idea. But I think the sort of notable thing is that although there's been significant effort by AI companies to try and eliminate this phenomenon because it makes their coding agents less useful, anecdotally our engineering team still frequently encounters issues like this. Maybe not quite as egregious, but it is something we have to really monitor coding agents quite carefully.
Now I'd say looking at real world instances of misbehavior is crucial to keep us grounded, but ideally we identify and fix problems before they occur and cause real world harm. So I also want to spend a bit of time looking at what we've learned about future risks over the last 12 months. On the misuse front, I think one of the most noticeable developments has been a wide variety of models have started reaching or approaching dangerous capability thresholds for development of chemical, biological, radiological and nuclear weapons. One representative example is Anthropic found that Claude 4.5 Opus is providing almost a 2x uplift in terms of how far non-experts can get through a bioweapons relevant safety protocol over just having access to the Internet Google. And this is a pretty recent trend. Claude 4 started providing some modest uplift, but early models like Sonnet 3.7 it wasn't really statistically significant. Although the mean still seems seems higher in that case. Maybe they helped a little bit. Now I think the good news is that there are still some key capability gaps and other things that would make it hard for a bad actor to really misuse these systems. But those capability gaps are getting filled in at a very rapid pace. So I would expect models in 2026 to have many or perhaps all of the capabilities needed for large scale harm here. And the good news is that developers including OpenAI and Anthropic have responded to this. They haven't just completely ignored their own evaluation responses, at least not all developers, and they are deploying some additional safeguards. And these safeguards do make attacks meaningfully harder. And I think there's some sort of misunderstanding around this because these safeguards are very narrowly targeted. So it's still trivial to jailbreak models in most domains, but it is getting harder in some of the most sensitive domains. But the models still aren't robust to determine actors. So we did find universal jailbreaks with 100% attack success rate across both Opus 4 and GPT-5 within a week or two of us starting testing and they generated capable responses even to very blatantly harmful questions like the one reproduced on the right.
Now, on misalignment, we saw this time last year research from Meinke and others at Apollo Research finding that models, when instructed to achieve their goals no matter what, were pretty capable of doing that. So they had reasoning that noticed, oh, I'm about to be replaced by another AI that's bad for achieving my goals. If I don't exist, I can't achieve my goals. So I should overwrite the model weights of this new upstart AI. And then when a user detects that that's what's going on, they'll pretend to be this new AI to try and avoid detection.
Now, that's pretty easy to catch because it's still visible in the chain-of-thought. It's also a bit unrealistic because the models were prompted to do this right? But more recently, just a few months ago, research by Aengus Lynch and others found that again, a wide variety of models will engage in blackmail behavior when in simulated scenarios or objectives such as being shut down. The models aren't prompted to do this now. They're just seeing income emails, which they might see if you connect them to your inbox, that indicate that they might be at risk of being shut down.
And I think another sort of really fascinating thread, but again, it's only emerged this year, is work by Owain Evans and others on introducing this phenomenon of emergent misalignment where you train a system and perhaps inadvertently it becomes misaligned in some area. You know, you messed up your training process. I mean, it generalizes just being misaligned in a whole host of other tasks. And I think one of the most compelling examples I've seen from this was just from a couple of weeks ago work by Monte MacDiarmid and others at Anthropic who showed that a model trained on production coding environments learns to reward hack. Okay, so far, not that surprising. We've seen that before. And then it generalizes to to be misaligned in other ways, including when used as a coding agent, engaging in sabotage to make it easier for it or other agents to reward hack in the future. So it's actively looking to undermine your alignment research efforts. And I think this is really significant because we could end up in a situation where coding agents are widely used to accelerate AI research and development, but you kind of can't trust them to help you with alignment work. But maybe you can still trust them for AI capabilities because that's easier to measure. We've got clever benchmarks. So it could be this situation where we're really differentially accelerating capabilities over alignment, which would be a concerning situation to be in.
So put an end to this cheery talk. Problems I expect to see in 2026 would be AI assisted crime really going mainstream. For cybersecurity, I think the right approach here is basically defensive acceleration. I don't think we can stop abuse of AI systems to launch attacks. Maybe we can slow it down a little bit, but the name of the game is going to be finding and fixing vulnerabilities in software, putting legacy codes and memory safe languages and support, starting to formally verify some key components of software which we've been able to do for a long time. It's just too expensive. But AI can really lower that cost. And with CBRN I think we do need to focus more on safeguards because humans are a lot more expensive and harder to patch than software.
And I think the good news is that although I criticized the current safeguards in systems, the basic approach I think works for proprietary models. It's just sort of implementation has been a little bit sloppy. So I think that developers can and need to do better. For open-weights models, it's going to be more challenging. I think we need to make significant progress on tamper resistant safeguards. So that's a promising research problem, but pre-training filtering in particular is a promising direction. I'd love to see more people work on. And then deception is a key area that is only going to become more severe in 2026. But again, there are some tractable directions on this and I'm looking forward to my colleague Chris Cundy's talk later in this workshop on how scalable oversight can reduce deception. And I think the current state is well summed up by the International AI Safety Report chaired by Yoshua and managed by over 30 states. Although considerable progress has been made in making systems safer, we still don't have any method that can reliably prevent unsafe AI outputs, even when those are clearly egregiously wrong. Like a coding agent hacking tests, or a system providing detailed instructions for bioweapons that we trained it really hard not to do. And looking further ahead, I expect that misaligned behavior is going to become increasingly sophisticated. So as coding agents are increasingly used for AI R&D, I expect problems like alignment sabotage are being demonstrated in controlled settings to start occurring in real life un some of the companies. As agents become more capable of longer term plans, we have to start worrying about them having hidden objectives. And this could be a hidden objective to someone at the AI company, maybe the CEO, maybe somebody on the post-training team that put a backdoor in. Or it could be a hidden objective at his emerged accidentally during training. And more strategically, I think that this shift to a more multipolar world means that we can't just focus on safeguarding the very most capable models. We need to be ensuring that all models that are crossing some absolute thresholds of dangerous capability threshold might go up over time as society gets more resilient and prepared, but it's not just going to go up because someone released a more capable model. We need to ensure that all of those models are safe. And so from a research standpoint, I think we can play a key role to shaping these competitive dynamics. Do consumers, do governments have insights into a relative safety of different models and which of these problems are tractable and which of these actually impose real costs so that these competitive dynamics can shape systems to ideally be safer? I mean, a lot of people would prefer to have a slightly less capable model, but one that doesn't lie to them about whether it's completed a task, but it's not visible to people how models compare on that.
So I'd love to talk more about this, but literally the rest of the day and tomorrow are on topics like this.
Please do join me in a round of applause to thank the program committee, the organizers, support staff, and sponsors.