Our Pivot To Pragmatic Interpretability
Summary
Neel Nanda describes his mechanistic interpretability team's shift from ambitious reverse-engineering goals to pragmatic, impact-focused research grounded in empirically testable proxy tasks that address critical AI safety challenges on today's models.
SESSION Transcript
I'm going to be talking about a pair of blog posts that I put out yesterday on how my team is currently thinking about trying to have impact via interpretability work and how we pivoted to what I think of as a more pragmatic vision. I put these out at 4 AM yesterday, so I have largely delegated slide creation to NanoBanana. Apologies in advance.
I want to first begin by talking about some case studies that I found pretty influential in my thoughts of how we should go about interpretability. My first is this great work from Anthropic Interpretability on evaluation awareness in Claude 4.5 Sonnet. When Anthropic ran their misalignment evaluations on Sonnet, they observed that they seemed to have magically solved alignment. Sonnet was just 0% misaligned. It was glorious.
Unfortunately, Sonnet had just realized when it was being tested and was vastly better at doing this than previous models. But they found that via incredibly stupid methods—taking the difference in activations between prompts like "you're in a test" and "you are talking to a real user" and subtracting this vector—they seem to be able to get Sonnet to not think it was in an evaluation, and they could go back to actually trying to apply their misalignment evaluations.
I really like this as a case study of what having impact with interpretability can look like. There's a bunch of takeaways from here. The two I want to highlight right now: one, this work only became possible very recently when we had much more capable models. We can actually empirically study things like evaluation-aware behavior in a way we just couldn't a year ago. And secondly, historically interpretability has been really excited about ambitious reverse engineering and aiming for near-complete understanding. And that was all basically fucking useless. The thing that worked was subtract the difference in activations on a single pair of prompts. But I still think this is really impactful work, and this has pretty important implications for what we should be doing.
A second case study: My team spent most of 2024 investigating this interpretability technique of sparse autoencoders. And this is a metaphorical representation of what the first half of 2024 felt like. It was just, well, we have this technique. We have some pretty mediocre metrics—how well can this reconstruct a model in an interpretable way? But this is a number and we can try to do better on this. But is what we're doing really helping? Is this technique useful? I honestly don't think I learned that much in the first half of 2024.
The thing that shifted was rather than asking "was I on track to fully find an interpretable decomposition of this model?" it became "what can we do with this technique?" Trying to investigate proxy tasks, objective things that aren't necessarily the thing that I most care about, but which track the longer-term things I hope to be able to do with these techniques—because ultimately I do this work because I want to be able to do useful things that matter. And strikingly, what we found when we started doing this is: for tasks that look like "I don't know what's happening in the system, can I come up with hypotheses for why, even if I didn't expect them?"—sparse autoencoders are actually pretty great. While for things that are more like "I kind of know the concept I want, can I detect or manipulate this concept?"—they're just lost to dumb things like steering vectors and linear probes. This was just a way more productive way of trying to do research here.
The broader extrapolation from this of how I try to approach research now: I want to do research that produces true and useful insights for making AGI go well. We don't have AGI yet, so this is fucked. My best solution is start by thinking about what it would take to actually make this stuff go well—the theory of change. Identify some North Star, like "make models believe they are not being tested," that I want to succeed at, and then identify some objective tasks that can be studied today—a proxy task—and then actually go and use that to give myself feedback on whether my research has produced any value whatsoever.
There's two key things this process gives us. Having an objective task is important for finding insights that are true. It's really, really easy to fool yourself when doing research, and if you don't have good grounding, this is just a mess. But also, even if your insights are true, they're not necessarily useful. Which is why I'm not just saying find empirical tests—I'm saying find empirical tests that we actually believe are decent proxies towards the North Stars that actually matter, even if we can't test them yet, like making future models not think that they are being tested. This would be great.
Does this mean that we are doing applied interpretability now? No. I think that curiosity-driven research can sometimes be incredibly powerful, but we need to be somewhat careful and grounded about how we go about doing this. The approach that I've empirically found to work best preserves the "explore your curiosity" bit, but tries to ground it by strategically starting in a setting we think is fruitful. For example, we've recently been doing some fun work looking at all of the RL training transcripts from Gemini 3. A lot of weird stuff in there. If we just fuck around and find out, cool stuff happens. Then, go forth and follow your curiosity, but try to time-box this. Check in after some period of time—we aim for the aggressive two weeks—and at this point, try to find some objective task that we think validates whether our curiosity has found something that is true and useful. And if this hasn't, either try a different approach on the same problem or just go try something else. Because research is large and there are many things you can be doing and most are not fruitful. You need a thing that helps you fail fast.
Trying to put this all together, our current flow looks something like: start from all of the things that we think we could be doing to try to help AGI go well. Filter for the things that we think are on the critical path to AGI going well. Filter for the things that not only do we think matter, but we also think are the comparative advantage of interpretability researchers—a much broader question than just "is this interpretability research?" And further filter for things that we think we can empirically study on today's models, which in my opinion is a much larger set now than it was even a year ago, because models just keep getting better. And this means that we don't need to blindly fumble in the dark anymore. We can just get better empirical feedback.
What do I actually mean by the comparative advantage of interpretability researchers? This is vague. A few things that we found seem to be notable ways that my team and others in the community can add value. One: working with internals, even if we're not using this for interpretability at all. Things like "steer a model to not think it's being tested" is just really useful and has different strengths and weaknesses to the rest of ML, in a way that means that the set of things that our portfolio of techniques can do is expanded.
Secondly, deep investigation of what's going on with models, even if this is not using mechanisms or even internals. There's a bunch of general mindsets that are required to rigorously investigate what drives a model's behavior, and this underlies a lot of important questions in safety. Interpretability is a bunch of tools, including tools for things like coming up with novel ideas for why a model did something we wouldn't have thought of, and also a lot of experience with the many pitfalls you run into when you try doing scientific investigations here and drown in the mountain of confounders you get when actually dealing with real-world model behavior.
Historically, interpretability has gotten pretty anchored on some narrow theories of change, like "fully reverse engineer the model so we can say for sure there's no deception there." This is pretty narrow. Read the post for a bunch of ideas, but one I'll call out: signs of misalignment. It's plausible that at some point we catch a model doing something really sketchy. For example, it puts some major security vulnerability in important monitoring code. But this could be that the model is confused, and we're not going to get a proper response unless we can provide a rigorous case for what happened, why, and should we care? This is a really tractable interpretability problem, but quite different from what people have previously done.
It's not a silver bullet. We're not going to solve safety entirely by ourselves. But there's a lot of ways we can help other areas. And I will leave up this slide of common objections and my counterarguments, and come say “hi” at my poster half three if you want to chat more.
I want to first begin by talking about some case studies that I found pretty influential in my thoughts of how we should go about interpretability. My first is this great work from Anthropic Interpretability on evaluation awareness in Claude 4.5 Sonnet. When Anthropic ran their misalignment evaluations on Sonnet, they observed that they seemed to have magically solved alignment. Sonnet was just 0% misaligned. It was glorious.
Unfortunately, Sonnet had just realized when it was being tested and was vastly better at doing this than previous models. But they found that via incredibly stupid methods—taking the difference in activations between prompts like "you're in a test" and "you are talking to a real user" and subtracting this vector—they seem to be able to get Sonnet to not think it was in an evaluation, and they could go back to actually trying to apply their misalignment evaluations.
I really like this as a case study of what having impact with interpretability can look like. There's a bunch of takeaways from here. The two I want to highlight right now: one, this work only became possible very recently when we had much more capable models. We can actually empirically study things like evaluation-aware behavior in a way we just couldn't a year ago. And secondly, historically interpretability has been really excited about ambitious reverse engineering and aiming for near-complete understanding. And that was all basically fucking useless. The thing that worked was subtract the difference in activations on a single pair of prompts. But I still think this is really impactful work, and this has pretty important implications for what we should be doing.
A second case study: My team spent most of 2024 investigating this interpretability technique of sparse autoencoders. And this is a metaphorical representation of what the first half of 2024 felt like. It was just, well, we have this technique. We have some pretty mediocre metrics—how well can this reconstruct a model in an interpretable way? But this is a number and we can try to do better on this. But is what we're doing really helping? Is this technique useful? I honestly don't think I learned that much in the first half of 2024.
The thing that shifted was rather than asking "was I on track to fully find an interpretable decomposition of this model?" it became "what can we do with this technique?" Trying to investigate proxy tasks, objective things that aren't necessarily the thing that I most care about, but which track the longer-term things I hope to be able to do with these techniques—because ultimately I do this work because I want to be able to do useful things that matter. And strikingly, what we found when we started doing this is: for tasks that look like "I don't know what's happening in the system, can I come up with hypotheses for why, even if I didn't expect them?"—sparse autoencoders are actually pretty great. While for things that are more like "I kind of know the concept I want, can I detect or manipulate this concept?"—they're just lost to dumb things like steering vectors and linear probes. This was just a way more productive way of trying to do research here.
The broader extrapolation from this of how I try to approach research now: I want to do research that produces true and useful insights for making AGI go well. We don't have AGI yet, so this is fucked. My best solution is start by thinking about what it would take to actually make this stuff go well—the theory of change. Identify some North Star, like "make models believe they are not being tested," that I want to succeed at, and then identify some objective tasks that can be studied today—a proxy task—and then actually go and use that to give myself feedback on whether my research has produced any value whatsoever.
There's two key things this process gives us. Having an objective task is important for finding insights that are true. It's really, really easy to fool yourself when doing research, and if you don't have good grounding, this is just a mess. But also, even if your insights are true, they're not necessarily useful. Which is why I'm not just saying find empirical tests—I'm saying find empirical tests that we actually believe are decent proxies towards the North Stars that actually matter, even if we can't test them yet, like making future models not think that they are being tested. This would be great.
Does this mean that we are doing applied interpretability now? No. I think that curiosity-driven research can sometimes be incredibly powerful, but we need to be somewhat careful and grounded about how we go about doing this. The approach that I've empirically found to work best preserves the "explore your curiosity" bit, but tries to ground it by strategically starting in a setting we think is fruitful. For example, we've recently been doing some fun work looking at all of the RL training transcripts from Gemini 3. A lot of weird stuff in there. If we just fuck around and find out, cool stuff happens. Then, go forth and follow your curiosity, but try to time-box this. Check in after some period of time—we aim for the aggressive two weeks—and at this point, try to find some objective task that we think validates whether our curiosity has found something that is true and useful. And if this hasn't, either try a different approach on the same problem or just go try something else. Because research is large and there are many things you can be doing and most are not fruitful. You need a thing that helps you fail fast.
Trying to put this all together, our current flow looks something like: start from all of the things that we think we could be doing to try to help AGI go well. Filter for the things that we think are on the critical path to AGI going well. Filter for the things that not only do we think matter, but we also think are the comparative advantage of interpretability researchers—a much broader question than just "is this interpretability research?" And further filter for things that we think we can empirically study on today's models, which in my opinion is a much larger set now than it was even a year ago, because models just keep getting better. And this means that we don't need to blindly fumble in the dark anymore. We can just get better empirical feedback.
What do I actually mean by the comparative advantage of interpretability researchers? This is vague. A few things that we found seem to be notable ways that my team and others in the community can add value. One: working with internals, even if we're not using this for interpretability at all. Things like "steer a model to not think it's being tested" is just really useful and has different strengths and weaknesses to the rest of ML, in a way that means that the set of things that our portfolio of techniques can do is expanded.
Secondly, deep investigation of what's going on with models, even if this is not using mechanisms or even internals. There's a bunch of general mindsets that are required to rigorously investigate what drives a model's behavior, and this underlies a lot of important questions in safety. Interpretability is a bunch of tools, including tools for things like coming up with novel ideas for why a model did something we wouldn't have thought of, and also a lot of experience with the many pitfalls you run into when you try doing scientific investigations here and drown in the mountain of confounders you get when actually dealing with real-world model behavior.
Historically, interpretability has gotten pretty anchored on some narrow theories of change, like "fully reverse engineer the model so we can say for sure there's no deception there." This is pretty narrow. Read the post for a bunch of ideas, but one I'll call out: signs of misalignment. It's plausible that at some point we catch a model doing something really sketchy. For example, it puts some major security vulnerability in important monitoring code. But this could be that the model is confused, and we're not going to get a proper response unless we can provide a rigorous case for what happened, why, and should we care? This is a really tractable interpretability problem, but quite different from what people have previously done.
It's not a silver bullet. We're not going to solve safety entirely by ourselves. But there's a lot of ways we can help other areas. And I will leave up this slide of common objections and my counterarguments, and come say “hi” at my poster half three if you want to chat more.