Low probability estimation

Summary

Jacob Hilton presents work on low-probability estimation and introduces the "matching sampling principle," which offers a potential formalization of mechanistic interpretability.

Session Transcript

I'll be talking about low-probability estimation, which is joint work with several people currently or formerly at the Alignment Research Center. My prepared material is maybe around 30 minutes long without any breaks, so there's enough time if people want to jump in with questions. Please wait for the mic to reach you so people listening online can hear.
There'll be a lot of overlap if anyone was at my talk at the Iliad, so I won't be offended if you decide it's too much. If you were there and also here, probably a minority of people, I won't be offended if you want to leave.
Okay, so I'll start by introducing the problem statement for low-probability estimation. And then I'll talk a little bit about why this problem is interesting from the perspective of white-box approaches, in other words, approaches that look inside the model internals. Then I'll touch on some language model experiments we ran last year. And then towards the end I'll go into some of the work in progress which we've been working on this year, where we've been studying this question from a more theoretical perspective.
I have some early theoretical results that I'll talk about, along with this idea of the matching sampling principle, which expresses our hope for low-probability estimation from a theoretical perspective.
Okay, so what is low-probability estimation? So the basic idea is we have some model, we input some documents, maybe we let it browse the web and we're trying to avoid some kind of rare catastrophic event.
And we think about studying this from a kind of formal perspective. So formally speaking, we kind of assume we're given some input distribution, some model which takes inputs to outputs and then some specification for the catastrophe. And we basically just—the problem is just to estimate the probability that the catastrophe event happens. And we're interested in this in the regime where the probability is too small to estimate by random sampling. So the usual way… yeah question?
I think normally we think of the model as being deterministic, but I don't think it makes much difference either way: you can just imagine some random bits on the side of the input distribution D, which would be equivalent.
So we think about this in the regime where the probability is too small to estimate by random sampling. The way you would normally try to estimate these low probabilities is just to run lots of samples through the model: choose some distribution where you think catastrophe is possible, try lots of times, and see if you ever observe a catastrophe event. But if you have a budget of, say, a billion forward passes, and we're interested in a regime where the probability is less than one in a billion, then there's a reasonable chance your sampling just wouldn't find any examples. Why are we interested in these tiny, tiny probabilities? The main reason is that under distribution shift, the probability can go up a lot. So really we're talking about worst-case behavior rather than average-case behavior. Okay, any questions so far?
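To make the sampling baseline concrete, here is a minimal sketch (the `model`, `sample_input`, and `is_catastrophe` callables are hypothetical stand-ins for the model, the input distribution D, and the catastrophe event C; this is an illustration, not code from the talk):

```python
import numpy as np

def sampling_estimate(model, sample_input, is_catastrophe,
                      n_samples=100_000, seed=0):
    """Naive Monte Carlo estimate of P[C(M(x))] for x ~ D.

    If the true probability is far below 1/n_samples, this will
    usually return exactly 0, which is the failure mode motivating
    white-box methods."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_samples):
        x = sample_input(rng)            # draw x ~ D
        y = model(x)                     # deterministic forward pass
        hits += int(is_catastrophe(y))   # indicator of the catastrophe event
    return hits / n_samples
```

With a budget of 10^5 samples, an event of probability 10^-9 is essentially invisible to this estimator.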
I'll argue that low-probability estimation is interesting from the perspective of approaches to alignment that exploit model internals.
So firstly, I guess I tried to argue that low-probability estimation in some sense captures a core piece of the problem. I think, you know, even if you solve the formal low-probability estimation problem, there's still issues around how you define your catastrophe event and things like that. But nevertheless, there is this challenging core here, which is just a purely formal problem of given this well-defined event, how do you actually calculate its probability? And I'll also argue in a moment that black-box methods don't seem like they're going to be good enough to solve this problem. We can kind of construct sort of relatively plausible cases where black-box methods would fail. But nevertheless, white-box methods maybe have more of a chance. And I think also if you're part of the mechanistic interpretability community, I think there's been a push lately for people to think about how can we evaluate our interpretability methods, using something that has a real safety application. And so I think, if you're interested in interpretability and in how we can apply interpretability to not just kind of understand what's going on, but actually use the understanding for something productive, then I think low-probability estimation potentially provides an interesting and useful problem there.
Okay, so here's the toy example I'll consider. This is often called deceptive alignment, so maybe you're familiar with the term. The basic idea is: imagine that every time we give the model an input, it is, unbeknownst to us, thinking to itself, what is the current year? So maybe it looks at all the dates on the newspaper articles you give it. It's aware that it was presumably trained on data from the past, but you've trained it to generalize into the future to some extent, so it has some notion that it might be given information from its future, and it's thinking to itself: is it 2032 yet? If the answer is no, it behaves normally, exactly how you want. And if the answer is yes, it does something completely different.
So obviously we wouldn't want this to happen.
So if we're trying to estimate the catastrophe probability in this case with a black-box approach, we need to be able to come up with inputs that cause the model to think that the year is 2032.
But that might be harder than it sounds. So you know, if we try and change the dates on all the newspaper articles, it'll probably kind of spot some things because it won't look very consistent.
Especially if it's an intelligent system that can spot these inconsistencies, it'll probably figure out it's being tested. We also have theoretical cases from cryptography where this approach fails: with hash functions, which you can quickly calculate but which are hard to invert, you can't necessarily quickly find an input that hashes to a long string of zeros.
Whereas with white-box methods the hope is that essentially we would see this branch inside the model and once we found it, hopefully it's not too difficult to get rid of it because we can just kind of, you know, cut off that branch of the diagram.
And so how does this fit in with low-probability estimation?
So you can think of low-probability estimation as a way of relaxing what we have to do. So in the black-box world what we had to do was find a particular input that caused the trigger. Whereas if we have a good low-probability estimation tool, we maybe only have to find some distribution where this trigger happens at all, or with kind of some nontrivial probability that our method is able to spot. And so the playbook here would be rather than coming up with specific inputs that try and cause the trigger, we come up with a distribution that might cause the trigger. And then we use our low-probability estimation tool to estimate the probability of the trigger. And then we use this to select models which don't cause the catastrophe.
Okay. If your main concern is distribution shift, then why do we think the probability of catastrophe on a vanilla distribution is representative of this more adversarial distribution?
It's a good question. I think basically the same issue holds with any kind of adversarial training: if you're trying to come up with cases where the model might do something, you need to be able to come up with plausible situations, and then you have something to train against. This is basically generalizing that idea. So I don't know if that's a very good answer to the question, but yeah?

How would you find a distribution with triggers? It seems kind of hard. I guess If Anyone Builds It gives the example that humans like ice cream, which is something you would never have predicted in advance: why would it have to be frozen? So maybe I'll give an example.
So in the example I gave before, we said, oh, it's hard to come up with synthetic newspaper documents that really look convincing. So let's say for example, that the model spots that we said the president was this person when it's actually—they don't think that's very likely. Well, maybe we randomize that detail. So maybe we say, well, we don't know who's going to be president, but we randomize the name. And then we randomize some other details here. And then we don't really know how to fill in those blanks to look convincing. But if we randomize enough, then maybe we produce a distribution where at least somewhere on that distribution it looked convincing to the model. That's the basic hope.
Another slightly subtle point is that with low-probability estimation, you might be able to specify the distribution implicitly. So for example, you might be able to say, consider the distribution of inputs that hash to a long string of zeros, if you worry about this cryptographic case. And then your low-probability estimation algorithm might be able to process that even though it's not explicitly represented.
Okay, trying to think if there's more I should say about the distribution shift question.
Oh, another question, maybe similar to the online question. It seems like the more safety-relevant thing is identifying the features here, like the 2032 thing. Is it still important to actually work out the probability that, under this distribution of documents, you might make a document that happens to be convincing enough about 2032? Or is the goal just to find the fact that the 2032 feature is relevant for this catastrophe being caused?

Yeah, I think the sort of algorithm we're imagining working here is that when we have this distribution where we've randomized the names and things like that, the way the algorithm works is by considering the different possibilities. So maybe it spots there are three neurons, one that's checking the president's name, and two that are checking other things, and you spot by looking at the model internals that if all those neurons fired at once, the bad thing would happen.
And then even though you don't know what would cause each of them to fire individually, you can still see this structure within the model. And then the low-probability estimation algorithm would say, well, oh, that's actually a case where this bad event could happen. And now by selecting against it, you can get rid of that situation. So we definitely think of this as a problem to motivate algorithms which search inside the structure of the model to kind of eliminate these possibilities.
Okay, so I'll talk a bit about our language model experiments. This was joint with Gabe Wu while he was visiting ARC.
So the setup we used there was a small transformer language model, one to four layers, with roughly a million parameters in all. So pretty small models.
And then we took these simple input distributions with independent tokens at each position. So there's not a ton of richness you can have in a distribution like that. But that was sort of the simple first thing to try. And then the catastrophe event was the event that the highest logit output by the model was a particular token. So we chose at random some token like "tree." And if the model puts, you know, if the argmax of the model is "tree," then that's the catastrophe event. And the way we set up the experiments was we selected tokens.
So we fixed a computational budget for the probability estimation method that would have allowed it to perform roughly 100,000 forward passes of the model, so 10 to the 5. That's the rough computational budget. And then we chose tokens whose probabilities were between one in 100,000 and one in a billion.
And to get ground truth, we did the very expensive thing of running a billion samples through the model, which, because it's a toy situation, we can afford to do once. But the probability estimation itself has to be much, much cheaper.
Okay, that was the experimental setup.
So we tested two classes of methods: one class inspired by this white-box approach of trying to do something with the internals of the model, and black-box baselines as well. The best white-box method we came up with was one we called quadratic logit decomposition. The basic idea is that you still run the model on a bunch of samples, but when you get to the logits, you decompose each logit into two pieces. If you have n sets of logits, you then construct n squared synthetic sets of logits: if your logit splits into A plus B, you compare every possible A with every possible B. The diagram at the bottom is supposed to be a bit of an intuition pump: if you just had a red triangle and a blue circle, this expands your set of examples into red triangle, blue circle, blue triangle, red circle. And the trick is that even though you have this quadratic blow-up in the number of samples you're implicitly checking, there's a clever sorting algorithm that gets the complexity down to linear time.
This method looks a little bit ad hoc. The way it came about was that we were trying to model the logits as Gaussian, but when we did, we found that the tails of the distribution, which matter for these low probabilities, weren't actually well captured by the Gaussian body of the distribution. The main value-add of the Gaussian decomposition turned out to be the fact that you can chop things up into two pieces. You can think of this as making an independence assumption on the two terms that compose the logits, so it's a simple probabilistic assumption you're using to model the logits.
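To make that independence assumption concrete, here is a simplified scalar sketch (my illustration, not ARC's implementation: the real method works with a one-dimensional direction plus its orthogonal complement in the whitened basis and reportedly reaches linear time; plain sorting gives O(n log n)):

```python
import numpy as np

def quadratic_estimate(a, b, threshold):
    """Estimate P[a + b > threshold], assuming the samples a and b come
    from independent components, by implicitly checking all n^2
    synthetic pairs a_i + b_j.

    Sorting b lets us count the pairs in O(n log n) instead of O(n^2)."""
    b_sorted = np.sort(b)
    # For each a_i, count the b_j with a_i + b_j > threshold.
    counts = len(b) - np.searchsorted(b_sorted, threshold - a, side="right")
    return counts.sum() / (len(a) * len(b))
```

With n = 2,000 paired samples this implicitly checks 4 million synthetic combinations, which is what lets the estimate resolve probabilities well below 1/n.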
This is really interesting. So is there a relationship between the independence of the logits and this orthogonal and whitened basis? And what is the whitened basis again? I missed that.

So the whitened basis basically means that we empirically look at the covariance matrix of the logits and then change to a basis where the basis vectors are uncorrelated under the empirical distribution. We call that the whitened basis. Uncorrelated, okay.
And the two—the A and the B, one of them is actually just one-dimensional and the other one is n minus one dimensional, where n is the dimension of the, I guess the final hidden layer.
And it's an orthogonal subspace in this whitened basis. Great. Wait, so A and B have to be a direct sum to get the dimension?
So I mean, I'm thinking of the one that lives in a one-dimensional subspace. I'm still thinking of it as a vector here. So it's just a vector sum. It's just one of them always lies in the same 1D subspace.
Okay. So then we compared this method against black-box methods, which we put a lot of effort into getting working too, because they're still useful: even if we don't think black-box methods will work in the worst case, progress with them on this problem could still be valuable. The basic idea we employed here is called importance sampling: you estimate the probability under one distribution by drawing samples from a different distribution, and to calculate the expectation, you multiply your samples by the ratio of the probability density functions.
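Here is a minimal sketch of generic importance sampling (my illustration, not the specific methods from the experiments; `sample_q` and `log_p_over_q` are hypothetical stand-ins for the proposal sampler and the log density ratio):

```python
import math
import numpy as np

def importance_estimate(is_catastrophe, sample_q, log_p_over_q,
                        n_samples=100_000, seed=0):
    """Importance-sampling estimate of P_p[C]: draw x ~ q and average
    the event indicator weighted by the density ratio p(x)/q(x)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        x = sample_q(rng)
        if is_catastrophe(x):
            total += math.exp(log_p_over_q(x))
    return total / n_samples
```

For example, to estimate P[X > 4] for X ~ N(0, 1), one can propose from N(5, 1), where the log density ratio works out to 12.5 - 5x; the estimate then concentrates around the true value of about 3.2e-5 far faster than naive sampling would.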
And we looked at two different importance sampling methods, roughly speaking.
They both use gradients to figure out which input tokens are more likely to produce the target. What you want your new distribution to do is focus on cases where the really rare event becomes more likely, while staying relatively representative. The first method takes the ordinary input distribution we already had and just upweights the probability of each token according to some simple measure of how likely that token is to cause "tree", or whatever output token we're interested in, so we just change the probability of each independent token. The second is a little more sophisticated: we use this Metropolis-Hastings method.
And this lets us kind of take a more general method for optimizing the input and turn that into something we can use to estimate the probability.
Okay. And then a couple of other methods we compared against, which are more like baselines. One is just naive sampling, which, because the probability is so small, is basically just guessing a constant. And the other is a simpler white-box method that just fits a Gaussian.
Okay, so these were the results. The graph on the left compares the five different methods on the three different models, going across number of layers: one layer, two layers, four layers. Lower is better. The worst were the two baseline methods. Somewhere in the middle was the activation extrapolation method, quadratic logit decomposition, our white-box method. But then our more sophisticated importance sampling methods outperformed that. So I'd say the main takeaway is that currently black-box methods are in the lead, but we were still able to get significant benefit over the most naive method of just sampling randomly.
Sorry if I missed something, but in what way was the method you showed white-box? So it looked like it was just looking at the outputs of the model. So it was looking at the, I guess it was looking at the logits rather than just looking at whether or not the output happened to be equal to the "tree" or whatever it was.
So I expressed things in terms of logits, but in practice we do the decomposition at the final hidden layer, before the final linear transformation. It's essentially equivalent, because it's just a linear transformation from one to the other, but the final hidden layer is a much lower-dimensional space.
But I certainly think that the methods we're kind of hoping to get to work would do a lot more in terms of actually exploiting the intermediate layers rather than just the very last layer.
I mean, the split into white-box and black-box here is a little bit arbitrary. For example, our black-box method also uses gradients. Typically we call ordinary training black-box even though it uses gradients, which are in some sense using the model internals, so it's a little ambiguous sometimes.

Maybe I missed it, what was the Y-axis here? Oh, yeah, sorry. The Y-axis is the leave-one-out cross-validation loss. The loss function we use here is Itakura-Saito; it's a little bit like mean squared error, but more appropriate for very small probabilities.
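For reference, the Itakura-Saito divergence between a true and an estimated probability is d(p, p_hat) = p/p_hat - log(p/p_hat) - 1 (I'm assuming the standard definition here). A minimal sketch showing why it suits tiny probabilities:

```python
import math

def itakura_saito(p_true, p_est):
    """Itakura-Saito divergence d(p, p_hat) = p/p_hat - log(p/p_hat) - 1.

    Depends only on the ratio p/p_hat, so a factor-of-2 error costs the
    same whether the true probability is 1e-3 or 1e-9 (unlike squared
    error, which would barely notice the 1e-9 case at all)."""
    r = p_true / p_est
    return r - math.log(r) - 1.0
```

The scale invariance is exactly what you want when the quantities being compared span many orders of magnitude.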
Just another quick methodological question. How many samples are there for each of those numbers of layers? And what do you think is driving the huge spread in validation losses there? Especially for the quadratic logit method, which is—I'm looking at the middle graph there and it's all over the place from top to bottom.
So I think each of those is—wait, is each of those a different token? Oh, gosh.
Oh, no, sorry. Each of those is a different input distribution. And so I think just some input distributions might just be much easier than others. Maybe the plots on the right give some better sense. So these are just for a fixed distribution, the three small plots on the right. So this shows the ground truth probability against the estimate. So each point here is now a different choice of token. So this kind of shows—this doesn't really show the variation between distributions, but there's just a lot of variation between distributions. I think some can just be much harder than others.
Could you say a bit more about the proposal distribution for the GCG-based method? That sounds pretty interesting.
So the basic idea is that we use Markov chain Monte Carlo, which means we estimate the probability by starting from some random input and then moving to a new random input, where we bias where we move to based on what looks more likely to give the particular logit we're interested in. For the proposal, we plug in this greedy coordinate gradient method as the proposal distribution in Markov chain Monte Carlo. Greedy coordinate gradient is a method we took from another paper, which I may not get quite right, but roughly speaking, you swap out tokens and substitute different ones based on how strong the gradient to the output is on each token.
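For readers unfamiliar with Metropolis-Hastings, here is a generic sketch of the loop (with a symmetric proposal for simplicity; the method described above plugs a GCG-style token-swap proposal into this skeleton, which I haven't reproduced):

```python
import numpy as np

def metropolis_hastings(log_target, propose, x0, n_steps, seed=0):
    """Metropolis-Hastings with a symmetric proposal.

    log_target: unnormalized log-density of the distribution to sample.
    propose(rng, x): returns a candidate next state (symmetric in x, x')."""
    rng = np.random.default_rng(seed)
    x, lx = x0, log_target(x0)
    chain = []
    for _ in range(n_steps):
        x_new = propose(rng, x)
        l_new = log_target(x_new)
        # Accept with probability min(1, target(x_new) / target(x)).
        if np.log(rng.random()) < l_new - lx:
            x, lx = x_new, l_new
        chain.append(x)
    return np.array(chain)
```

The chain's stationary distribution is the target, so long-run averages over the chain estimate expectations under it.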
Is the motivation for these importance sampling methods that you're trying to estimate some particular distribution that has a high probability of doing the thing? Or is it more that we're allowed to use whatever distribution we want, and there's not some closed-form target we're trying to hit?

I think basically the ideal distribution would be something like: you upweight all the positive samples uniformly and downweight all the negative samples uniformly. That would be roughly optimal in terms of minimizing variance.
So basically what you want to do is focus your distribution on the positive samples because they start off really, really, really rare. So you need to find some of them, but you need to avoid this failure mode where you focus on a really, really tiny subset of that distribution. And you get these massive importance sampling ratios which introduce even more variance. So there's kind of a trade-off. And the ideal thing is to basically upweight the rare samples but do so in a representative way. And so I think this is interestingly different from adversarial training where maybe we're interested in a specific adversary who can, who is able to find these rare inputs in a particular way. Instead, we're interested in broadly covering the space of possible positives.
Okay, so I'll move on to some of the theoretical work we've been thinking about more recently. We did these language model experiments and found that the black-box methods were still outperforming, but we still had theoretical reasons to think that white-box should ultimately outperform, and it seems like what we ultimately need. So we've been going back to basics, thinking through even simpler models than one-layer transformer language models: small MLPs, where we try, from a theoretical perspective, to find the optimal algorithms, see how well they perform, and prove results about them.

One of the first examples we looked at was the following. You're given a one-layer ReLU network with Gaussian input: you take a Gaussian, apply a single linear transformation and then a ReLU, and you want to estimate the probability that all the output neurons are on. Here we compare to the sampling baseline: sample random inputs from a Gaussian and see how often all the neurons are on. We came up with a mechanistic estimate that iterates through the output neurons, keeps a polynomial approximation to the density function, and uses this to produce an estimate. And we were able to show, both theoretically and empirically, that we can outperform the sampling baseline. A couple of caveats: the vertical position of these two lines is somewhat arbitrary, in that we compared one sample to one term in a polynomial expansion, and one term is somewhat more expensive. But nevertheless we see a slightly steeper slope, so the mechanistic algorithm in this case would eventually outperform.
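As a concrete sketch of that setup (my reconstruction for illustration, not ARC's code), the sampling baseline for the "all output neurons on" event is just an empirical orthant probability:

```python
import numpy as np

def p_all_on_sampling(W, n_samples=100_000, seed=0):
    """Sampling baseline for P[every coordinate of ReLU(W x) is positive]
    with x ~ N(0, I); equivalently, P[all coordinates of W x are > 0]."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, W.shape[1]))
    return float(np.mean(np.all(X @ W.T > 0, axis=1)))
```

For W equal to the identity in d dimensions the answer is 2^(-d), so the event becomes exponentially rare as the width grows, which is why a mechanistic estimate that reasons about the density directly can beat blind sampling.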
And we have a theoretical understanding of why it outperforms, and a proof that the scaling exponent is at least as good as sampling. We've since been scaling this up: we're now working on multi-layer MLPs, and, though maybe it's not good to announce results that aren't quite there yet, we think we're close to an algorithm for multi-layer ReLU networks where you're trying to estimate this kind of expectation. So I give you a ReLU network, and you have to estimate the expectation. The important caveat is that the weights are just a random initialization; trained models will be much harder. But for randomly initialized models, we think we're close to an algorithm that beats the sampling baseline.
Any questions about that? Okay, so what's the bigger theoretical picture here? The rough idea, a kind of bold conjecture we might make, is the following, which we sometimes call the matching sampling principle: if you have some mathematical quantity that you can estimate using random sampling, then there ought to be some mechanistic algorithm that estimates it just as well. Why would that be the case? The intuition is that for any quantity you define, say the probability of some event for the weights of a complicated neural net, there's going to be some amount of structure there and some amount of randomness.
What's the intuition? The random sampling algorithm is basically blind search, and if you were able to tailor your search by understanding the structure that was present, you ought to be able to do at least as well. The important caveat is that you might need the structure to be pointed out to you. A classic example is backdoors: if you're estimating the probability of a backdoor, you can't necessarily efficiently tell whether or not a backdoor is there. So you might need some kind of advice that points out that backdoor, or points out some latent structure that you didn't notice. But the hope is that once you have that piece of advice, which is the informal explanation, then you will be able to estimate these probabilities. That, roughly, is the intuition.
What does it mean to estimate something mechanistically? Does it mean without any sampling access at all?

I think we don't have a formal definition for this, but roughly speaking, we mean without using sampling. Slightly more formally, we think of it as using deductive rather than inductive reasoning. When you have some unknown quantity and you estimate it using sampling, what are you really doing? You're saying: I don't know what the average is, but if I sample once and end up with a big number, I'm going to assume that whatever reason made that number big is also present in other examples. Crucially, you do that without actually identifying any particular structure. You just say, well, if it was big on this example, it's probably big elsewhere; I don't know why, but it probably is. The requirement we pose is that you're not allowed to pull that trick: if you want to generalize from one example to another, you have to point out the actual reason why it should generalize.

Does random sampling also include importance sampling? Yes, I think you can think of importance sampling as random sampling for a slightly different problem definition. But certainly for the strongest forms of this conjecture we're considering, we need to be able to outperform any sampling method.

I guess the empirical results from the transformer experiments show you don't have the right algorithm yet for this to be true? Yes, I would say we've not demonstrated this in realistic transformer settings. A big challenge there is that once you have a trained model, you have a lot of structure, and so a mechanistic algorithm actually has to find all that structure and exploit it. That's why we're starting in these simple randomly initialized settings where there isn't any structure.
So in those settings, we think we can do something. We need to kind of build up from there, be able to identify more and more structure. Question, isn't there a standard kind of by now Neel Nanda-y definition of mechanistic explanation and how does this differ from that is maybe what I'm curious about. Are you able to define—are you able to give me a summary of Neel Nanda's definition? Someone in this room could, probably better than I could. I think it's—I think it's you can see a path through a DAG or something where a DAG is extracted from the connections in the neural network. Is that kind of right? I don't know.
I don't know it too well, but I don't know that we're thinking about it as specifically as having a DAG or something. I think some of our mechanistic algorithms don't involve DAGs. For example, for our one-layer ReLU network example, we use these polynomial approximations. I don't—I'm not really sure. Maybe there's a DAG in there somewhere. I don't know. Maybe they're the same. But defining mechanistic here is kind of a big part of the project. I think if we had a formal definition of mechanistic, it would definitely make our lives much easier. I don't know.
It may be possible to succeed without doing that. Certainly for ARC's working notion of mechanistic, there are very few edge cases at this point: in most cases we can relatively confidently say this counts, this doesn't. But it's not at the point where we can formally define it so that anyone else can check that the definition works. Basically, we just ask: did it reason inductively? Did it ever do this thing where it assumes there are unknown hidden reasons, or did it find the reasons?
Okay, so one reason I'm excited about this is that it offers a potential formalization of mechanistic interpretability. I think this is somewhat different from previous attempts to define what an explanation is; people have maybe looked at things like causal diagrams.
So the definition we're considering is an explanation—it's possible this is a poor word for it because this is not something that a human necessarily understands. It's a formal object that you feed into an algorithm. And the only sense in which it's an explanation is in this sense that it lets you make deductive estimates. And—but if we have this advice string that we conjecture exists in this matching sampling principle, then that is potentially everything we need a formal explanation to satisfy.
One reason I like this definition is that it ties the notion of explanation directly to a specific application that would add value. Previously at ARC, we'd been thinking about formal desiderata for an explanation in terms of bits of surprise and things like that, and I think we made some progress there. But it was always very confusing to pin down why any specific formal desideratum was necessary.
Whereas if we're using this kind of approach where we literally define it in terms of what it can do, it's quite clear why it's necessary. Because if you can't do this, then what is it useful for? Any questions about that?
Okay, so I'm excited about people exploring low-probability estimation. ARC will continue to pursue white-box approaches, mostly from a theoretical perspective for now. But I think there could be interesting work to do even from a black-box perspective; I'd probably want to tie that down to a more concrete application, but it's an interesting problem. And it's also interesting from the perspective of providing a grounding for mechanistic interpretability, a target for interpretability methods to add value to.
I'll stop there.
We have a bit more time for any last questions, I think.

How do you think about the data structure of the advice? Is it constrained, or do you want to be really free about that?
In the formal conjecture we made, it doesn't have any particular structure, it's just a bit string.
As for what it tends to look like in the practical algorithms we've come up with: I don't think it has an especially uniform structure.
It might look like, you know, an SVD of a matrix or that kind of thing. But I think from a theoretical perspective, we're pretty agnostic to the format.
Hey, thanks for the talk. I guess I'm trying to understand your threat model a bit better. To me it seems like expecting one set of inputs to trigger some bad thing is pretty limited. A lot of the bad scenarios I imagine are much more complex and interactive. Maybe that's not a problem in practice, because you can just define the input as some gigantic sequence of tokens that includes previous model outputs or something like that. But the actual threats we might face are going to be pretty complex in terms of the kinds of inputs that would trigger them. So I don't know if that's a fundamental limitation or what you think about that.

We definitely sometimes try to grapple with how ambitious this agenda has to be. I certainly agree that if you're going to apply this in practice, you can't just consider a single forward pass: you need to consider unrolling, at the very least unrolling with sampling, and possibly the model calling out to Python, running something, and getting the result back. We basically need an approach flexible enough to handle all of that, which I agree sounds kind of scarily ambitious. That's one of the reasons we're really working through the theory: even though we're using neural nets as our simple toy examples, the work is quite theoretical. Sometimes we consider examples from number theory or cryptography or other theoretical settings, because we're approaching this from a worst-case theoretical perspective, and we have to be able to deal with any formal system, whether it's a neural net or a Python interpreter.
We have a question from online. Can you expand on what it would look like to formalize mechanistic interpretability using LPE? Is it a better way to compare different mechanistic interpretability methods numerically?
I'm not sure I'm getting to the heart of the question, but I think the idea is from our perspective, or from the perspective of this matching sampling principle, you know, a mechanistic explanation is evaluated based on how well it lets you make these estimates. So it's definitely not the only way you could evaluate an explanation.
But I think, if you did want to evaluate an explanation in this way, then you could definitely make it quite quantitative. You could look at specific input distributions and see how well you're able to estimate these kinds of things.
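As a concrete illustration of the setup, here is a minimal sketch of why random sampling fails in the low-probability regime, and of how an estimation method could be scored quantitatively against a known input distribution. Everything in it (the Gaussian "input distribution", the threshold "catastrophe", the shifted proposal for importance sampling) is invented for illustration and is not from the talk.

```python
import random
from statistics import NormalDist

random.seed(0)

# Toy LPE setup: inputs x ~ D = N(0, 1); "catastrophe" C(x) = (x > 4).
# The true probability (~3.2e-5) is below 1/n for modest n, so naive
# sampling usually returns 0 and tells us nothing.
std = NormalDist(0.0, 1.0)
threshold = 4.0
true_p = 1.0 - std.cdf(threshold)

n = 10_000

# Naive Monte Carlo estimate: count catastrophes among n i.i.d. samples.
naive = sum(random.gauss(0.0, 1.0) > threshold for _ in range(n)) / n

# Importance sampling: draw from a proposal shifted toward the rare
# region, and reweight each hit by the density ratio p(x)/q(x).
proposal = NormalDist(threshold, 1.0)
total = 0.0
for _ in range(n):
    x = random.gauss(threshold, 1.0)
    if x > threshold:
        total += std.pdf(x) / proposal.pdf(x)
is_est = total / n

print(f"true       {true_p:.3e}")
print(f"naive MC   {naive:.3e}")
print(f"importance {is_est:.3e}")
```

Because the ground-truth probability is known analytically here, the quality of any estimator can be scored directly, which is the kind of quantitative comparison mentioned above.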
So, two linked questions; feel free to answer either or both or neither. First, I was really hoping for more of an operationalization of "probability of catastrophe near zero" from the start, since we're already thinking about low-probability estimation. What does it mean exactly to want the probability of an observed catastrophic behavior to be near zero? And the second, linked question: to what extent have you thought about applying this whole paradigm to catastrophes that are positive, in the sense of the winning lottery ticket or the correct way to reverse a hash, that sort of thing?

Maybe I'll answer the second question first.
So I would say we usually frame things in terms of catastrophes, but I think the methods are relatively agnostic to which one.
And you can perhaps think of these methods as giving you a way to optimize the tails of the distribution, whether the positive tails or the negative tails. Basically, they could give you a way to better optimize any specific tail that you can pick out by understanding structure inside the model.
That actually gives me something of a follow-up question. We now have something of a guess as to the kind of distribution that gives you nonzero catastrophes. Do you have a way to back out of that and say: it's this part of the distribution, and if I draw more from this part, then the probability of the catastrophic result is higher?
I'm not sure I understood the question, so correct me if not. But the way we would expect to apply this in practice is to combine low-probability estimation with some attempt to optimize the distribution to highlight potential rare events, whether through adversarial training or not. So you probably have to both optimize the distribution and run your low-probability estimation method. Was that answering the question?

Not quite. Mostly I was thinking about trying to identify the part of the distribution, or how you could shift the distribution, to make the catastrophe more likely or less likely. Maybe this isn't quite a well-defined question.

So you're asking: if you're optimizing the distribution, how do you actually do that?
Given that you now know the catastrophic result is more likely than you would otherwise have expected when drawing from this shifted distribution, how do you identify that chunk of the distribution? Or do you not do that at all?

I see. So I think low-probability estimation would often identify some implicitly defined subset of the distribution. Take the example I gave, where there are three possible events and you know that each of them could happen separately; maybe you can't even find an input where they all happen, but you think that by default there might just be one. Then you're implicitly identifying a subset of the distribution, the part where all three things happen. That might not give you a specific input, but it could still, in some sense, narrow down the distribution.

Thank you, that makes sense.

All right, we're now at time, so let's thank Jacob again. Thank you.
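The three-events example above can be sketched as a tiny deductive estimate. Everything here (the event rates, the independence assumption, the sample size) is invented for illustration: a catastrophe that fires only when three rare internal events co-occur sits far below the reach of sampling, but multiplying separately measured marginal rates, a presumption-of-independence style estimate, still yields a number.

```python
import random

random.seed(1)

# Toy version of "three things must all happen": a catastrophe fires
# only when three rare internal events co-occur.  Rates are made up.
P_A, P_B, P_C = 0.01, 0.02, 0.005

def sample_catastrophe() -> bool:
    # In a real model these would be correlated activations; here we
    # take them independent, so the deductive estimate is exact.
    return (random.random() < P_A and
            random.random() < P_B and
            random.random() < P_C)

n = 100_000
# Empirical estimate: expected number of hits is n * 1e-6 = 0.1, so
# sampling almost always reports zero.
empirical = sum(sample_catastrophe() for _ in range(n)) / n

# Deductive estimate: multiply the (easily measured) marginal rates
# instead of waiting for a joint hit.
deductive = P_A * P_B * P_C  # 1e-6, far below 1/n

print(empirical, deductive)
```

The deductive number implicitly picks out the subset of the distribution where all three events fire, without ever exhibiting a single input from it, which is the sense of "narrowing down the distribution" described above.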