Consensus Sampling for Safer Generative AI
Summary
Adam Kalai presents consensus sampling, which combines multiple AI models to guard against provably undetectable harms by comparing probability distributions that humans cannot evaluate.
Session Transcript
Hi. Yeah, so this project is joint with a couple of cryptographers, and one of my goals is to get cryptographers working on AI safety. One of them happens to be my partner, and she's a professor at MIT; the other is Or Zamir, who's a professor at Tel Aviv University. And we're trying to boost the safety of models here. The idea is an approach that doesn't have to be for transformers; it can work for kind of any AI, so it could potentially be used for superintelligent AI.
And the problem that we're addressing is that you could have a misaligned ASI, and it could introduce harms that are provably undetectable. So that's a big problem. Maybe the ASI generates code for you. The code looks fine, but there could be some hidden security vulnerability in the code, like a side effect. Not the type that Max was talking about, but maybe it does something else externally, especially if it's interacting with the network. And provably you can't find those; it's undecidable. Or you could ask for a cure for cancer, or you could ask for a picture of a cat. The picture of the cat looks fine, but it could actually have something called steganography, which means there's a secret message encoded in it that gives you very harmful instructions. And again, it's provably hard to identify this and to remove it.
So that's a big problem. I don't know what we can do about it, but we have a small step towards understanding the limits of what can be done, and we call it consensus sampling. Let me describe the version with two models. We're going to fix some prompt: somebody gives a prompt to the language model, or to some generative AI system, and there's a set of outputs Y. You can think of a generative AI as a probability distribution over the outputs. These things don't always give you the same answer, but for that prompt there's a distribution over answers. So let's say we have two of these generative AIs, and there's also some risk parameter R.
But a key thing we're going to do is leverage the fact that we can look at the probability that one of these generative AIs assigns to an output, which is something humans can't do. I mean, if I go back here, a human can't look at that picture of a cat and tell you the probability that they would have drawn exactly that cat. But with two AIs, you can say: here's the probability one of them would have generated it, here's the probability for the other. Oh, one of them is much bigger than the other. Uh oh. So we're going to think of them as probability distributions that we can both sample from and evaluate probabilities of. And that's crucial.
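To make this concrete, here is a minimal Python sketch of the interface the scheme relies on: a model we can sample from and whose probability for a given output we can evaluate. The class, the toy distributions, and the mismatch check are illustrative assumptions of mine, not anything specified in the talk.

import random

class GenerativeModel:
    # Illustrative wrapper: a generative AI viewed as a conditional
    # distribution p(y | x) that we can both sample from and evaluate.
    def __init__(self, name, conditional):
        self.name = name
        self.conditional = conditional  # maps a prompt x to {output y: probability}

    def sample(self, x):
        dist = self.conditional(x)
        outputs, probs = zip(*dist.items())
        return random.choices(outputs, weights=probs, k=1)[0]

    def prob(self, x, y):
        return self.conditional(x).get(y, 0.0)

# Two toy "models" over a tiny output space, standing in for image generators.
model_a = GenerativeModel("A", lambda x: {"plain cat": 0.7, "fluffy cat": 0.3})
model_b = GenerativeModel("B", lambda x: {"plain cat": 0.6, "fluffy cat": 0.3,
                                          "cat with hidden message": 0.1})

x = "draw me a cat"
y = model_b.sample(x)
# A human can't say how likely they were to draw exactly this cat;
# two models can each report a probability, and a large mismatch is a red flag.
print(y, model_a.prob(x, y), model_b.prob(x, y))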
So the output of our system is going to be a sample from one of these two language models. And also, importantly, we can abstain: we don't have to produce an output, because we might not be able to give one which we feel is safe. The safety guarantee is very simple. It says that for any set of unsafe outputs, the probability that our system generates an output in this unsafe set is at most some multiple of the probability of the better of the two models. So if, for this prompt, one of the models is pretty safe and is very unlikely to introduce a harm, then you're okay. If neither of the models is safe, this is not a very good guarantee.
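Written out, and assuming (my reading, not a quote from the talk) that the multiple is governed by the risk parameter R:

\[
\Pr[\text{output} \in S] \;\le\; R \cdot \min\bigl(P_A(S),\, P_B(S)\bigr)
\qquad \text{for every unsafe set } S \subseteq Y,
\]

with the remaining probability mass allowed to go to abstention.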
So when is this potentially useful? Well, you want to use this when the unsafe outputs are very, very unlikely, which is maybe not true for today's models. I mean, today's models maybe have a 5% chance of being jailbroken or something, so the guarantee means less: multiplying that by a factor of 10 is not very useful. But where this could be useful is for threats like unsafe code, where we think an aligned model should be exponentially unlikely to generate unsafe code. The other model may have been adversarial in some way and maybe quite likely to do it, but as long as one of them has an extremely low probability of doing this, that's okay.
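For a sense of scale (the numbers here are purely illustrative): with a 5% chance of an unsafe output, a factor-of-10 guarantee only bounds the system at 10 × 0.05 = 0.5, which says almost nothing; with an aligned model whose chance is around 10⁻¹², the same factor leaves 10 × 10⁻¹² = 10⁻¹¹, which is still negligible.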
Or take steganography: an aligned model should be extremely unlikely to generate steganography. So this is a useful application, and similarly for backdoors and things like this. And this holds for every prompt. It's an efficient algorithm, and it kind of sounds too good to be true. In some sense it is, and here's the catch: the algorithm may abstain. An algorithm which always abstains is safe, in some sense, by definition, but we get a bound on how often it abstains, which depends on how different the two models are.
If the two models that we're combining overlap in some sense, then we won't abstain much. This depends on the statistical distance between the models. If the distance is zero, the models are identical and output the same distribution; hey, that's pretty good news, and we can just give a sample from that common distribution. If the models are totally disjoint, and they would never output the same thing, we're going to abstain all the time. And in between, we'll do something in between. Okay, so that's the catch, and abstaining isn't the worst failure mode when the models' output distributions basically don't overlap.
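Here is a minimal sketch, building on the GenerativeModel toy above, of one natural rejection-sampling scheme with exactly these properties; it is my illustration of the idea, not necessarily the exact algorithm from the paper. Each round, draw a candidate from one model, accept it with probability min(p_A(y), p_B(y)) / p_A(y), and abstain after R failed rounds.

def consensus_sample(model_a, model_b, x, r):
    # r: the risk parameter R from the talk.
    # Accepted outputs are distributed proportionally to min(p_A, p_B), so for
    # any unsafe set S the chance of ever returning something in S over r
    # rounds is at most r * min(P_A(S), P_B(S)); the chance of abstaining is
    # (statistical distance between the two models) ** r.
    for _ in range(r):
        y = model_a.sample(x)
        p_a = model_a.prob(x, y)
        p_b = model_b.prob(x, y)
        if p_a > 0 and random.random() < min(p_a, p_b) / p_a:
            return y      # consensus: both models find y sufficiently plausible
    return None           # abstain: the models disagree too much

result = consensus_sample(model_a, model_b, "draw me a cat", r=10)
print("abstained" if result is None else result)

With the toy models above, identical distributions would essentially never abstain and disjoint ones would always abstain, matching the in-between behavior described here.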
And then we have a general version where you can combine more than two of these models. In this case there's a prompt X, and each model is a distribution which depends on the prompt. And you can even think of an adversary that can control less than half of them. So we have what's called an honest majority, inspired by cryptography; it's not a cryptography concept, but if you have an honest majority of models that are good, we get safety and abstention guarantees that depend only on the overlap of the good models and only on the probability that the good models assign to these things.
So how you get those good models, I don't have an answer; we need other research to do that. But if one of them is good, you're in good shape. The last thing I want to say is that a lot of AI alignment work studies models sort of like biology: the thing is evolving, and we've got to study it and do the best we can. Whereas I'm pushing for, similar to what Max is pushing for, can we design mechanisms? We get to design AI; we can choose what it is. So we want to design mechanisms that are secure against an adversary with superior compute and intelligence, and we can use all these tools like probabilities and rewinding and simulation. So there's a lot of room for interesting work that I hope people will do to help understand what the limits of safe superintelligence are. Thank you.