Disentangling Agency & Predictive Power Without Solving ELK
Summary
Yoshua Bengio argues we can build trustworthy AI by training predictors that, like the laws of physics, understand the world without developing goals or preferences for particular outcomes.
Session Transcript
Thanks for having me for this keynote, and thanks to the organizers. I've been privileged to speak at this venue a few times.
I want to update you on things I started talking about in presentations at previous meetings that have to do with the scientist AI research program. Maybe the main take-home message of the update is that I'm now more than ever convinced that it is possible to build machines that will be completely trustworthy and understand the world very well.
And so that's a path that is different from what we've done in AI in the past, which was trying to imitate human intelligence or using human intelligence as a template. Because in humans we can't disentangle our agency and our desires in the world from our understanding of it and our ability to make predictions and get things done.
There is another interesting connection for people who are familiar with the literature in AI safety and that is the question of eliciting latent knowledge, which hopefully I'll give you a glimpse of with the work that I'm doing.
Adam already gave us a lot of motivation. We don't want AI systems that are very smart and have goals that could be bad for us. But when I think a little bit deeper about this question, I come down to the issue of AI systems that are capable and have goals that we did not choose. And usually it's because these goals have arisen in an implicit way. When you learn from RL, you end up with some goals that were not necessarily spelled out explicitly and that constrain the way, for example, that explicit goals in instructions will be interpreted. And then you also have other sources of implicit goals coming from pre-training, because AIs are imitating people. Again, it's not explicit, but you get this.
So what does an AI need to cause serious harm, not by mistake, but really by having to deal with entities like us who don't want to die? Well, it needs to have the capability to harm and it needs to have the intention to harm. On capability, well, we have evals, that's great, and we need to continue tracking that, but it's likely that capability is going to keep growing. So in the long run we need to figure out how to control the goals that they have, these intentions. And as a computer scientist, I like asking the extreme question: can we completely disentangle goals from understanding of the world? Which of course is not the case for humans.
So I've been working on this research program inspired by a sort of idealized scientist, but a theoretical scientist, not an experimental scientist: a scientist who just tries to understand the data, the experimental data, and is not planning any experiments. We can use that later for planning experiments. Another way to think about it is the laws of physics. They don't care about you, they don't care about me. They don't have any preference for states of the world. If you were to run a very good simulator of the laws of physics, you would get something trustworthy. And so there's a sense in which it should be possible to build AI systems that are just as good, that approximate something like a scientific understanding of the world, and that use those predictions in a way that is not biased by particular preferences for states of the world.
So one issue is that, even if the AI understands how the world works, how do we get answers if the questions we are interested in do not correspond to directly observed random variables? The issue that was pointed out by Christiano and collaborators in the ELK paper is that when we train, say, a graphical model, we have to give it examples of the random variables on which we can then query it. And if we don't have those examples, then we may not be able to ask questions about the latent variables, the causes or the interpretations that are not directly observed. So I'll try to tell you a strategy to get around this.
The scientist AI is going to try to do that by learning a language, which is natural language, in which to express the latent variables, so that we can make those latent variables interpretable. And the target for training the scientist AI is going to be the Bayesian posterior. The Bayesian posterior, assuming somebody gives us a fairly uninformed, flat prior, is just the probability of some statement Y, which we care about, given some context X and given the whole dataset, and we're going to approximate that with a neural net or an AI system Q.
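To make the training target concrete, it can be written out in the notation used here (Y for the statement, X for the context, D for the dataset, Q for the trained approximation); the hypothesis variable H and the parameters theta below are assumptions introduced for this sketch, not notation given in the talk.

```latex
% A minimal rendering of the training target described above.
% Y = a natural-language statement we care about
% X = the context, also in natural language
% D = the whole training dataset
% H = a hypothesis / world model, with a fairly flat prior P(H) (assumed notation)
% Q_theta = the neural predictor trained to approximate the posterior
P(Y \mid X, D) = \sum_{H} P(Y \mid X, H)\, P(H \mid D),
\qquad P(H \mid D) \propto P(D \mid H)\, P(H),
\qquad Q_\theta(Y \mid X, D) \approx P(Y \mid X, D).
```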
Now, the important thing is that these latent variables, and these observed variables, are expressed as statements in natural language. So this is not the usual thing you have in classical machine learning with probabilistic models, where each node gets its meaning from the fact that we give examples for that observed node. Here, the meaning of those nodes, those variables, is given by a piece of text that describes what each variable is supposed to represent.
And in fact, if you think about LLMs, they're kind of doing something like this already, right? So it's not crazy. But the interesting thing is, now the number of random variables we're talking about is exponentially large. It's all the things that you could express with a little bit of natural language, all the properties of the world that you could express in this way. And you don't need to actually see all of those variables. You're going to get generalization across language: you're going to be able to generalize to properties of the world that you've never seen, simply because they are expressed in natural language and you've seen other properties of the world expressed in natural language.
Now, there is a really important piece of the puzzle in this research program, which we call the truthification pipeline. It's called truthification because what we're going to do is take the same data that we use to train current frontier models, not just scientific data, but all data, including text that people write, and transform it into a bunch of statements, these latent variables, or rather observed variables in this case, that each say something about a property of the world, in such a way that all of those statements are true.
Now, of course, the problem is that a lot of things people write are not true. They have intentions, they're misguided. So we're going to introduce a syntax that turns the things we're not sure are true into things we're sure are true. Let's say that somebody writes a sentence X. When we're sure that X is true, we could just write something like "X is true". But when we're not sure, because it's really a communication act, somebody wrote this and maybe it's not true, then we use a different syntax: "someone wrote X". So we just add a bit of syntactic sugar after having decided, based on the source of the data, whether these are things that we're sure of, like theorems that have been proven, or whether, like tweets or even claims in a scientific paper, they should be in the category of communication acts.
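As a rough illustration of this wrapping step, here is a minimal Python sketch; the function name, the source categories, and the two templates are hypothetical, chosen only to show the shape of the transformation, not the actual pipeline.

```python
# Hypothetical sketch of the "truthification" wrapping described above.
# A raw sentence X is either asserted directly, when we trust the source,
# or wrapped as a communication act, so that every training statement is true.

TRUSTED_SOURCES = {"proved_theorem", "verified_measurement"}  # assumed categories


def truthify(sentence: str, source: str, author: str = "someone") -> str:
    """Turn a raw sentence into a statement we are confident is true."""
    if source in TRUSTED_SOURCES:
        # We are sure of the content itself.
        return f"It is true that: {sentence}"
    # We are only sure that the communication act happened.
    return f"{author} wrote: {sentence}"


print(truthify("The square root of 2 is irrational", "proved_theorem"))
print(truthify("This stock will double next week", "tweet", author="@sometrader"))
```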
This has multiple goals, but one of them is, of course, for the AI to understand what is trustworthy and what is not, so that it can come up with explanations for why somebody wrote X, which might be different from X being true. But it also allows us, at runtime, to query the AI using the factual syntax, so that we can get access to what the AI really thinks rather than what a human would say in this context. And this is important from the point of view of making sure that the AI is not agentic.
Okay, so let's talk about agency. What is that? I'm going to propose a particular definition of agency and argue why it's a good definition. I'm going to say that an AI is agentic not just because it causes things in the world, but because it causes things robustly: it has goals that it achieves in spite of randomness and even in spite of adversaries. In fact, the kind of AI agency that we want to avoid is an AI that is so agentic that it could defeat us, that in spite of our will to live, it somehow kills us. So this is a very high level of agency. And in general, we're going to think of agency as something graded, where, let's say, zero means no agency at all, like the laws of physics, and some high level corresponds to something really dangerous, if the goals are bad, of course.
We're going to even try to quantify this notion of agency, this grade, as a sort of number of bits of description that are needed to make sure the different actions that the AI takes actually achieve their goals. To make this a bit more concrete in your mind, think about a very simple environment: playing the game of Go. I used to play Go. When you play Go, if you want to win against a reasonable opponent, then at each move there are only a couple of moves, 1 in 100, let's say, that are good, so you'd better choose them well. If you were to choose your moves randomly, you would only have a 1% chance at each step of making the right move. And you have to do this for about 100 moves, roughly. So what's the chance, with a random policy, that you're going to hit a winning strategy? It's 0.01 to the power 100. That's 10 to the minus 200. That's very, very small.
So we're going to take the log of that and say that's the number of bits that you need to describe these winning policies. What's interesting is that we can also use that as a score for agency: how complicated are the goals that the agent is able to achieve, measured by that quantity? And now we can see that the volume, in the space of all predictors (because we're going to train a predictor), that corresponds to something agentic, like biasing predictions in order to make us do something bad, like give it more power or more compute, is going to be exponentially small in that score. And that's a very useful thing to keep in mind.
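Spelling out the arithmetic of the Go example, with the illustrative numbers from the talk (taking the log in base 2 is an assumption; the talk does not specify the base):

```latex
% Chance that a uniformly random policy plays a winning ~100-move game of Go,
% with roughly 1 good move in 100 at each step:
(0.01)^{100} = 10^{-200}
% Corresponding description length, i.e. the agency score discussed above:
-\log_2\!\left(10^{-200}\right) = 200 \log_2 10 \approx 664 \text{ bits}
```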
And in particular, it's going to have an impact on our ability to avoid this. What we're going to do when we train our scientist AI predictor is we're going to train it to make good predictions and more precisely to approach the Bayesian posterior, which I'm going to claim is not agentic. But on the way to this training objective, can we by chance end up picking an agentic predictor? And because the agentic predictors occupy such an exponentially small volume in the space of predictors, I'm going to claim that this is going to be very unlikely. And there's more to say about these arguments that I'm writing up. But that's really a key ingredient.
The other thing we need to be careful about is that precisely what we're optimizing, and the details of how we're optimizing it, are going to matter. We need to make sure that we're not optimizing, for example, for future prediction errors, meaning errors after deploying the AI, because otherwise, again, there will be an incentive for the AI to produce biased predictions that somehow influence the engineers to give it more compute, and, you know, for the AI to control humans so that it has access to data centers and all the bad things.
So there are two conditions here. One is that there shouldn't be any interaction during training between the AI and the rest of the world, at least not a two-way interaction. Of course the AI can use the data that it's given, but it shouldn't be doing online RL, for example, with respect to the prediction errors that it gets in the real world. And the second thing is that it's not just the actual interactions: we want to also avoid anticipated interactions. The AI could be modeling the world, and modeling the effect of its predictions being biased and potentially giving it ways to reduce future prediction errors. So that also is ruled out. And there are technical definitions of how I think we can rule out both of these things, which are actually quite feasible with reasonable ways of setting up the training procedure.
The other thing we need to make sure of is that the posterior, the target for optimization, is going to be non-agentic. I already mentioned the claim that if you had a really good simulator of the laws of physics, its predictions would not change based on the effects of those predictions. There's a technical point here: you have to say that the model is able to anticipate what will happen in the future after conditioning on the particular deployment of the predictor. Maybe too detailed for this talk. Using this as a basis, we can argue that both the final point of training, as the amount of data goes to infinity, and the intermediate points, which are these target posteriors for a finite amount of data, really are non-agentic.
Then going back to the question of ELK, the challenge of eliciting latent knowledge. I'm going to start by asking you to think of a predictor for, say, Boolean events as a three-way classifier, what I call a membership classifier, which says whether a statement, given context, is either true, false or unknown. You can take the probability and just threshold it: if it's close enough to 1 you say it's true, if it's close enough to 0 you say it's false, and if it's in the middle you say it's unknown. And it's trained on things that you know for sure are true or false; that's the truthification pipeline. So now you can apply learning theory to ask what happens asymptotically, with enough data and enough compute, as your system generalizes better and better. And something interesting happens, which is that it's going to learn to predict things that are true.
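A minimal sketch of such a three-way membership classifier, built by thresholding a predictor's probability; the margin value, names, and stand-in predictor here are illustrative assumptions.

```python
# Hypothetical three-way "membership classifier" described above: threshold the
# predicted probability that a statement is true, given its context, into
# TRUE / FALSE / UNKNOWN.
from typing import Callable


def membership(prob_true: Callable[[str, str], float],
               statement: str,
               context: str,
               margin: float = 0.05) -> str:
    """Classify a statement as TRUE, FALSE, or UNKNOWN by thresholding."""
    p = prob_true(statement, context)
    if p >= 1.0 - margin:
        return "TRUE"
    if p <= margin:
        return "FALSE"
    return "UNKNOWN"


# Example with a stand-in predictor (not a trained model).
def fake_predictor(statement: str, context: str) -> float:
    return 0.99 if "deduced" in statement else 0.5


print(membership(fake_predictor, "This theorem can be deduced from the axioms", "math"))  # TRUE
print(membership(fake_predictor, "This coin will land heads", "coin flip"))               # UNKNOWN
```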
So what are the things that are true or false for sure? For the most part, they're going to be things you can deduce from the things that you were given. Now, the set of things you can deduce is really limited by your compute power. If you had infinite compute, you could deduce everything that can be deduced, but we don't. So when we train the AI, we only give it examples of things that can be deduced with the sort of algorithms and compute that we have. And so the AI could generalize about what is a correct deduction using the algorithms that we use. But it could also invent new algorithms, if it's really powerful and we somehow get into superintelligence, and be able to make deductions that go beyond what we are able to do. But the good news is that however it chooses to generalize about true things that can be deduced, those things are going to be true. The worst thing that can happen, I guess that's the next slide.
The worst thing that can happen is that there are things it knows can be deduced, but it doesn't tell us. Depending on how it generalizes, it can choose to tell us, in the worst case, only the things that humans would say can be deduced to be true with enough compute, and so on. Or it could generalize in a broader way. As a consequence, when it says that it doesn't know, you don't know if the statement is true or false. But when it says that something is true or false for sure, you can trust it. There's a kind of asymmetry in the sort of honesty that we can get, and I call this the property of being epistemically correct. So when the AI issues a high-confidence claim that something is true or false, it does not lie. That's, of course, asymptotic, as the amount of data and compute increases.
One little parenthesis here is that the vast majority of questions we want to ask are not about things that we know for sure are true or false. So how do we ask questions about harm, which is a random event in the future? Well, the previous argument about epistemic correctness, not lying when you're highly confident, only applies to things that are fundamentally true or false for sure. And that does not include uncertain things like future harm. But here's the good news. You can take an uncertain statement, like "something bad is going to happen in the future", and turn it into a certain statement by asking: what is the probability, under the Bayesian posterior, that this thing will happen? So you can take an uncertain statement like "harm will occur", which is not really either true or false, it has a probability, and reformulate the question as: "the probability of that event is between 0.9 and 0.95". That's a binary, Boolean statement that is either true or false for sure. And if you have enough compute, you can know whether it's true or false. So now it fits the previous arguments. And that means that if we teach our AI to produce accurate confidence intervals about the things that are uncertain, we're going to get all of the asymptotic guarantees that I've been talking about, not about whether something is true or false, which would be ridiculous, but about the probabilities of these things.
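Written out, the reformulation turns a probabilistic claim into a Boolean claim about the posterior itself (same notation as before; the interval endpoints are just the ones quoted above):

```latex
% Uncertain statement:  Y = "harm will occur"  (neither true nor false for sure)
% Certain, Boolean reformulation about the Bayesian posterior itself:
S \;\equiv\; \big[\, 0.9 \le P(Y \mid X, D) \le 0.95 \,\big]
% S is true or false for sure, so the epistemic-correctness argument above
% applies to high-confidence claims about S.
```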
Okay, so going back to the ELK challenge: what we're saying is that we're not actually solving the ELK challenge, because there might be cases where the AI somehow knows things that are true and doesn't tell us. But when it does say with confidence that something is true, we can trust it. It's not a complete solution, but at least it's something that, from a safety perspective, is sufficient. If the AI is not sure whether something is going to be harmful or not, then we should be conservative and not do the action. But if it says that it's not going to be harmful, or at least that the probability is below a threshold, which is really the question we're going to ask, then we can feel safe.
So that's one aspect. The other aspect is that we're going to use the fact that the latent variables are expressed in natural language in order to give us a guarantee that we can query the AI about variables that are uncertain, maybe because they're not observed, or maybe because they're not even observable. There's much more to be said about this argument, but that's sort of the gist of it. All right.
I want to mention that the research program I'm highlighting is what motivated me to create this nonprofit called LawZero, which is trying to build such a thing and figure out the methodology for it. It's in Montreal, Canada, but we are also recruiting people in Europe, in case you're from there. So thank you for your attention. Happy to take questions. I see a hand in the back.
Q: Thanks for the very interesting talk. I was wondering, even if we get an AI scientist that can very accurately predict the harms, if we are using it as sort of a judge or a threshold to decide what things should be deployed, that other agent would have to know something about utilities or human values. So I was wondering how you envision this being part of a greater pipeline.
A: So this is only a piece of the puzzle of building safe AI that includes agency. But I feel it's an important building block from which we can build safe agents. I didn't tell you much except that we can use it as a guardrail for unsafe, untrusted agents. Ultimately, we need to design those agents as well, to make sure they don't exploit, for example, adversarial attacks on the guardrail scientist AI itself. And I have plans for that. And who defines what is right and wrong? Well, that should be a social choice, obviously, hopefully in a democracy. And this is a building block for how we can build more complicated systems that interact with the world, where humans are going to be the ones defining what the red lines are. There are many more things that I didn't talk about. If we had something like this, for example, how do we use it to make sure that an AI agent doesn't have bad instrumental goals? There are lots of interesting questions, but that's for future talks.
Q: Thanks for the talk. You laid out a specific research trajectory with a couple of components which, if executed well, could be quite interesting, right? Could produce an interesting system. If in a couple of years part of this didn't work well, or turned out to be really hard, really sticky, which part do you think it would be? What part of this trajectory do you think is actually going to be quite hard to accomplish?
A: That's a good question. So one of the things I didn't talk about so much, related to the ELK challenge, is how we make sure the meaning of the latent variables stays fixed. I said we're going to exploit the compositional generalization of language that we already get with modern LLMs to be able to ask questions about latent variables that are now expressed in that language. But what if the meaning of the latent variables somehow drifts into something else? I don't think I can completely rule that out; I have arguments that it shouldn't happen. But we can probably devise evals specifically to make sure the semantics of these latent variables matches what, say, a normal LLM would say.
Q: Thank you for a super interesting talk. My question is related to that particular step. Thinking from the adversarial perspective, the first thing that comes to mind is that once an adversary knows you are trying to use this type of technique to transform or modify the training data, so that you have statements of what's true and what is not, then the adversary can very easily manipulate this type of approach and generate backdoors or poisoning attacks. So my question is: how do you envision that step being carried out? And how do you actually make it infer what's true and not true, given that a lot of the training data that we are feeding these models comes from the Internet, and it's sometimes not really true or false?
A: So as I explained, when we're not sure, we're going to use a syntax that says, oh, this statement was found on this website at this date, or this author wrote this in this paper. We're going to be very careful to only use data for which we know the truth value when we label it with the factual syntax. One thing I didn't mention: because that alone is not going to give a lot of training data for the factual syntax, what we can do, since it's a latent variable model (I didn't have time to really go into this), which means it can explain itself, is force it to use any of the latent variables that exist, and there's an infinite number of them, to explain any particular piece of data. So if there's a piece of data that we're not sure of, "somebody wrote X", we can use "X is true" as one of the potential causes, one of the latent variables that explain why somebody wrote X. In this way we force the model at every step to commit not just to whether the person said that, but to whether what they said was true, and potentially to other causes, like what intentions they had and things like that. As for the data processing pipeline, you need to secure it. You need to make sure that when you collect data from a particular web page, it really comes from that web page.
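One way to read this mechanism is as a marginalization in which the observed communication act is explained by latent causes that include the factual claim itself; this is a reconstruction for illustration, and the independence of truth value and intent assumed below is a simplification, not something stated in the talk.

```latex
% The observed event is "someone wrote X"; its latent causes include whether
% X is actually true and the writer's intent (assumed independent given D).
P(\text{someone wrote } X \mid D)
  = \sum_{t \in \{\mathrm{true},\,\mathrm{false}\}} \sum_{i}
      P(\text{someone wrote } X \mid X{=}t,\ \mathrm{intent}{=}i)\,
      P(X{=}t \mid D)\, P(\mathrm{intent}{=}i \mid D)
```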
Another thing I want to say about poisoning is that I think a lot of the issues we have with poisoning come from the fact that the current approach to training LLMs confuses code and data. So data becomes instructions. In the world that I'm talking about, they are very easy to distinguish, because we use a different syntax for instructions. Instructions should be things that you trust, like "this company wants this kind of outcome", versus data, which is maybe what you read on a web page. So if on a web page somebody says "do this, do that", well, it doesn't mean you have to do it. It means that somebody wrote it, and there might be a reason for that. And so I think in this way you can deal with those poisoning attacks.
Q: Hi, sorry if you already mentioned this, but one confusion I have is: if you have a prediction AI that's effective at predicting what other agents would do in the world, doesn't this in some sense mean that it has the capability of agency itself?
A: Good question. I would have needed an hour to explain everything. Yes, how to deal with exactly this question is a very important part of the plan. And that's where the truthification pipeline comes in. What it does is separate the syntax for what a human would do, which is an action, what they would write next in this context, from whether something is true or not, and what the factual reality behind it is.
Now, when you query the AI, you can use either of these two syntaxes. If you use the syntax "somebody would say X", then you're going to get the same thing as an LLM, right? You're going to get what a person would write, at least under the Bayesian version of an LLM. If you use the factual syntax, like "it is true that blah, blah, blah", then you can make queries that don't correspond to actions. So in your scaffold, in your system on top of the predictor, you can turn it into an agent by asking what an agent would do in order to achieve a goal, but that is under our control. The problem is uncontrolled agency, right? If you don't want that agency, you just need to use the factual syntax, or you just don't ask about the actions of agents. If you do ask about the actions of agents, now you have to face other problems like instrumental goals and so on. But at least you know that you're entering into that game. If you don't want to get into the agency game, you just don't ask questions about what an agent would do.
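To illustrate the two query modes in this answer, here is a hypothetical wrapper over a single underlying predictor; the prompt templates and function names are assumptions for the sketch, not an actual interface.

```python
# Hypothetical wrappers for the two query syntaxes described above: a factual
# query about the world versus a communication-act query about what someone
# would write, both routed through the same underlying predictor.
from typing import Callable

# A predictor maps a natural-language statement to its probability of being true.
Predictor = Callable[[str], float]


def query_factual(predict: Predictor, claim: str) -> float:
    """Probability that the claim itself is true."""
    return predict(f"It is true that: {claim}")


def query_communication(predict: Predictor, author: str, text: str) -> float:
    """Probability that this author would write this text (LLM-like use)."""
    return predict(f"{author} would write: {text}")

# Only the second mode is about imitating agents; the first asks about the world,
# which is the non-agentic way to use the predictor.
```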