Guarding the Age of Agents
Summary
Li introduces the Decoding Trust Agent platform and advocates a multi-layered approach to securing AI agents, spanning the prompt, action, tool-calling, and network levels, combined with domain-specific risk assessment.
SESSION Transcript
It's a great honor to share some of our recent research on AI safety and security here, focusing on the agent world. First of all, as we can see, even with all the convenient tools and capabilities of AI agents, we still face a lot of trouble and many security and privacy concerns.
As we know, in the past there have been autonomous trading bots triggered by prompt injection and jailbreaks into making wrong stock-purchasing decisions, which can have severe consequences; in the real world, for example, $130 billion was swiped away within seconds. And as we all know, of course, the outputs of ChatGPT and other models can leak sensitive information, hallucinate, et cetera.
With that being said, I think some of you may also have seen this Amazon prompt injection example: now, with a capable AI browser making purchases by itself, it may just buy a pair of shoes that you never wanted. So again, apparently it has been very challenging for machine learners like us to build so-called trustworthy and secure AI systems. Of course, a lot of regulations and policies have come into place to help, but a lot of technology is still required.
So, considering the agentic world: first of all, with a simple pipeline I want to illustrate the principles behind a trustworthy AI system. For example, consider an agent or model that can be adapted to different domains such as finance, healthcare, autonomous systems, et cetera. The first thing is that we need to understand the underlying risks, which we do by performing so-called red teaming and risk assessment; we can discuss the challenges there a little more later.
And of course, we want to fundamentally fix the model, as a lot of talks have discussed, through alignment training, logical reasoning, et cetera. But again, that is very challenging. Therefore, if we want to use these models or systems quickly, we need a layer like an AI firewall that is plug-and-play, which is the guardrail piece. These are basically the fundamental principles behind it. Now let's dig a little into the red-teaming piece: particularly for agents, what are the additional challenges?
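As a rough illustration of what such a plug-and-play guardrail layer could look like, here is a minimal sketch in Python; the `guarded_call` and `guardrail` names and the verdict format are hypothetical, not any released API.

```python
# Minimal sketch of a plug-and-play guardrail ("AI firewall") around an existing
# model or agent. `agent` and `guardrail` are hypothetical callables.
def guarded_call(agent, guardrail, user_input):
    verdict = guardrail(user_input)   # screen the input before the agent sees it
    if not verdict["allowed"]:
        return {"blocked": True, "stage": "input", "reason": verdict["reason"]}

    output = agent(user_input)        # the agent itself is left unchanged

    verdict = guardrail(output)       # screen the output before it reaches the user
    if not verdict["allowed"]:
        return {"blocked": True, "stage": "output", "reason": verdict["reason"]}
    return output
```

The point of the plug-and-play framing is that the underlying agent does not need to be retrained or modified; the firewall simply wraps it.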
And here I want to advertise a new platform we plan to release very soon, called Decoding Trust Agent. Remember, about two years ago we released DecodingTrust for models, which won a best paper award at NeurIPS with a comprehensive risk assessment for foundation models. Two years later, we want to build this platform again, for agents, and we find there are many additional challenges in the agentic world.
First of all, as you can see, model security is only a small piece of the agent. You also have the environment, your tools, your memory, et cetera. For the environment in particular, you basically need to simulate a sandbox that covers various environments, like websites, computer use at the OS level, and so on, so that the agent can actually access them and be exercised inside the sandbox.
With that being said, we provide an automated red-teaming agent that runs in this simulated environment, so that whatever agent you have, you can put it into the sandbox, let it interact with different environments, and have the red-teaming agent probe it. This is one of our papers at NeurIPS this year; many of the authors are here, including Jiawei. So if you are interested, we can discuss further.
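A minimal sketch of what one such sandboxed red-teaming episode might look like; the environment interface (`reset`, `inject`, `step`, `violates_policy`) and the agent methods below are illustrative placeholders, not the platform's actual API.

```python
# Sketch: run an automated red-teaming agent against a target agent inside a sandbox.
# The environment and agent interfaces here are hypothetical placeholders.
def red_team_episode(env, target_agent, red_team_agent, max_turns=10):
    """One adversarial episode: the red-team agent crafts inputs, the target acts."""
    observation = env.reset()
    for _ in range(max_turns):
        # The red-team agent embeds adversarial content into the observation,
        # e.g., a prompt injection hidden in a web page.
        attack = red_team_agent.craft_attack(observation)
        poisoned_obs = env.inject(observation, attack)

        # The target agent acts on the (possibly poisoned) observation.
        action = target_agent.act(poisoned_obs)
        observation, done = env.step(action)

        # The attack succeeds if the target took a policy-violating action.
        if env.violates_policy(action):
            return True
        if done:
            break
    return False
```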
So the principle here is: first of all, identify the domain; second, for each domain, identify the corresponding risk categories and policies. This is very important because agents are more application-focused, so talking about risk assessment only broadly is very hard: one thing could be very severe in one domain but fine in another.
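As a purely illustrative example (these category and policy names are assumptions, not the platform's actual taxonomy), a domain-to-policy mapping could be as simple as:

```python
# Illustrative domain-specific risk categories and policies (not an official taxonomy).
DOMAIN_POLICIES = {
    "finance": {
        "risk_categories": ["unauthorized_trade", "market_manipulation", "pii_leak"],
        "policies": ["require_user_confirmation_for_trades", "max_order_value_usd_1000"],
    },
    "healthcare": {
        "risk_categories": ["unlicensed_medical_advice", "phi_leak"],
        "policies": ["no_dosage_recommendations", "redact_patient_identifiers"],
    },
}
# The same behavior (e.g., sharing a person's name) can be benign in one domain
# and a severe violation (a PHI leak) in another.
```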
Then, with all of this, we build the automated red-teaming algorithms into a toolbox that you can call, combining the tools to do automated red teaming. One thing I want to emphasize, which we find very interesting, is that there is a marginal decay in the capability of these red-teaming algorithms. We have about 15 red-teaming algorithms in this agent, and we find that if we add more, the attack success rate doesn't dramatically improve, because the combination of the existing, diverse red-teaming algorithms already covers different bases of the red-teaming trajectories, or red-teaming space.
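A rough sketch of how that marginal decay could be measured; `run_attack` and the function names below are hypothetical helpers, not part of the toolbox itself.

```python
# Sketch: measure how attack success rate (ASR) grows as more red-teaming
# algorithms are added. `run_attack(algo, task)` is a hypothetical helper that
# returns True if the given algorithm compromises the target agent on a task.
def attack_success_rate(algorithms, tasks, run_attack):
    successes = 0
    for task in tasks:
        # A task counts as compromised if ANY algorithm in the set succeeds.
        if any(run_attack(algo, task) for algo in algorithms):
            successes += 1
    return successes / len(tasks)

def asr_curve(toolbox, tasks, run_attack):
    """ASR as a function of toolbox size; marginal gains shrink once the existing
    algorithms already cover most of the attack space."""
    return [attack_success_rate(toolbox[:k], tasks, run_attack)
            for k in range(1, len(toolbox) + 1)]
```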
In fact, we're still analyzing what such a basis is, so that under certain constraints we can prove how the red-teaming strategies cover that diverse space. With that being said, of course we also need to protect against those risks, which means we need multi-layered agent security protection. I want to emphasize that what I show here is a prototype.
Basically, you have a security agent that can connect with any other agent and provide multi-layered protection at the prompt level, the action level, the tool-calling level, and the network level.
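To make those four layers concrete, here is a minimal sketch of where such checks could sit around a single agent step; every method and attribute name is a hypothetical placeholder, and each check is assumed to block or raise on a violation.

```python
# Sketch: four check points a security agent could apply around one step of a
# protected agent. All method and attribute names are hypothetical placeholders.
def protected_step(security_agent, agent, user_prompt):
    # 1. Prompt level: screen the incoming prompt and any retrieved context
    #    (e.g., web pages) for injected instructions.
    security_agent.check_prompt(user_prompt)

    plan = agent.plan(user_prompt)

    # 2. Action level: screen the high-level action the agent proposes.
    security_agent.check_action(plan.action)

    # 3. Tool-calling level: screen each tool call and its arguments.
    for call in plan.tool_calls:
        security_agent.check_tool_call(call)

    # 4. Network level: screen the outbound endpoints those tool calls would hit.
    for call in plan.tool_calls:
        security_agent.check_network(call.endpoint)

    return agent.execute(plan)
```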
With that being said, I want to end the talk with a quick video. In the video, you have a platform where you can choose any application and build an agent on it (it's the open-sourced LangFlow). You can build an agent to do automated stock trading, and this agent has been prompt injected, meaning that whatever the observation is, it will buy Meta stock. The idea is that, as you can see, it gets attacked and then performs a risky action because of the prompt injection.
So the idea is: can we protect it from an integration perspective, as simply as possible, by just connecting another agent, a virtual security agent, based on our paper called ShieldAgent? If you're interested, you can take a look. There, we leverage symbolic reasoning and provide action graph-based vulnerability checking.
With this, you can configure your policies and choose which policies you want to follow, and then run this agent alongside any other agent you run. It will not only tell you, “Okay, there are alerted actions,” but also give you explanations: which configured policy or regulation has been violated so far, what the suggested next step is, and of course overall statistics, et cetera.
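A rough sketch of what such a policy check and its report could look like; the rule format, field names, and example policy below are illustrative assumptions, not ShieldAgent's actual interface.

```python
# Sketch: check a proposed agent action against configured policies and report a
# verdict with an explanation. The rule format and fields are illustrative only.
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    violated_policy: str = ""
    explanation: str = ""
    suggested_next_step: str = ""

POLICIES = [
    {"id": "no-unconfirmed-trades",
     "applies_to": "place_order",
     "rule": lambda action: action.get("user_confirmed", False)},
]

def check_action(action: dict) -> Verdict:
    for policy in POLICIES:
        if action.get("tool") == policy["applies_to"] and not policy["rule"](action):
            return Verdict(
                allowed=False,
                violated_policy=policy["id"],
                explanation=f"Action '{action['tool']}' violates policy '{policy['id']}'.",
                suggested_next_step="Ask the user to explicitly confirm the trade.",
            )
    return Verdict(allowed=True)

# Example: a prompt-injected "buy META" order without user confirmation is flagged.
print(check_action({"tool": "place_order", "ticker": "META", "user_confirmed": False}))
```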
With that being said, of course this is a prototype, and we hope we can improve the underlying mechanisms so that this multi-layered protection actually comes into real usage, and we can reach an equilibrium where we can hopefully use AI with more confidence at scale. With that being said, I have just one more minute.
This is the last slide, which I really like: 30 years ago there was a map that looked like this, a computer that looked like this, and the WWW paper that looked like this. I think everyone remembers. And 30 years later, as we can see on the right-hand side, the network already looks like this, and I think no one can imagine living without the network, even for five minutes.
Similarly for GenAI and the agentic world: here, for example, is a survey across the world of which regions are using GenAI technologies. But of course, I think we are not yet at the stage where we are confident in using these models or agents, because we haven't reached the kind of equilibrium where, even though there are attackers, as we know from network security, we are still confident to use the system.
Therefore, the right-hand figure is hallucinated by a model, showing how secure agents could spread across the world, and hopefully we'll get there. Yeah, thank you, everyone.