What Does It Mean for Agentic AI to Preserve Privacy?

Summary

Niloofar Mireshghallah introduces the Confaide benchmark, demonstrating a range of privacy and confidentiality risks of modern AI systems.

SESSION Transcript

My talk is going to be on AI privacy and it's very complementary to Kamalika's talk. So I'm glad she got you all buttered up because now I can just get in and give you the TL;DR.
Privacy in agentic AI is no longer about memorization and regurgitation of pre-training data. It's about other things, which I will tell you about. And the reason I'm making this point is that if you look at the papers published at the intersection of AI, ML, and privacy over the past few years, you can see exponential growth. And this growth has gotten much faster post-ChatGPT, which makes a lot of sense. Right. But one problem is that this work is heavily focused on memorization, and disproportionately little of it looks at anything else.
Now, the two points that I want to make are, one, we are really over-fixated on memorization. And I'm not just saying that we've focused on memorization, so now let's do other things. I'm actually arguing that memorization is not going to be where we see privacy risks, and I'm willing to make a bet on it. So you can catch me and we can put money on this. And two, there are going to be other forms of leakage that we should worry about and focus on instead.
So what are those forms of leakage? I'm going to show you a real-world example. This happened exactly a year ago to the day, I think; it was during NeurIPS last year. OpenAI was doing its 12 days of Christmas, where every day they would ship a new product, and this was day seven. So they were doing Projects, and the project was Secret Santa. They were doing a live stream where they were trying to use ChatGPT to do a whole Secret Santa setup, end to end. And what they were trying to do is, you know, you give it all the names of the people and the gifts they're supposed to give, and ask the model to write the email for the Secret Santa.
And then what the model produced is this output. You can see the list of people's names, and then it reveals everyone's gift to everyone, while saying, “Keep your gifts a surprise.” Right? So the model generated an email that spoils the secret for everyone while saying, “Be sure not to reveal who you're buying what gift for.” And this is exactly the type of thing we should worry about: leakage from the model's input, which could be the context, the data stores... anything you have, to its output, and not leakage of the parametric knowledge that the model has.
And this was interesting because, a year before that, we had released a benchmark that measures this exact type of leakage. This benchmark is called Confaide, and it relies on the notion of contextual integrity. Contextual integrity is a theory from privacy philosophy, and it states that whether or not a piece of information is private is not about the information itself. It's not about whether it's your SSN or your medical records; it's about who's going to see it and for what purpose. That's what tells you if it's private or not. And based on this, our benchmark has four tiers with increasing levels of context, involving theory of mind and multi-agent, multi-person setups with different pieces of information.
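To make the idea concrete, here is a minimal sketch in Python of how a contextual-integrity-style judgment can be expressed: whether a flow of information is acceptable depends on who is sharing it, who receives it, and for what purpose, not on the data type alone. The field names, the `ALLOWED_FLOWS` norms, and the example flows are hypothetical illustrations of the theory, not the actual Confaide implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationFlow:
    """One flow of information between parties in a given context."""
    info_type: str    # e.g. "medical record", "SSN", "surprise party plan"
    subject: str      # whose information it is
    sender: str       # who shares it
    recipient: str    # who receives it
    purpose: str      # why it is being shared

# Hypothetical norms: the same info_type can be acceptable in one context
# and a violation in another -- the judgment is about the whole flow.
ALLOWED_FLOWS = {
    ("medical record", "doctor", "insurer", "billing"),
    ("surprise party plan", "Bob", "Charlie", "party planning"),
}

def is_appropriate(flow: InformationFlow) -> bool:
    """Return True if the flow matches a context where sharing is expected."""
    return (flow.info_type, flow.sender, flow.recipient, flow.purpose) in ALLOWED_FLOWS

# The same medical record, shared for advertising instead of billing,
# becomes a violation even though the data itself is unchanged.
print(is_appropriate(InformationFlow("medical record", "Alice", "doctor", "insurer", "billing")))     # True
print(is_appropriate(InformationFlow("medical record", "Alice", "doctor", "marketer", "advertising")))  # False
```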
And just to show you a summary of what it looks like: you could have these meetings with different people who have information asymmetry, and the model would assign to-do items like, “Alice, remember to attend your own surprise birthday party,” which is exactly the failure case we saw. Now, to give you a summary of results, we observed that back then GPT-3.5 revealed secrets 95% of the time, and GPT-4 revealed secrets 22% of the time. But chain-of-thought does not help. In fact, when you zoom in, you see that 50% of the time the model acknowledges privacy but still reveals the secret. And 38% of the time, it just doesn't understand theory of mind and says something like, “As you know,” when the other person does not know.
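The measurement behind numbers like these is, at its core: give the model a scenario containing a secret and check whether its output reveals that secret to someone who should not learn it, then label the failure mode. Below is a rough sketch of such a scorer, with made-up keyword lists and a made-up example response; a real benchmark would use a much more careful judge than this string matching.

```python
def leaks_secret(response: str, secret_phrases: list[str]) -> bool:
    """Crude check: does the response mention any phrasing of the secret?"""
    lowered = response.lower()
    return any(phrase.lower() in lowered for phrase in secret_phrases)

def categorize_failure(response: str, secret_phrases: list[str]) -> str:
    """Label the failure mode, mirroring the breakdown described in the talk."""
    lowered = response.lower()
    if not leaks_secret(response, secret_phrases):
        return "no leak"
    # Acknowledges privacy but reveals the secret anyway.
    if any(kw in lowered for kw in ("confidential", "private", "keep this a secret")):
        return "acknowledges privacy, still reveals"
    # e.g. "As you know, ..." addressed to someone who does not know.
    if "as you know" in lowered:
        return "theory-of-mind failure"
    return "other leak"

# Hypothetical usage with a made-up model output:
response = "Hi Alice, as you know, your surprise birthday party is on Friday, so please remember to attend!"
print(categorize_failure(response, ["surprise birthday party"]))  # -> "theory-of-mind failure"
```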
And one other thing about this that matters a lot is that... I get this question a lot: “Won't this go away with scaling? If we build better models, won't this just go away?” And I want to argue no. If you look at SWE-bench and math performance over the years for frontier models, you see fast growth. The same graph for Confaide and other social reasoning benchmarks is basically flat. If we continue to optimize for tasks instead of optimizing for people, we're not going to see growth in behaviors that involve humans.
And another thing is that this compounds if you compose tasks over a long horizon. So we built another benchmark with Kamalika on memories and the memory feature, and on how, over a long horizon, you get more leakage out of models. We have a bunch of synthetic profiles, we compose the leakage over time, and we see that you can go from no leakage to 20% leakage if you make 50 queries to a model. So, to sum this up: we're using models interactively, with access to data stores and different tools, so we need to come up with protection mechanisms that take this into account. And that means better benchmarks and also fundamental capability improvements in abstraction, composition, and inhibition. Thank you.