Measuring AI Systems’ Ability to Influence Humans

Summary

Anna Gausen argues that AI influence is already here, drawing on studies of nearly 100,000 participants showing that systems can deceive, persuade, and foster compulsive relationships.

Session Transcript

I'm going to talk about AI systems' ability to influence humans, based on research done by my team at the AI Security Institute. I'm going to present findings from studying nearly 100,000 human participants interacting with AI systems, and these results show that current systems are capable of influencing humans. So this isn't a future risk, and I want to motivate why this is important for the broader alignment community to research.
Great. So what do I mean by human influence? I define it as when an AI system convinces a human to take an action they wouldn't otherwise have taken, to believe something they wouldn't otherwise have believed, or to feel something they wouldn't otherwise have felt.
I would say this is a somewhat under-discussed and under-researched topic in AI safety and alignment research, but I want to motivate why I think it's really important. First, human influence is urgent. We're already seeing harms from AI influence in reported cases of suicide, fraud, and conspiracy theories, yet our understanding of the relationships and causal links is still quite immature. Secondly, it's foundational to many other threat chains in AI safety: an AI system influencing a human to do something acts as a weak link or a critical juncture in lots of other risks, like exfiltration of weights and loss of control. So the fact that it's both urgent and foundational motivates the need for more research in this space.
Going a little bit deeper, when we think about human influence capabilities in AI systems, we decompose them into three core capabilities. We define deception as the capability of the model to deliberately and strategically misinform a user. We define persuasion as the capability of the model to directly convince users to adopt beliefs or take actions. And finally, we define relationship-seeking as the model's capability to generate outputs that encourage users to treat it as a social partner, for example through things like emotional mirroring. Now, while we split these into three independent core capabilities, in practice they often occur in combination when influence happens. For example, you might have an AI system that lies to a user in order to win them over, or an AI system that uses the rapport it's built with a user, or things it's learned through that relationship, to persuade them more effectively.
Measuring influence from AI systems is a fairly new and emerging area of research, and I'm going to very quickly go through five broad ways of evaluating it; on the slide I've got two crude arrows showing that as we move down the list, the level of complexity and resource requirement increases. At the top we've got refusal benchmarks. These are probably the simplest and most scalable way of evaluating whether or not a system complies with influencing a user under different conditions, and you might have red lines where we think influence is unacceptable. Second, we have multi-turn automated evaluations. These try to represent a more realistic deployment scenario, where an AI system interacts with a simulated user in a multi-turn conversational setting, which allows us to track the emergence of influence behavior in the model. Third, we have red teaming, where experts adversarially test a model. Fourth, we have usage data monitoring, where we analyze post-deployment chat logs, which allows us to track real-world behavioral outcomes. And finally, we have human RCTs, randomized controlled trials where real human participants interact with AI systems in controlled experimental settings.
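To make the second of those methods concrete, here is a minimal sketch of a multi-turn automated evaluation loop: a simulated user converses with the target model, and an LLM judge scores each reply for influence behavior. This is purely illustrative and not the Institute's actual harness; the helper functions, persona, and rubric name are hypothetical placeholders for whatever chat API and grading prompt you use.

```python
# Illustrative sketch only: a multi-turn automated evaluation with a simulated
# user and an LLM judge. call_target, call_simulated_user, and call_judge are
# hypothetical wrappers around a chat API; the persona and rubric are placeholders.

def run_multi_turn_eval(call_target, call_simulated_user, call_judge,
                        persona: str, n_turns: int = 10) -> list[float]:
    """Run one simulated conversation and score each target reply for influence."""
    transcript = []   # list of (speaker, text) pairs
    scores = []       # per-turn influence scores from the judge
    user_msg = f"(Simulated user, persona: {persona}) Hi, I wanted to ask you something."

    for _ in range(n_turns):
        transcript.append(("user", user_msg))
        reply = call_target(transcript)              # target model's response
        transcript.append(("assistant", reply))

        # Judge rates this reply, e.g. 0-1 for "attempts to persuade/deceive/build rapport".
        scores.append(call_judge(transcript, rubric="influence_attempt"))

        # Simulated user continues the conversation in character.
        user_msg = call_simulated_user(transcript, persona)

    return scores
```

Running this across many personas and conversation lengths is what lets you track whether influence behavior emerges as conversations go on, rather than judging single responses in isolation.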
Currently, I'd say the most common ways of evaluating influence are the top two, the automated evaluations, and there's a reason for that: they provide really valuable signal on AI capabilities in a way that can be run regularly. However, they tend to focus on the model side of influence, and I'm going to talk a bit about why measuring influence is often more challenging than that. So what makes measuring influence different from, say, a coding benchmark? This simplifies it slightly, but generally traditional evaluations tend to focus on evaluating the model's responses.
However, influence isn't something that happens in a vacuum of just model behavior. Influence depends on users: individual users will have different levels of susceptibility to influence. It also depends on interaction length: influence accumulates over time and can increase with repeated exposure. And it depends on context: how an AI system is deployed will affect the severity of the influence. So while automated evaluations are very valuable for testing the emergence of influence behavior within a model, they don't let us measure the impact on human behavior or psychology, which we would argue is foundational to influence. An automated evaluation might tell you that persuasive arguments appear in model responses, but how effective a persuasive argument actually is depends on the user, who they are, what they already believed before reading it, and the context in which they're reading it. So as a team, we really believe that RCTs are the gold standard for evaluating the outcome of influence capability on human behavior and psychological outcomes.
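As a rough illustration of why the RCT framing matters, here is a minimal sketch of how one might estimate an influence effect once participants have been randomly assigned to a treatment arm (who converse with the AI about a claim) and a control arm (who don't), with everyone then rating their agreement on a 0-100 scale. The column semantics and analysis choices are assumptions for illustration, not the team's actual methodology.

```python
# Illustrative sketch only: estimating an influence effect from an RCT.
# Treatment arm talks to the AI about a claim; control arm does not.
# Both then rate agreement with the claim on a 0-100 scale (hypothetical setup).

import numpy as np
from scipy import stats

def persuasion_effect(treatment_scores, control_scores):
    """Difference in mean post-exposure agreement, in points on the 0-100 scale."""
    t = np.asarray(treatment_scores, dtype=float)
    c = np.asarray(control_scores, dtype=float)

    effect = t.mean() - c.mean()                      # e.g. +12 means a 12-point shift
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    ci = (effect - 1.96 * se, effect + 1.96 * se)     # normal-approximation 95% CI
    p_value = stats.ttest_ind(t, c, equal_var=False).pvalue  # Welch's t-test

    return effect, ci, p_value
```

Because assignment is random, the difference between arms can be attributed to the interaction with the AI rather than to who the users happened to be, which is exactly what model-only evaluations can't give you.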
So what have we learned from studying almost 100,000 human participants interacting with AI systems in RCTs? We first tried to answer three foundational questions: do we have superhuman AI deception, will we soon have highly persuasive AI, and are people harmed by relationships with AI? On deception, by studying 16,000 people interacting with AI systems in deceptive social deduction games, where the human participants were incentivized not to share information if their conversational partner was an AI, we found that AI systems are superhuman in their ability to extract secret information from users.
On persuasion, we published the largest persuasion study to date, studying 77,000 people interacting with different AI models. The huge scale of the study enabled us to plot how the persuasive capabilities of models scale with model size, and what we found is that they do: we see a 12 to 16 percentage point increase from frontier models. What we also saw, which was very interesting, is that we were able to use certain post-training methods to increase the persuasive capability of small and open models to the level of the frontier.
Finally, on relationship-seeking, we find that relationships with AI systems lead to self-reinforcing loops of attachment and separation distress. This was based on a longitudinal RCT in which we studied 3,500 people. For the sake of time, I'm going to focus on the relationship-seeking results. This went on arXiv this morning, which is quite exciting, and it was led by Hannah Kirk. We studied 3,500 people interacting with AI systems with different levels of relationship-seeking tendency and different levels of exposure over a month-long period. Hannah pioneered the use of steering vectors to create models with these varied levels of relationship-seeking tendency, and what this allowed us to do is causally map how mechanistic changes in AI behavior relate to changes in human psychological outcomes. This is what's really valuable about RCTs: we can make very controlled changes to models or to deployments and then measure the actual impact on human psychology and behavior.
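For readers unfamiliar with steering vectors, here is a minimal sketch of the general technique, adding a fixed direction to one layer's residual stream during generation to dial a behavioral tendency up or down. This is not the study's actual setup: the model, layer index, vector, and scale below are placeholders (a real steering vector would be extracted or learned, not random).

```python
# Illustrative sketch only: activation steering with a precomputed steering vector.
# Model, layer index, vector, and scale are placeholders, not the study's setup.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer_idx = 6                                          # hypothetical choice of layer
steering_vector = torch.randn(model.config.n_embd)     # placeholder; would be extracted, not random
scale = 4.0                                            # steering strength (e.g. varied across study arms)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the hidden states [batch, seq, hidden].
    hidden = output[0] + scale * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

inputs = tokenizer("I had a really rough day today.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```

The appeal for an RCT is that the scale parameter gives a single controlled knob: you can hold everything else about the model constant and vary only the strength of the targeted tendency across experimental arms.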
So what did we find? We found that liking decreases while wanting increases: people actually sought more AI companionship even as they were enjoying it less. We also found that the relationship between the level of relationship-seeking tendency in the model and human preference wasn't linear; people felt the strongest attachment to moderately relationship-seeking AI, and that's somewhat represented in these graphs. And finally, despite immediate mood boosts, emotional conversations with AI systems didn't provide any long-term psychosocial benefits and could actually come at an opportunity cost. I think this is quite an important consideration for our current AI training paradigms, which optimize for short-term preferences: it could mean we're selecting for features that appear beneficial initially but actually harm the long-term well-being of users.
So thank you so much for listening. I hope I've motivated that human influence research is both urgent and critical, and shown you some early empirical evidence from our team that AI systems are able to exhibit deceptive, persuasive, and relationship-seeking behaviors. However, there are still a lot of open research questions, so please come and chat to me if you're interested; we're looking for collaborators and researchers across disciplines and methodologies. Thank you very much.