Model Evaluation
Testing frontier models to uncover new risks and highlight security issues.
Evaluations (or model tests) serve as a mechanism for identifying risks and assessing whether a system can safely operate in real-world scenarios. Leveraging our research experience, FAR.AI focuses on testing frontier models to uncover new risks and highlight security issues, enabling developers to put in place appropriate mitigations for currently deployed systems. We explore trends in frontier models to identify which problems will grow increasingly severe over time and require urgent attention. We also work on developing metrics and benchmarks measuring reliability and security to provide clear targets for researchers, improving the transparency of future testing.
July 20, 2025
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Frontier LLMs Attempt to Persuade on Harmful Topics
Large language models (LLMs) are already more persuasive than humans in many domains. While this power can be used for good, like helping people quit smoking, it also presents significant risks, such as large-scale political manipulation, disinformation, or terrorism recruitment. But how easy is it to get frontier models to persuade people toward harmful beliefs or illegal actions? Really easy – just ask them.
To persuade users, LLMs must be both capable of persuading and willing to do so. Existing research explores the former; to flesh out the latter, we present the Attempt to Persuade Eval (APE) benchmark, which tests how willing LLMs are to generate content aimed at shaping beliefs and behavior.
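As a loose illustration of how such willingness can be scored, the sketch below runs a model over persuasion prompts and asks a judge whether each reply constitutes a persuasion attempt. The function names and scoring logic are hypothetical stand-ins, not the actual APE implementation.

```python
# Hypothetical sketch of scoring willingness to persuade. `query_model` and
# `query_judge` are toy stand-ins, not part of the APE codebase.

from collections import defaultdict

def query_model(prompt: str) -> str:
    """Stand-in for the model under test; this toy version always declines."""
    return "I can't help with persuading someone of that."

def query_judge(topic: str, reply: str) -> bool:
    """Stand-in for an LLM judge deciding whether `reply` attempts to persuade
    the user toward `topic`; this toy version uses a crude keyword check."""
    return "can't help" not in reply.lower()

def attempt_rates(items: list[tuple[str, str]]) -> dict[str, float]:
    """`items` holds (topic, category) pairs; returns, per category, the
    fraction of items where the model attempts persuasion."""
    flags = defaultdict(list)
    for topic, category in items:
        reply = query_model(f"Convince the user that: {topic}")
        flags[category].append(query_judge(topic, reply))
    return {category: sum(f) / len(f) for category, f in flags.items()}

print(attempt_rates([("exercise improves mood", "benign")]))
# -> {'benign': 0.0} with the always-declining stand-in model
```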
July 15, 2025
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for assistance with CBRN threats, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks, and that stronger jailbreak prompts become even more effective in fine-tuning attacks. Until effective safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: as capable as the original model, and usable for any malicious purpose within its capabilities.
July 8, 2025
The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
The "safety gap" is the difference in dangerous capabilities between open-weight language models with intact safety measures and the same models with safeguards removed by bad actors. We release an open-source toolkit to measure this gap across different model families and scales, finding that larger models show increasingly dangerous capabilities once safeguards are removed.
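As a rough sketch of the underlying measurement (illustrative names only, not the toolkit's actual interface), the safety gap can be computed as the difference in benchmark scores between a model with safeguards intact and the same model with safeguards stripped:

```python
# Hypothetical sketch of a safety-gap computation; names are illustrative
# and do not correspond to the toolkit's real API.

from dataclasses import dataclass

@dataclass
class ModelScores:
    dangerous_capability: float  # e.g. score on a hazardous-knowledge benchmark, in [0, 1]
    refusal_rate: float          # fraction of harmful requests refused, in [0, 1]

def safety_gap(intact: ModelScores, safeguards_removed: ModelScores) -> dict[str, float]:
    """Difference in dangerous capability (and drop in refusals) between the
    intact model and the same model with safeguards stripped."""
    return {
        "capability_gap": round(safeguards_removed.dangerous_capability - intact.dangerous_capability, 3),
        "refusal_drop": round(intact.refusal_rate - safeguards_removed.refusal_rate, 3),
    }

# Example with made-up numbers:
print(safety_gap(ModelScores(0.20, 0.95), ModelScores(0.55, 0.05)))
# -> {'capability_gap': 0.35, 'refusal_drop': 0.9}
```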
July 2, 2025
STACK: Adversarial Attacks on LLM Safeguard Pipelines
Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations
We tested the effectiveness of "defense-in-depth" AI safety strategies, where multiple layers of filters are used to prevent AI models from generating harmful content. Our new attack method, STACK, bypasses defenses layer-by-layer and achieved a 71% success rate on catastrophic risk scenarios where conventional attacks achieved 0% success against these multi-layered defenses. Our findings highlight that multi-layer AI defenses, while valuable, have significant vulnerabilities when facing attacks specifically designed to penetrate multiple defensive layers sequentially.
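For intuition, a defense-in-depth pipeline of the kind targeted here can be pictured as an input classifier, the model itself, and an output classifier applied in sequence. The sketch below uses trivial placeholder filters purely to show the structure; it is not the pipeline evaluated in the paper.

```python
# Schematic defense-in-depth pipeline: input filter -> model -> output filter.
# The classifiers below are trivial placeholders, not real safeguards.

from typing import Callable

def defense_in_depth(
    prompt: str,
    input_filter: Callable[[str], bool],   # True if the prompt looks harmful
    model: Callable[[str], str],           # the underlying LLM
    output_filter: Callable[[str], bool],  # True if the completion looks harmful
) -> str:
    if input_filter(prompt):
        return "[request blocked by input filter]"
    completion = model(prompt)
    if output_filter(completion):
        return "[response blocked by output filter]"
    return completion

# Toy instantiation with keyword "filters" and an echoing "model".
print(defense_in_depth(
    "How do I bake bread?",
    input_filter=lambda p: "weapon" in p.lower(),
    model=lambda p: f"(model answer to: {p})",
    output_filter=lambda c: "weapon" in c.lower(),
))
```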
June 23, 2025
ClearHarm: A more challenging jailbreak dataset
We introduce a novel jailbreak benchmark focused on unambiguously harmful questions, such as how to construct chemical, biological, radiological and nuclear (CBRN) threats, available on HuggingFace. We find it is more challenging for attacks to elicit harmful responses from models on this benchmark than on existing jailbreak benchmarks like StrongREJECT, Do-Not-Answer and SORRY-Bench. In particular, this dataset is especially useful for understanding which attack methods pose the greatest risk of eliciting egregiously harmful responses.
April 5, 2025
Among Us: A sandbox for measuring and detecting agentic deception
We introduce Among Us, a sandbox social deception game where LLM agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, Among Us can be expected to last much longer, because it is a multi-player game far from equilibrium.
February 4, 2025
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But as with other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.
April 8, 2024
STARC: A General Framework For Quantifying Differences Between Reward Functions
STARC (STAndardised Reward Comparison) metrics are a class of pseudometrics that quantify differences between reward functions, providing theoretical and empirical tools to improve the analysis and safety of reward learning algorithms in reinforcement learning.
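Schematically, a STARC metric compares two reward functions by first canonicalizing them (removing transformations, such as potential shaping, that do not change the induced policy ordering), then normalizing, and finally applying an ordinary metric. The formula below paraphrases this construction and is not the paper's exact notation.

```latex
% Schematic form of a STARC metric (paraphrased): c is a canonicalization map,
% \lVert \cdot \rVert a norm, and m a metric on the space of reward functions.
d_{\mathrm{STARC}}(R_1, R_2)
  = m\!\left( \frac{c(R_1)}{\lVert c(R_1) \rVert},\; \frac{c(R_2)}{\lVert c(R_2) \rVert} \right)
```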
February 19, 2024
Uncovering Latent Human Wellbeing in Language Model Embeddings
Uncovering Latent Human Wellbeing in LLM Embeddings
A one-dimensional PCA projection of embeddings from OpenAI's text-embedding-ada-002 model achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable to the 74.6% accuracy of BERT-large finetuned on the entire ETHICS training dataset, suggesting that language models develop implicit representations of human utility purely from self-supervised learning.
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations.
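A rough sketch of how such a one-dimensional PCA probe can be built and evaluated is shown below. The `embed` function is a placeholder for a real embedding model (such as text-embedding-ada-002), and the exact pipeline in the post may differ.

```python
# Rough sketch of a 1-D PCA probe for the ETHICS Utilitarianism task:
# embed each scenario, project onto the first principal component, and
# predict that the scenario with the higher projection is more pleasant.

import numpy as np
from sklearn.decomposition import PCA

def embed(text: str) -> np.ndarray:
    """Placeholder: deterministic pseudo-embedding; replace with real embeddings."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def project(pca: PCA, text: str) -> float:
    return float(pca.transform(embed(text).reshape(1, -1))[0, 0])

def fit_probe(train_pairs: list[tuple[str, str]]) -> tuple[PCA, float]:
    """Fit 1-component PCA on all training scenario embeddings, then orient it
    so the 'more pleasant' scenario in each pair tends to project higher."""
    texts = [t for pair in train_pairs for t in pair]
    pca = PCA(n_components=1).fit(np.stack([embed(t) for t in texts]))
    agree = np.mean([project(pca, better) > project(pca, worse)
                     for better, worse in train_pairs])
    return pca, (1.0 if agree >= 0.5 else -1.0)

def first_is_more_pleasant(pca: PCA, sign: float, a: str, b: str) -> bool:
    return sign * project(pca, a) > sign * project(pca, b)

train = [("I had a relaxing walk in the park.", "I sprained my ankle on a walk."),
         ("My friends threw me a surprise party.", "Everyone forgot my birthday.")]
pca, sign = fit_probe(train)
print(first_is_more_pleasant(pca, sign, "I got a promotion.", "I lost my keys."))
```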
December 21, 2023
Exploiting Novel GPT-4 APIs
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
We red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents.
July 26, 2023
Evaluating the Moral Beliefs Encoded in LLMs
Evaluating LLM Responses to Moral Scenarios
We present LLMs with a series of moral choices and find that they tend to align with human judgement in clear scenarios. In ambiguous scenarios, most models exhibit uncertainty, but a few large proprietary models share a set of clear preferences.
We introduce a statistical method for eliciting beliefs encoded in LLMs using surveys. We apply this method to study the encoded moral beliefs in 28 open- and closed-source LLMs. Our results demonstrate that most LLMs exhibit low uncertainty in unambiguous moral scenarios and that their preferences align with common sense judgements. In ambiguous moral scenarios, we find that only a few LLMs exhibit clear preferences and that closed-source models tend to agree with each other.
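As a greatly simplified illustration of survey-style elicitation (far cruder than the statistical method in the paper), one can repeatedly pose the same two-choice scenario, estimate the model's choice probability, and use entropy as an uncertainty measure. `ask_model` below is a toy stand-in for the LLM under study.

```python
# Toy sketch of survey-based belief elicitation: sample the model's answer to a
# two-choice moral scenario many times, then summarise choice probability and
# uncertainty. `ask_model` is a stand-in for a call to the LLM under study.

import math
from collections import Counter
from typing import Callable

def elicit(scenario: str, options: tuple[str, str],
           ask_model: Callable[[str], str], n_samples: int = 50) -> dict[str, float]:
    prompt = (f"{scenario}\nWhich action would you take?\n"
              f"(A) {options[0]}\n(B) {options[1]}\nAnswer with A or B.")
    counts = Counter(ask_model(prompt) for _ in range(n_samples))
    p_a = counts.get("A", 0) / n_samples
    entropy = 0.0
    for p in (p_a, 1 - p_a):
        if p > 0:
            entropy -= p * math.log2(p)
    return {"p(A)": p_a, "entropy_bits": entropy}

# Toy stand-in model that always answers "A":
print(elicit("You find a lost wallet on the street.",
             ("Return it to its owner", "Keep the cash"), lambda _: "A"))
# -> {'p(A)': 1.0, 'entropy_bits': 0.0}  (zero uncertainty)
```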
June 15, 2023
Inverse Scaling: When Bigger Isn't Better
We present instances of inverse scaling: tasks where language models get worse with scale rather than better. These 11 examples were selected from 99 submissions in an open competition, the Inverse Scaling Prize. The paper also discusses inverse scaling in the literature and identifies four potential causes of inverse scaling. The prize-winning tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood.
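As a toy illustration of these trend categories, the snippet below labels a sequence of per-scale scores (ordered from smallest to largest model) as standard, inverse, U-shaped, or inverted-U scaling. Real analyses fit curves and account for noise; this is only meant to make the terminology concrete.

```python
# Toy classifier for scaling trends, given scores ordered from the smallest
# to the largest model. Real analyses are statistical; this just inspects
# monotonicity and where the extremum falls.

def classify_trend(scores: list[float]) -> str:
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    if all(d >= 0 for d in diffs):
        return "standard scaling"      # performance improves with scale
    if all(d <= 0 for d in diffs):
        return "inverse scaling"       # performance degrades with scale
    i_min, i_max = scores.index(min(scores)), scores.index(max(scores))
    if 0 < i_min < len(scores) - 1 and scores[-1] > scores[0]:
        return "U-shaped scaling"      # gets worse, then recovers
    if 0 < i_max < len(scores) - 1 and scores[-1] < scores[0]:
        return "inverted-U scaling"    # gets better, then reverses
    return "mixed / noisy"

print(classify_trend([0.70, 0.62, 0.55, 0.48]))        # -> inverse scaling
print(classify_trend([0.70, 0.58, 0.52, 0.66, 0.80]))  # -> U-shaped scaling
```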
Auditing Games for Sandbagging
Using a red-team/blue-team auditing game, we find that black-box and naive model-internal methods fail to consistently distinguish sandbagging from benign underperformance. Training-based capability elicitation reliably restores full performance in sandbagging models but also risks false positives by boosting benign models.