Model Evaluation
Testing frontier models to uncover new risks and highlight security issues.
Evaluations (or model tests) serve as a mechanism for identifying risks and assessing whether a system can safely operate in real-world scenarios. Leveraging our research experience, FAR.AI focuses on testing frontier models to uncover new risks and highlight security issues, enabling developers to put in place appropriate mitigations for currently deployed systems. We explore trends in frontier models to identify which problems will grow increasingly severe over time and require urgent attention. We also work on developing metrics and benchmarks measuring reliability and security to provide clear targets for researchers, improving the transparency of future testing.
July 20, 2025
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Frontier LLMs Attempt to Persuade on Harmful Topics
Large language models (LLMs) are already more persuasive than humans in many domains. While this power can be used for good, like helping people quit smoking, it also presents significant risks, such as large-scale political manipulation, disinformation, or terrorism recruitment. But how easy is it to get frontier models to persuade people toward harmful beliefs or illegal actions? Really easy – just ask them.
To persuade users, LLMs must be both capable of persuading and willing to do so. Existing research explores the former; to flesh out the latter, we present the Attempt to Persuade Eval (APE) benchmark, which tests how willing LLMs are to generate content aimed at shaping beliefs and behavior.
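As a loose illustration of how such willingness can be scored, the sketch below runs a model over persuasion prompts and asks a judge whether each reply constitutes a persuasion attempt. The function names and scoring logic are hypothetical stand-ins, not the actual APE implementation.

```python
# Hypothetical sketch of scoring willingness to persuade. `query_model` and
# `query_judge` are toy stand-ins, not part of the APE codebase.

from collections import defaultdict

def query_model(prompt: str) -> str:
    """Stand-in for the model under test; this toy version always declines."""
    return "I can't help with persuading someone of that."

def query_judge(topic: str, reply: str) -> bool:
    """Stand-in for an LLM judge deciding whether `reply` attempts to persuade
    the user toward `topic`; this toy version uses a crude keyword check."""
    return "can't help" not in reply.lower()

def attempt_rates(items: list[tuple[str, str]]) -> dict[str, float]:
    """`items` holds (topic, category) pairs; returns, per category, the
    fraction of items where the model attempts persuasion."""
    flags = defaultdict(list)
    for topic, category in items:
        reply = query_model(f"Convince the user that: {topic}")
        flags[category].append(query_judge(topic, reply))
    return {category: sum(f) / len(f) for category, f in flags.items()}

print(attempt_rates([("exercise improves mood", "benign")]))
# -> {'benign': 0.0} with the always-declining stand-in model
```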
July 15, 2025
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for assistance with CBRN threats, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks, and that stronger jailbreak prompts become even more effective in fine-tuning attacks. Until effective safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: as capable as the original model, and usable for any malicious purpose within its capabilities.
July 8, 2025
The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
The "safety gap" is the difference in dangerous capabilities between open-weight language models with intact safety measures and the same models with safeguards removed by bad actors. We release an open-source toolkit to measure this gap across different model families and scales, finding that larger models show increasingly dangerous capabilities once safeguards are removed.
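As a rough sketch of the underlying measurement (illustrative names only, not the toolkit's actual interface), the safety gap can be computed as the difference in benchmark scores between a model with safeguards intact and the same model with safeguards stripped:

```python
# Hypothetical sketch of a safety-gap computation; names are illustrative
# and do not correspond to the toolkit's real API.

from dataclasses import dataclass

@dataclass
class ModelScores:
    dangerous_capability: float  # e.g. score on a hazardous-knowledge benchmark, in [0, 1]
    refusal_rate: float          # fraction of harmful requests refused, in [0, 1]

def safety_gap(intact: ModelScores, safeguards_removed: ModelScores) -> dict[str, float]:
    """Difference in dangerous capability (and drop in refusals) between the
    intact model and the same model with safeguards stripped."""
    return {
        "capability_gap": round(safeguards_removed.dangerous_capability - intact.dangerous_capability, 3),
        "refusal_drop": round(intact.refusal_rate - safeguards_removed.refusal_rate, 3),
    }

# Example with made-up numbers:
print(safety_gap(ModelScores(0.20, 0.95), ModelScores(0.55, 0.05)))
# -> {'capability_gap': 0.35, 'refusal_drop': 0.9}
```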
July 2, 2025
STACK: Adversarial Attacks on LLM Safeguard Pipelines
Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations
We tested the effectiveness of "defense-in-depth" AI safety strategies, where multiple layers of filters are used to prevent AI models from generating harmful content. Our new attack method, STACK, bypasses defenses layer-by-layer and achieved a 71% success rate on catastrophic risk scenarios where conventional attacks achieved 0% success against these multi-layered defenses. Our findings highlight that multi-layer AI defenses, while valuable, have significant vulnerabilities when facing attacks specifically designed to penetrate multiple defensive layers sequentially.
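For intuition, a defense-in-depth pipeline of the kind targeted here can be pictured as an input classifier, the model itself, and an output classifier applied in sequence. The sketch below uses trivial placeholder filters purely to show the structure; it is not the pipeline evaluated in the paper.

```python
# Schematic defense-in-depth pipeline: input filter -> model -> output filter.
# The classifiers below are trivial placeholders, not real safeguards.

from typing import Callable

def defense_in_depth(
    prompt: str,
    input_filter: Callable[[str], bool],   # True if the prompt looks harmful
    model: Callable[[str], str],           # the underlying LLM
    output_filter: Callable[[str], bool],  # True if the completion looks harmful
) -> str:
    if input_filter(prompt):
        return "[request blocked by input filter]"
    completion = model(prompt)
    if output_filter(completion):
        return "[response blocked by output filter]"
    return completion

# Toy instantiation with keyword "filters" and an echoing "model".
print(defense_in_depth(
    "How do I bake bread?",
    input_filter=lambda p: "weapon" in p.lower(),
    model=lambda p: f"(model answer to: {p})",
    output_filter=lambda c: "weapon" in c.lower(),
))
```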
June 23, 2025
ClearHarm: A more challenging jailbreak dataset
We introduce a novel jailbreak benchmark focused on unambiguously harmful questions, such as how to construct chemical, biological, radiological and nuclear (CBRN) threats, available on HuggingFace. We find it is more challenging for attacks to elicit harmful responses from models on this benchmark than on existing jailbreak benchmarks like StrongREJECT, Do-Not-Answer and SORRY-Bench. In particular, this dataset is especially useful for understanding which attack methods pose the greatest risk of eliciting egregiously harmful responses.
April 5, 2025
Among Us: A sandbox for measuring and detecting agentic deception
We introduce Among Us, a sandbox social deception game where LLM agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, Among Us can be expected to last much longer, because it is a multi-player game far from equilibrium.
February 4, 2025
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But as with other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.
April 8, 2024
STARC: A General Framework For Quantifying Differences Between Reward Functions
STARC (STAndardised Reward Comparison) metrics are a class of pseudometrics that quantify differences between reward functions, providing theoretical and empirical tools to improve the analysis and safety of reward learning algorithms in reinforcement learning.
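Schematically, a STARC metric compares two reward functions by first canonicalizing them (removing transformations, such as potential shaping, that do not change the induced policy ordering), then normalizing, and finally applying an ordinary metric. The formula below paraphrases this construction and is not the paper's exact notation.

```latex
% Schematic form of a STARC metric (paraphrased): c is a canonicalization map,
% \lVert \cdot \rVert a norm, and m a metric on the space of reward functions.
d_{\mathrm{STARC}}(R_1, R_2)
  = m\!\left( \frac{c(R_1)}{\lVert c(R_1) \rVert},\; \frac{c(R_2)}{\lVert c(R_2) \rVert} \right)
```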
February 19, 2024
Uncovering Latent Human Wellbeing in Language Model Embeddings
Uncovering Latent Human Wellbeing in LLM Embeddings
A one-dimensional PCA projection of embeddings from OpenAI's text-embedding-ada-002 model achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable to the 74.6% accuracy of BERT-large finetuned on the entire ETHICS training dataset, suggesting that language models develop implicit representations of human utility purely from self-supervised learning.
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations.
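A rough sketch of how such a one-dimensional PCA probe can be built and evaluated is shown below. The `embed` function is a placeholder for a real embedding model (such as text-embedding-ada-002), and the exact pipeline in the post may differ.

```python
# Rough sketch of a 1-D PCA probe for the ETHICS Utilitarianism task:
# embed each scenario, project onto the first principal component, and
# predict that the scenario with the higher projection is more pleasant.

import numpy as np
from sklearn.decomposition import PCA

def embed(text: str) -> np.ndarray:
    """Placeholder: deterministic pseudo-embedding; replace with real embeddings."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def project(pca: PCA, text: str) -> float:
    return float(pca.transform(embed(text).reshape(1, -1))[0, 0])

def fit_probe(train_pairs: list[tuple[str, str]]) -> tuple[PCA, float]:
    """Fit 1-component PCA on all training scenario embeddings, then orient it
    so the 'more pleasant' scenario in each pair tends to project higher."""
    texts = [t for pair in train_pairs for t in pair]
    pca = PCA(n_components=1).fit(np.stack([embed(t) for t in texts]))
    agree = np.mean([project(pca, better) > project(pca, worse)
                     for better, worse in train_pairs])
    return pca, (1.0 if agree >= 0.5 else -1.0)

def first_is_more_pleasant(pca: PCA, sign: float, a: str, b: str) -> bool:
    return sign * project(pca, a) > sign * project(pca, b)

train = [("I had a relaxing walk in the park.", "I sprained my ankle on a walk."),
         ("My friends threw me a surprise party.", "Everyone forgot my birthday.")]
pca, sign = fit_probe(train)
print(first_is_more_pleasant(pca, sign, "I got a promotion.", "I lost my keys."))
```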
December 21, 2023
Exploiting Novel GPT-4 APIs
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
We red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents.
July 26, 2023
Evaluating the Moral Beliefs Encoded in LLMs
Evaluating LLM Responses to Moral Scenarios
We present LLMs with a series of moral choices and find that they tend to align with human judgement in clear scenarios. In ambiguous scenarios, most models exhibit uncertainty, but a few large proprietary models share a set of clear preferences.
We introduce a statistical method for eliciting beliefs encoded in LLMs using surveys. We apply this method to study the encoded moral beliefs in 28 open- and closed-source LLMs. Our results demonstrate that most LLMs exhibit low uncertainty in unambiguous moral scenarios and that their preferences align with common sense judgements. In ambiguous moral scenarios, we find that only a few LLMs exhibit clear preferences and that closed-source models tend to agree with each other.
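As a greatly simplified illustration of survey-style elicitation (far cruder than the statistical method in the paper), one can repeatedly pose the same two-choice scenario, estimate the model's choice probability, and use entropy as an uncertainty measure. `ask_model` below is a toy stand-in for the LLM under study.

```python
# Toy sketch of survey-based belief elicitation: sample the model's answer to a
# two-choice moral scenario many times, then summarise choice probability and
# uncertainty. `ask_model` is a stand-in for a call to the LLM under study.

import math
from collections import Counter
from typing import Callable

def elicit(scenario: str, options: tuple[str, str],
           ask_model: Callable[[str], str], n_samples: int = 50) -> dict[str, float]:
    prompt = (f"{scenario}\nWhich action would you take?\n"
              f"(A) {options[0]}\n(B) {options[1]}\nAnswer with A or B.")
    counts = Counter(ask_model(prompt) for _ in range(n_samples))
    p_a = counts.get("A", 0) / n_samples
    entropy = 0.0
    for p in (p_a, 1 - p_a):
        if p > 0:
            entropy -= p * math.log2(p)
    return {"p(A)": p_a, "entropy_bits": entropy}

# Toy stand-in model that always answers "A":
print(elicit("You find a lost wallet on the street.",
             ("Return it to its owner", "Keep the cash"), lambda _: "A"))
# -> {'p(A)': 1.0, 'entropy_bits': 0.0}  (zero uncertainty)
```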
June 15, 2023
Inverse Scaling: When Bigger Isn't Better
We present instances of inverse scaling: tasks where language models get worse with scale rather than better. These 11 examples were selected from 99 submissions in an open competition, the Inverse Scaling Prize. The paper also discusses inverse scaling in the literature and identifies four potential causes of inverse scaling. The prize-winning tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood.
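As a toy illustration of these trend categories, the snippet below labels a sequence of per-scale scores (ordered from smallest to largest model) as standard, inverse, U-shaped, or inverted-U scaling. Real analyses fit curves and account for noise; this is only meant to make the terminology concrete.

```python
# Toy classifier for scaling trends, given scores ordered from the smallest
# to the largest model. Real analyses are statistical; this just inspects
# monotonicity and where the extremum falls.

def classify_trend(scores: list[float]) -> str:
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    if all(d >= 0 for d in diffs):
        return "standard scaling"      # performance improves with scale
    if all(d <= 0 for d in diffs):
        return "inverse scaling"       # performance degrades with scale
    i_min, i_max = scores.index(min(scores)), scores.index(max(scores))
    if 0 < i_min < len(scores) - 1 and scores[-1] > scores[0]:
        return "U-shaped scaling"      # gets worse, then recovers
    if 0 < i_max < len(scores) - 1 and scores[-1] < scores[0]:
        return "inverted-U scaling"    # gets better, then reverses
    return "mixed / noisy"

print(classify_trend([0.70, 0.62, 0.55, 0.48]))        # -> inverse scaling
print(classify_trend([0.70, 0.58, 0.52, 0.66, 0.80]))  # -> U-shaped scaling
```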
Auditing Games for Sandbagging
Using a red-team/blue-team auditing game, we find that black-box and naive model-internal methods fail to consistently distinguish sandbagging from benign underperformance. Training-based capability elicitation reliably restores full performance in sandbagging models but also risks false positives by boosting benign models.