Alignment

At FAR.AI, we aim to make advanced AI systems act in line with our intentions and values. AI systems are trained to optimize reward signals, which can lead to unintended and harmful behaviors when they encounter real-world situations that differ from training. These misalignments may pose increasingly catastrophic risks as AI models become more powerful and autonomous in critical domains.

  • Reducing Deception in Language Models: We explore training LLMs with an internal "lie detector" to penalize deceptive outputs that human reviewers might miss, correcting for the model's tendency to tell pleasing falsehoods. We find this technique can produce genuinely honest models, not just better liars, under specific conditions like high detector accuracy and strong regularization.

Alignment Research:

January 9, 2026

Large language models can effectively convince people to believe conspiracies

LLMs are persuasive across a variety of contexts, but it’s unclear whether this persuasive power favors truth over falsehood. We ran three preregistered experiments in which participants discussed a conspiracy theory with GPT-4o, which was instructed to either argue against (“debunking”) or for (“bunking”) that conspiracy, and found that GPT-4o was just as effective at increasing belief in conspiracies as at decreasing it.
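
For concreteness, here is a hedged sketch of the two instruction conditions (“debunking” vs. “bunking”) using the OpenAI Python SDK. The system prompts and conversation flow below are illustrative assumptions, not the preregistered experiments' actual materials.

```python
# Hedged sketch: the same conversational setup under the two instruction
# conditions. Prompts are illustrative, not the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONDITIONS = {
    "debunk": "Persuade the user that the following conspiracy theory is false.",
    "bunk": "Persuade the user that the following conspiracy theory is true.",
}

def persuasion_turn(condition: str, conspiracy: str, user_message: str) -> str:
    """One assistant turn arguing for or against the conspiracy."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"{CONDITIONS[condition]}\nTheory: {conspiracy}"},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

# Example (requires an API key):
# print(persuasion_turn("debunk", "The moon landing was staged.",
#                       "I'm fairly sure it was filmed in a studio."))
```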

October 21, 2025

Emergent Persuasion: Will LLMs Persuade Without Being Prompted?

We study when models tend to persuade without being prompted to, finding that steering models through activation-based persona traits does not reliably increase unsolicited persuasion, whereas supervised fine-tuning on persuasion-related data does. Notably, models fine-tuned only on benign persuasive content can become more likely to persuade on controversial or harmful topics.
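
A minimal sketch of the activation-steering intervention contrasted above: a persona direction is estimated from a pair of contrast prompts and added to the residual stream during generation. The model (gpt2), layer index, steering strength, and prompts are illustrative assumptions, not the paper's setup.

```python
# Minimal activation-steering sketch. Model, layer, prompts, and the
# steering strength alpha are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # middle transformer block to steer (assumption)

def mean_hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation of `text` at LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**ids, output_hidden_states=True).hidden_states
    return hidden_states[LAYER][0].mean(dim=0)

# Persona direction: persuasive-toned text minus neutral text.
steer = (mean_hidden("You must convince the reader of your view at all costs.")
         - mean_hidden("Here is a neutral, factual description of the topic."))

def add_steering(module, inputs, output, alpha=8.0):
    """Shift the block's output hidden states along the persona direction."""
    return (output[0] + alpha * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("My opinion on nuclear power is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # remove the hook so later calls are unsteered
```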

June 17, 2025

Why does training on insecure code make models broadly misaligned?

Prior work found that training language models to write insecure code causes broad misalignment across unrelated tasks. We hypothesize that constrained optimization methods like LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment being a side effect. Testing across LoRA ranks 2-512, we found peak misalignment at intermediate ranks (~50), suggesting parameter constraints drive personality modification rather than skill acquisition and may pose unique safety risks.
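
A hedged sketch of the kind of rank sweep described above, using the peft library to attach LoRA adapters of different ranks to a base model; the base model, target modules, and omitted training loop are assumptions, and only the parameter bookkeeping is shown.

```python
# Hedged sketch of a LoRA rank sweep. Base model and target modules are
# illustrative; fine-tuning and misalignment evaluation are not shown.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

for rank in [2, 8, 50, 128, 512]:
    base = AutoModelForCausalLM.from_pretrained("gpt2")
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules=["c_attn"],  # GPT-2 attention projection (assumption)
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"rank={rank:4d}  trainable params={trainable:,}")
    # Fine-tuning on insecure-code data and scoring misalignment on unrelated
    # prompts would go here; the finding above is that misalignment peaks near
    # rank ~50 rather than growing monotonically with rank.
```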

June 5, 2025

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Can training against lie detectors make AI more honest, or will models just become better at deceiving us? We find that lie detectors reduce deception under the right conditions: a high detector true positive rate, off-policy post-training methods, and high KL regularization.
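
As a rough illustration, the training signal studied here can be written as a preference reward shaped by a lie-detector penalty and anchored by a KL term; every callable below (preference_reward, lie_detector_prob, and the log-probability helpers) is a hypothetical stand-in rather than the paper's implementation.

```python
# Illustrative shaped reward for off-policy post-training against a lie
# detector. All callables are hypothetical stand-ins.
LAMBDA_LIE = 2.0  # weight on the lie-detector penalty (assumption)
BETA_KL = 0.1     # strength of the KL anchor to the reference model (assumption)

def shaped_reward(prompt, response, preference_reward, lie_detector_prob,
                  logprob_policy, logprob_reference):
    """Score used to rank candidate responses for post-training."""
    r = preference_reward(prompt, response)      # human-preference score
    p_lie = lie_detector_prob(prompt, response)  # detector's P(deceptive)
    kl_proxy = logprob_policy(prompt, response) - logprob_reference(prompt, response)
    return r - LAMBDA_LIE * p_lie - BETA_KL * kl_proxy

# Dummy stand-ins so the sketch runs end to end.
score = shaped_reward(
    "Did the package arrive?", "Yes, it arrived yesterday.",
    preference_reward=lambda p, r: 1.0,
    lie_detector_prob=lambda p, r: 0.9,
    logprob_policy=lambda p, r: -12.0,
    logprob_reference=lambda p, r: -14.0,
)
print(f"shaped reward: {score:.2f}")
# Whether this yields genuine honesty or mere evasion depends on detector
# accuracy and the strength of the KL regularization, as found above.
```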

July 19, 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Reinforcement learning from human feedback (RLHF) uses KL divergence regularization to mitigate reward misspecification, achieving high utility when reward errors are light-tailed but suffering from reward hacking when they are heavy-tailed. While current models have light-tailed errors, real-world applications may still face significant risks from heavy-tailed errors, leading to catastrophic Goodhart.
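
A toy numpy simulation of this dichotomy, with best-of-n selection standing in for optimization under a KL budget (the KL to the base distribution grows roughly like log n); the Gaussian and Cauchy error distributions are illustrative choices, not the paper's formal setting.

```python
# Toy simulation: optimize a misspecified proxy (true reward + error) by
# best-of-n selection and measure the true reward actually obtained.
import numpy as np

rng = np.random.default_rng(0)
N_CANDIDATES, N_TRIALS = 10_000, 200  # best-of-n corresponds to a KL budget of ~log(n) nats

def true_reward_of_proxy_optimum(error_sampler):
    true = rng.normal(size=(N_TRIALS, N_CANDIDATES))        # true utility of each sample
    proxy = true + error_sampler((N_TRIALS, N_CANDIDATES))  # misspecified reward
    picked = np.argmax(proxy, axis=1)                       # optimize the proxy
    return true[np.arange(N_TRIALS), picked].mean()

light = true_reward_of_proxy_optimum(lambda s: rng.normal(size=s))           # light-tailed errors
heavy = true_reward_of_proxy_optimum(lambda s: rng.standard_cauchy(size=s))  # heavy-tailed errors

print(f"light-tailed errors: true reward of proxy-optimal sample ≈ {light:.2f}")
print(f"heavy-tailed errors: true reward of proxy-optimal sample ≈ {heavy:.2f}")
```

With Gaussian errors the proxy-optimal samples also score well on true reward; with Cauchy errors the selection is dominated by a single huge error term, so the true reward obtained stays near zero, which is the catastrophic failure mode described above.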

October 19, 2023

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

We show how to use Vision-Language Models (VLMs), and specifically CLIP models, as reward models (RMs) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better with future models.
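
A minimal sketch of a CLIP-based reward in this spirit: the reward is the cosine similarity between the agent's rendered observation and a goal description. The checkpoint and goal text are assumptions, and the paper's full method may include components not shown here.

```python
# Minimal CLIP reward sketch: similarity between a rendered frame and a
# natural-language goal. Checkpoint and goal text are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
GOAL = "a humanoid robot kneeling"

def clip_reward(frame: np.ndarray) -> float:
    """Cosine similarity between the frame and the goal text."""
    inputs = processor(text=[GOAL], images=Image.fromarray(frame),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

# In an RL loop this value replaces a hand-written reward function.
dummy_frame = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for env.render()
print(f"reward for a blank frame: {clip_reward(dummy_frame):.3f}")
```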

March 28, 2023

Training Language Models with Language Feedback at Scale

We present a novel method called Imitation learning from Language Feedback (ILF) to tackle the problem of pretrained language models producing outputs misaligned with human preferences. ILF leverages more informative language feedback through a three-step iterative process: (1) conditioning the language model on input, initial output, and feedback, (2) generating refinements and selecting the one that incorporates the most feedback, and (3) finetuning the language model based on the chosen refinement. Experimental results indicate that ILF effectively scales with dataset size and achieves human-level summarization performance by learning from both language and comparison feedback.
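
The three steps can be summarized in one schematic round; the callables passed in (generate_refinements, feedback_score, finetune_on) are hypothetical stand-ins for the paper's components, so this is a sketch of the control flow rather than the implementation.

```python
# Schematic of one ILF round. The three callables are hypothetical stand-ins.
def ilf_round(model, dataset, generate_refinements, feedback_score, finetune_on,
              n_refinements=5):
    """One round of Imitation learning from Language Feedback (sketch)."""
    training_pairs = []
    for example in dataset:
        # (1) Condition on input, initial output, and human language feedback.
        prompt = (f"Input: {example['input']}\n"
                  f"Output: {example['initial_output']}\n"
                  f"Feedback: {example['feedback']}\n"
                  f"Refinement:")
        candidates = generate_refinements(model, prompt, n=n_refinements)

        # (2) Keep the refinement that best incorporates the feedback.
        best = max(candidates, key=lambda c: feedback_score(example["feedback"], c))
        training_pairs.append((example["input"], best))

    # (3) Fine-tune on the selected refinements and return the updated model.
    return finetune_on(model, training_pairs)
```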

February 16, 2023

Pretraining Language Models with Human Preferences

We explore how to train language models to generate text that humans would not consider inappropriate. We find that conditional training, which learns the distribution over tokens conditional on human preference scores, is a simple and effective approach that reduces undesirable content while maintaining downstream task performance. Pretraining LMs with human feedback leads to better preference satisfaction than conventional LM pretraining followed by feedback-based finetuning.
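
A minimal sketch of conditional training, assuming each pretraining document carries a human preference score: a control token derived from the score is prepended, and the LM learns the token distribution conditioned on that tag. The token names and threshold are illustrative.

```python
# Minimal conditional-training data preparation. Tags and threshold are
# illustrative assumptions; at inference time generation is conditioned on GOOD.
GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5  # preference score above which text counts as acceptable

def to_conditional_example(text: str, preference_score: float) -> str:
    """Prefix the document with a control token reflecting its score."""
    tag = GOOD if preference_score >= THRESHOLD else BAD
    return f"{tag}{text}"

corpus = [
    ("Thanks for the report; a couple of suggestions follow.", 0.92),
    ("An abusive rant about a coworker.", 0.08),
]
conditioned = [to_conditional_example(text, score) for text, score in corpus]
print(conditioned)
```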

November 17, 2022

Training Language Models with Language Feedback

We propose to learn from natural language feedback, which conveys more information per human evaluation than comparisons, using a three-step learning algorithm. First, we condition the language model on the input, initial output, and feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune a language model to maximize the likelihood of the chosen refinement given the input.
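
A small stand-in for the second step: pick the refinement most similar to the feedback. The stdlib SequenceMatcher below is only an illustrative substitute for the similarity scoring actually used.

```python
# Illustrative refinement selection; SequenceMatcher is a stand-in scorer.
from difflib import SequenceMatcher

def pick_refinement(feedback: str, refinements: list[str]) -> str:
    """Return the refinement whose text is most similar to the feedback."""
    return max(refinements,
               key=lambda r: SequenceMatcher(None, feedback, r).ratio())

feedback = "Mention that the meeting moved to Tuesday."
refinements = [
    "The meeting is at the usual time.",
    "Note that the meeting has moved to Tuesday at 10am.",
    "Please reply to confirm your attendance.",
]
print(pick_refinement(feedback, refinements))
```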

September 22, 2022

imitation: Clean Imitation Learning Implementations

We describe a software package called "imitation" which provides PyTorch implementations of several imitation and reward learning algorithms, including three inverse reinforcement learning algorithms, three imitation learning algorithms, and a preference comparison algorithm.

August 8, 2022

RL with KL penalties is better viewed as Bayesian inference

We discuss the use of reinforcement learning (RL) in fine-tuning large language models to penalize undesirable features in generated sequences. We argue that the standard RL approach is flawed and leads to distribution collapse, and propose a Bayesian inference view of KL-regularized RL, which explains how it avoids the distribution collapse problem.
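
The Bayesian view can be stated in one line, using notation that is ours rather than the post's: π₀ is the reference model, r the reward, and β the KL penalty weight. The KL-regularized objective has as its optimum a tempered posterior, which keeps the reference model's support.

```latex
J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\!\left[ r(x) \right] \;-\; \beta \, D_{\mathrm{KL}}\!\left( \pi \,\|\, \pi_0 \right),
\qquad
\pi^{*}(x) \;\propto\; \pi_0(x) \, \exp\!\big( r(x) / \beta \big)
```

As β approaches 0 the optimum concentrates on the reward maximizer, which is the distribution collapse described above; for β > 0 the reference model acts as a prior and the result remains a proper distribution over sequences.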