Why Language Models Hallucinate
Summary
Santosh Vempala argues that pre-training provably leads to hallucination, and that current post-training evaluations encourage it further by penalizing "I don't know" responses as heavily as errors; he proposes behavioral calibration as a remedy.
Session Transcript
I'm going to tell you why language models hallucinate. So you shouldn't believe a word I say. This is work with Adam Kalai, the top research scientist at OpenAI, and his colleagues Ofir Nachum and Eddie Zhang. It's entirely theoretical.
And this is also Adam, not a hallucination. So for the purpose of this talk, let's think of generations, whatever the language model outputs, as being labeled either valid or not valid; the invalid ones are the hallucinations.
And there are two takeaways. The first is that pre-training provably encourages hallucinations. I'll go into that in some detail. And the second is that post-training, we suggest, should penalize errors more than “I don't know.”
So to understand this, let's think of two different things you could do with a language model. The standard one is generation, where you'd like to generate text, or complete a sequence, with the same probability as the training distribution. The second is classification, where you're given texts labeled as valid or not valid. This is a classical supervised learning problem: you want to learn to label a new text as valid or not valid.
So how could you use a language model as a classifier? Very simple. You just look at the probability it assigns to a particular text, and if it's above a threshold, you call it valid; otherwise it's invalid. Okay, now the theorem says that the hallucination rate of any language model is going to be at least twice the misclassification rate on this supervised learning problem, up to an error term.
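Before unpacking the theorem, here is what that threshold classifier looks like concretely. This is a minimal sketch, not code from the paper; the `log_prob` function standing in for the language model is hypothetical.

```python
import math

def lm_threshold_classifier(log_prob, threshold):
    """Turn a language model into an is-it-valid classifier.

    log_prob(text) is a hypothetical stand-in for the language model:
    it returns the model's log-probability of `text`.  Any text whose
    probability exceeds `threshold` is declared valid; everything else
    is declared invalid.
    """
    def classify(text):
        return "valid" if math.exp(log_prob(text)) > threshold else "invalid"
    return classify

# Usage with a toy log-probability (shorter texts get higher probability).
toy_log_prob = lambda text: -0.5 * len(text.split())
classify = lm_threshold_classifier(toy_log_prob, threshold=1e-3)
print(classify("Paris is the capital of France."))  # "valid"   (p ~ 0.05)
print(classify(" ".join(["word"] * 40)))            # "invalid" (p ~ 2e-9)
```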
And roughly speaking, the theorem means that if you can't distinguish valid from invalid, you will hallucinate. Now this can't be literally true, because the model could just say “I don't know” all the time, and then it would never hallucinate. So let's look at the theorem more carefully.
So the term on the left is the probability of hallucination — producing something that's not valid — for any language model trained on valid data. The first term on the right is the misclassification error: the probability that the threshold classifier mislabels a text as valid or not valid, on a distribution that assigns half its probability to true valid statements from the training distribution and half uniformly among hallucinations. Okay? And then we have two correction terms.
The first one is just the fraction of valid inputs. Of course, if everything is valid, then you're not going to hallucinate. And the last one is the difference in probability between the learned model and the training distribution on the model's ‘high-confidence set.’
About that term: this is a type of calibration, which is itself a weak form of accuracy. Any coarsening of the distribution will be calibrated with it, and we only need it for the set of generations that are above a certain threshold. Now it turns out that a simple calculation will show you that any local optimum of the log loss for next-word prediction is going to be calibrated; it's going to have this calibration.
So the theorem is really saying that the hallucination rate is at least twice the misclassification rate, minus this particular calibration error. And now we're saying that pre-training, which leads to essentially no calibration error, will lead you to hallucinate.
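Putting the pieces together, the bound can be written roughly as below, where \(\hat{p}\) is the learned model, \(p\) is the training distribution, \(\mathrm{err}_{\mathrm{cls}}\) is the misclassification rate of the threshold classifier on the half-valid, half-hallucination distribution, and \(S_t\) is the model's high-confidence set. This is my paraphrase with illustrative notation, not the paper's exact statement.

```latex
% Rough paraphrase of the bound described in the talk; notation is
% illustrative, not the paper's exact statement.
\[
  \underbrace{\Pr_{x \sim \hat{p}}\!\left[\, x \text{ is not valid} \,\right]}_{\text{hallucination rate}}
  \;\ge\;
  2\,\mathrm{err}_{\mathrm{cls}}
  \;-\; \underbrace{\rho_{\mathrm{valid}}}_{\text{fraction of valid texts}}
  \;-\; \underbrace{\left|\, \hat{p}(S_t) - p(S_t) \,\right|}_{\text{calibration error on } S_t},
  \qquad S_t = \{\, x : \hat{p}(x) > t \,\}.
\]
```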
And you can see this, for example, in the OpenAI technical report, where the pre-trained base model is very nicely calibrated, and post-training reduces hallucination, as it should, but it also reduces calibration. So the point of the theorem is that it allows you to connect reasons for hallucination to reasons for misclassification, which are much better understood.
It could be sample complexity, and this gives us a corollary about, for example, arbitrary facts, where you can say the hallucination rate is at least the fraction of valid facts you haven't seen in training. Or it could be because of a poor model, or it could be computational complexity. Any of these reasons could be at work, but the point is that you can trace the hallucination back to its reason.
Now, post-training should mitigate this, and in principle there is no reason one couldn't mitigate it all the way down to zero. However, current evaluations are binary: “I don't know” and errors get the same penalty in all of them. So as a result, why would a model ever say “I don't know”? It will just guess and take its chance of getting it right.
So the proposal is to score “correct” as 1, “I don't know” as 0, and “wrong” as negative as you want, depending on your target confidence threshold. You could call this behavioral calibration: you'd like the model to say “I don't know” whenever its confidence is below whatever target you have, which could change depending on the context.
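To see why binary grading pushes models toward guessing, here is a toy expected-score calculation. It is a sketch of the incentive argument, not code from the paper; the penalty t/(1−t) for a target confidence t is one choice that makes guessing pay off only above confidence t, and should be read as an illustration.

```python
def expected_score(p_correct, reward_correct=1.0, reward_idk=0.0, penalty_wrong=0.0):
    """Expected score of guessing vs. abstaining with "I don't know".

    p_correct is the model's confidence that its best guess is right.
    Returns (expected score if it guesses, score if it abstains).
    """
    guess = p_correct * reward_correct + (1.0 - p_correct) * (-penalty_wrong)
    abstain = reward_idk
    return guess, abstain

# Binary grading: wrong answers cost nothing extra, so guessing weakly
# beats "I don't know" even at 10% confidence.
print(expected_score(0.10, penalty_wrong=0.0))           # (0.1, 0.0) -> guess

# Confidence-targeted grading: with target confidence t, penalize wrong
# answers by t / (1 - t).  Here t = 0.75, so the penalty is 3 points.
t = 0.75
print(expected_score(0.10, penalty_wrong=t / (1 - t)))   # (-2.6, 0.0) -> abstain
print(expected_score(0.90, penalty_wrong=t / (1 - t)))   # (0.6, 0.0)  -> answer
```

With the zero penalty, guessing dominates abstaining at any confidence; with the t/(1−t) penalty, the break-even point is exactly confidence t, which is the behavior the talk calls behavioral calibration.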
So why do language models hallucinate? Well, pre-trained models are calibrated and can't tell fact from fiction. Post-training penalizes “I don't know” as badly as errors. Thanks for listening.