Understanding Perplexity in Language Model Evaluation
The importance of perplexity as an evaluation metric was highlighted by OpenAI in their video on training GPT-4.5.
Perplexity measures how "surprised" a model is by a piece of text. A lower perplexity score means the model is more confident (and usually more accurate) in predicting the next word, while a higher score indicates greater uncertainty.
Breaking Down Perplexity
For a sequence of words $w_1, w_2, \ldots, w_N$, perplexity is calculated as:
$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Here's what that means:
- $N$ is the total number of tokens (words) being tested.
- $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability the model assigns to the actual next token $w_i$, given the previous tokens.
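Here is a minimal sketch of the same formula in NumPy, assuming the per-token probabilities have already been collected into a hypothetical probs list:
import numpy as np

def perplexity(probs):
    # probs: probability the model assigned to each actual next token
    # perplexity = exp(average negative log-likelihood)
    return np.exp(-np.mean(np.log(probs)))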
Example Calculation
Let's work through a toy sentence, "the dog barks," with pretend probabilities assigned by a hypothetical model: 0.5 for "the", 0.3 for "dog", and 0.2 for "barks".
Step 1: Compute log probabilities
import numpy as np
probs = [0.5, 0.3, 0.2]
log_probs = np.log(probs)
# [-0.693, -1.204, -1.609]
Step 2: Sum the log probabilities
sum_log_probs = np.sum(log_probs)
# -3.507
Step 3: Calculate average negative log-likelihood
avg_neg_log = -sum_log_probs / len(probs)
# 1.169
Step 4: Compute perplexity
perplexity = np.exp(avg_neg_log)
# 3.218
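All four steps can also be collapsed into a single expression, which is a handy sanity check:
perplexity = np.exp(-np.mean(np.log(probs)))
# 3.218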
A perplexity of 3.218 suggests the model is, on average, about as uncertain as if it had to choose uniformly among roughly three equally likely words at each step.
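To make that interpretation concrete, a model that spreads its probability uniformly over three words gets a perplexity of exactly 3:
uniform_probs = [1/3, 1/3, 1/3]
perplexity = np.exp(-np.mean(np.log(uniform_probs)))
# 3.0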
Key Insights
- Lower perplexity → the model is better at capturing language patterns.
- Perplexity is the exponential of the cross-entropy loss, so minimizing one means minimizing the other (see the sketch after this list).
- It's most meaningful when comparing different models on the same dataset.
- In the GPT-4.5 video (at 43:41), OpenAI mentions that they consider their internal codebase the best model eval.
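For real models, perplexity is typically computed by exponentiating the cross-entropy loss the model already reports. The sketch below assumes the Hugging Face transformers library, with GPT-2 chosen purely as an illustrative model; any causal LM would work the same way:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "the dog barks"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over the predicted tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(perplexity.item())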