Understanding Perplexity in Language Model Evaluation
The importance of perplexity as an evaluation metric was highlighted by OpenAI in their video on training GPT-4.5.
Perplexity measures how "surprised" a model is by a piece of text. A lower perplexity score means the model is more confident (and usually more accurate) in predicting the next word, while a higher score indicates greater uncertainty.
Breaking Down Perplexity
For a sequence of words $w_1, w_2, \ldots, w_N$, perplexity is calculated as:
$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Here's what that means:
- $N$ is the total number of tokens (words) being tested.
- $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability the model assigns to the actual next token $w_i$, given the previous tokens.
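Here is a minimal sketch of the same formula in NumPy, assuming the per-token probabilities have already been collected into a hypothetical probs list:
import numpy as np

def perplexity(probs):
    # probs: probability the model assigned to each actual next token
    # perplexity = exp(average negative log-likelihood)
    return np.exp(-np.mean(np.log(probs)))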
Example Calculation
Let's work through a toy sentence, "the dog barks," with pretend probabilities assigned by a hypothetical model: 0.5 for "the", 0.3 for "dog", and 0.2 for "barks".
Step 1: Compute log probabilities
import numpy as np
probs = [0.5, 0.3, 0.2]
log_probs = np.log(probs)
# [-0.693, -1.204, -1.609]
Step 2: Sum the log probabilities
sum_log_probs = np.sum(log_probs)
# -3.507
Step 3: Calculate average negative log-likelihood
avg_neg_log = -sum_log_probs / len(probs)
# 1.169
Step 4: Compute perplexity
perplexity = np.exp(avg_neg_log)
# 3.218
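All four steps can also be collapsed into a single expression, which is a handy sanity check:
perplexity = np.exp(-np.mean(np.log(probs)))
# 3.218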
A perplexity of 3.218 suggests the model is, on average, about as uncertain as if it had to choose uniformly among roughly three equally likely words at each step.
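To make that interpretation concrete, a model that spreads its probability uniformly over three words gets a perplexity of exactly 3:
uniform_probs = [1/3, 1/3, 1/3]
perplexity = np.exp(-np.mean(np.log(uniform_probs)))
# 3.0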
Key Insights
- Lower perplexity → the model is better at capturing language patterns.
- Perplexity is the exponential of the cross-entropy loss, so minimizing one means minimizing the other (see the sketch after this list).
- It's most meaningful when comparing different models on the same dataset.
- In the GPT-4.5 video (at 43:41), OpenAI mentions that they consider their internal codebase the best model eval.
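For real models, perplexity is typically computed by exponentiating the cross-entropy loss the model already reports. The sketch below assumes the Hugging Face transformers library, with GPT-2 chosen purely as an illustrative model; any causal LM would work the same way:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "the dog barks"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over the predicted tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(perplexity.item())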