OpenAI highlighted the importance of perplexity as an evaluation metric in their video on training GPT-4.5.

Perplexity measures how "surprised" a model is by a piece of text. A lower perplexity score means the model assigns higher probability to the words it actually sees, i.e., it is more confident in predicting each next word; higher perplexity indicates greater uncertainty.

Breaking Down Perplexity

For a sequence of words $w_1, w_2, \dots, w_N$, perplexity is calculated as:

$$\text{Perplexity}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$$

Here's what that means:

  • $N$ is the total number of tokens (words) being tested.
  • $P(w_i \mid w_1, \dots, w_{i-1})$ is the probability the model assigns to the actual next token $w_i$, given the tokens before it (the formula is sketched in code right after this list).
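
As a minimal sketch, the formula translates almost directly into NumPy; token_probs here is a hypothetical list of the probabilities a model assigned to each actual token in a sequence:

import numpy as np

def perplexity(token_probs):
    # exponential of the average negative log-probability per token
    log_probs = np.log(token_probs)
    return np.exp(-np.mean(log_probs))

The worked example below carries out the same computation step by step.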

Example Calculation

This is easiest to understand with a toy sentence, "the dog barks," and pretend probabilities assigned by a hypothetical model:

  • P("the")=0.5P(\text{"the"}) = 0.5
  • P("dog""the")=0.3P(\text{"dog"} \mid \text{"the"}) = 0.3
  • P("barks""the dog")=0.2P(\text{"barks"} \mid \text{"the dog"}) = 0.2

Step 1: Compute log probabilities

import numpy as np

probs = [0.5, 0.3, 0.2]
log_probs = np.log(probs)
# [-0.693, -1.204, -1.609]

Step 2: Sum the log probabilities

sum_log_probs = np.sum(log_probs)
# -3.507

Step 3: Calculate average negative log-likelihood

avg_neg_log = -sum_log_probs / len(probs)
# 1.169

Step 4: Compute perplexity

perplexity = np.exp(avg_neg_log)
# 3.218

A perplexity of 3.218 means the model is, on average, about as uncertain as if it had to choose uniformly among roughly 3.2 equally likely words at each step.
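
Equivalently, perplexity is the inverse geometric mean of the assigned probabilities, so the step-by-step result can be sanity-checked in one line (reusing probs from Step 1):

np.prod(probs) ** (-1 / len(probs))
# 3.218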

Key Insights

  • Lower perplexity → Better model at capturing language patterns.
  • Perplexity is the exponential of the cross-entropy loss, so minimizing one minimizes the other (sketched in code after this list).
  • It's most meaningful when comparing different models on the same dataset.
  • In the same video (at 43:41), OpenAI describes their internal codebase as the best model eval.
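
To make the cross-entropy connection concrete, here is a minimal sketch; the logits and targets are made-up numbers, not output from a real model:

import numpy as np

# Made-up logits for 3 positions over a 4-word vocabulary
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.3, 1.8, 0.2,  0.0],
                   [0.1, 0.4, 1.5, -0.5]])
targets = np.array([0, 1, 2])  # index of the actual next token at each position

# Softmax over the vocabulary, then pick each true token's probability
dist = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p_true = dist[np.arange(len(targets)), targets]

cross_entropy = -np.log(p_true).mean()  # average negative log-likelihood
perplexity = np.exp(cross_entropy)      # perplexity = exp(cross-entropy)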