Why do today’s giant language models still rely on something as “simple” as cross‑entropy loss? Why is minimizing the negative log‑likelihood (NLL) so much more powerful than, say, squaring the difference between the true label and the predicted probability?

Wanted to save this tidy refresher.


1.  Binary Cross‑Entropy (the Bernoulli case)

For a single sample with label $y \in \{0,1\}$ and model probability $p = \Pr(y=1)$:

$$\mathcal{L}_{\text{BCE}}(p;y) = -\bigl[y\,\log p + (1-y)\,\log(1-p)\bigr]$$

This drops straight out of the Bernoulli likelihood

$$\Pr(y \mid p) = p^{y}(1-p)^{1-y}$$

after taking $-\log$, so products become sums and tiny probabilities don't underflow.
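
Written out, the log simply turns the exponents into coefficients:

$$-\log \Pr(y \mid p) = -\log\bigl[p^{y}(1-p)^{1-y}\bigr] = -\bigl[y\log p + (1-y)\log(1-p)\bigr] = \mathcal{L}_{\text{BCE}}(p;y).$$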

Why the log?

  • Over-confidence tax: As $p \to 0$ with $y = 1$, $-\log p \to \infty$. The optimiser gets a huge gradient kick for being confidently wrong, while a linear penalty like $(1-p)$ barely nudges (see the sketch after this list).
  • Maximum‑likelihood heritage: Minimising NLL = maximising likelihood—70 yrs of statistical guarantees for free.
  • Convexity (for logistic regression): Swap $-\log p$ for $(1-p)^2$ and even simple models lose convexity, turning training into a swamp.
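
A minimal autograd sketch of that gradient kick at one hand-picked, confidently wrong prediction ($y = 1$, $p = 0.01$); the numbers are only illustrative:

import torch

# One confidently wrong prediction: true label y = 1, model says p = 0.01.
p = torch.tensor(0.01, requires_grad=True)

loss_ce = -torch.log(p)        # cross-entropy term for y = 1
loss_ce.backward()
print("-log p   :", loss_ce.item(), " d/dp:", p.grad.item())   # ≈ 4.605, grad ≈ -100

p.grad = None
loss_lin = 1 - p               # linear penalty
loss_lin.backward()
print("1 - p    :", loss_lin.item(), " d/dp:", p.grad.item())  # ≈ 0.99, grad = -1

p.grad = None
loss_sq = (1 - p) ** 2         # squared penalty (MSE for y = 1)
loss_sq.backward()
print("(1 - p)^2:", loss_sq.item(), " d/dp:", p.grad.item())   # ≈ 0.98, grad ≈ -1.98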

Tiny sanity check

p (y = 1)    -log p    1 - p    (1 - p)^2
0.5          0.693     0.5      0.25
0.1          2.303     0.9      0.81
0.01         4.605     0.99     0.98

Cross‑entropy rockets when the model is sure and wrong; the linear and squared penalties plateau at 1 no matter how confidently wrong the prediction is.

PyTorch snippet

import torch, torch.nn.functional as F

y_true = torch.tensor([1, 0, 1], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.2, 0.7])
loss = F.binary_cross_entropy(y_pred, y_true)  # expects probabilities, not logits
print(loss.item())  # ≈ 0.228
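
And the same value computed straight from the BCE formula, as a check that the library call and the maths agree (same tensors as above):

import torch

y_true = torch.tensor([1, 0, 1], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.2, 0.7])

# -[y log p + (1 - y) log(1 - p)], averaged over the batch
manual = -(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred)).mean()
print(manual.item())  # ≈ 0.228, matches F.binary_cross_entropy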

2.  Categorical Cross‑Entropy (softmax generalisation)

For $K$ classes with one-hot $\mathbf{y}$ and softmax probs $\mathbf{p}$:

$$\mathcal{L}_{\text{CE}}(\mathbf{p};\mathbf{y}) = -\sum_{i=1}^{K} y_i\,\log p_i$$

Same story, just the likelihood of a Categorical distribution instead of a Bernoulli. With $K=2$, writing $p_2 = 1-p_1$ and $y_2 = 1-y_1$ collapses the sum straight back to BCE.
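
A minimal sketch with made-up logits for $K = 3$, showing that F.cross_entropy is exactly the formula above evaluated at the true class:

import torch, torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # one sample, K = 3 classes (made-up numbers)
target = torch.tensor([0])                  # index of the true class

# Library call: softmax + negative log-likelihood in one step.
print(F.cross_entropy(logits, target).item())   # ≈ 0.241

# Same thing by hand: -log p_true with p = softmax(logits).
p = F.softmax(logits, dim=-1)
print(-torch.log(p[0, target[0]]).item())       # ≈ 0.241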

  • Information‑theoretic view: $-\log p_i$ is the self‑information in nats. Cross‑entropy is the expected coding cost if the world follows $\mathbf{y}$ but we code with $\mathbf{p}$.
  • Proper scoring rule: Log loss is minimised in expectation exactly when the predicted probabilities match the true distribution, so the model has no incentive to hedge or exaggerate. Squared error on the logits is improper: the optimiser can game it by drifting over- or under-confident (tiny check below).
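
A tiny numeric check of that "proper" claim, with an assumed true probability of q = 0.7: the expected log loss is lowest when the model reports exactly that value.

import numpy as np

q = 0.7                               # assumed true P(y = 1), just for this demo
p = np.linspace(0.01, 0.99, 99)       # candidate reported probabilities

# Expected log loss if outcomes follow q but we report p.
expected_log_loss = -(q * np.log(p) + (1 - q) * np.log(1 - p))
print(p[expected_log_loss.argmin()])  # ≈ 0.70, i.e. honesty minimises the expected loss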

3.  Why $(y-p)^2$ (MSE) falls short for classification

  1. No probabilistic foundation: There's no Bernoulli/Categorical model whose NLL looks like $(y-p)^2$; that form is a Gaussian NLL, and a Gaussian is the wrong noise model for a 0/1 label.
  2. Bad calibration pressure: Squared error on the logits is an improper scoring rule, so the network can shave loss with systematically biased probabilities; squared error on probabilities (the Brier score) is technically proper, but its penalty for a confident mistake caps at 1, so the pull toward calibration is weak.
  3. Dangerous gradients: For $y=1$, $\partial (y-p)^2/\partial p = -2(1-p)$, which never exceeds 2; worse, once you backpropagate through the sigmoid, the MSE gradient w.r.t. the logit is $-2p(1-p)^2$, which vanishes exactly when the model is confidently wrong, while cross-entropy's logit gradient stays at $p - y$ (see the sketch after this list).
  4. Convexity lost: Even logistic regression becomes non‑convex under MSE.
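
A sketch of that gradient gap at a single hand-picked, confidently wrong logit, backpropagating through the sigmoid with autograd:

import torch

# Confidently wrong prediction: true label 1, strongly negative logit (hand-picked).
z = torch.tensor(-4.0, requires_grad=True)   # logit, so p = sigmoid(z) ≈ 0.018
y = 1.0                                      # true label

p = torch.sigmoid(z)
bce = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
bce.backward()
print("BCE dL/dz:", z.grad.item())   # = p - y ≈ -0.98 (strong corrective pull)

z.grad = None
p = torch.sigmoid(z)                 # rebuild the graph for the second loss
mse = (y - p) ** 2
mse.backward()
print("MSE dL/dz:", z.grad.item())   # = -2*p*(1-p)^2 ≈ -0.035 (nearly vanished)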

TL;DR: cross‑entropy isn't "just a historical quirk"; it's the principled maximum‑likelihood choice when your model should spit out well‑calibrated probabilities.


4.  Mental model: log‑loss curves

  • Imagine plotting the $y=1$ loss vs $p$.
  • The BCE curve dives to 0 at $p=1$ and skyrockets near $p=0$.
  • The MSE curve $(1-p)^2$ is a gentle arc that tops out at 1: far too forgiving of confident mistakes, and nearly flat exactly where a strong corrective push is needed.

(quick plot in a notebook)
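
A minimal notebook sketch of that plot ($y = 1$ curves only; matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)          # predicted probability for the true class

plt.plot(p, -np.log(p), label="BCE: -log p")
plt.plot(p, (1 - p) ** 2, label="MSE: (1 - p)^2")
plt.xlabel("p (true label y = 1)")
plt.ylabel("loss")
plt.legend()
plt.show()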



5. Takeaways to remember

  • Cross‑entropy = negative log‑likelihood of Bernoulli/Categorical → maximum‑likelihood estimation.
  • Log transform turns nasty probability products into friendly sums.
  • Heavy penalties for confident mistakes keep gradients healthy.
  • Still convex for single‑layer logistic models, and well behaved in practice for deep nets.
  • Use BCE for binary tasks, CE for multi‑class; reserve MSE for regression.