Why Cross‑Entropy Still Rules Loss Functions
Why do today’s giant language models still rely on something as “simple” as cross‑entropy loss? Why is minimizing the negative log‑likelihood (NLL) so much more powerful than, say, squaring the difference between the true label and the predicted probability?
Wanted to save this tidy refresher.
1. Binary Cross‑Entropy (the Bernoulli case)
For a single sample with label $y \in \{0, 1\}$ and model probability $\hat{p} = P(y = 1)$:

$$\mathcal{L}_{\text{BCE}} = -\bigl[\,y \log \hat{p} + (1 - y)\log(1 - \hat{p})\,\bigr]$$

This drops straight out of the Bernoulli likelihood

$$P(y \mid \hat{p}) = \hat{p}^{\,y}\,(1 - \hat{p})^{1 - y}$$

after taking logs, so products become sums and tiny probabilities don't underflow.
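Here's a minimal sketch of that derivation (the labels and probabilities are made up for illustration): the batch likelihood is a product of Bernoulli terms, and its negative log, averaged over samples, is exactly what `F.binary_cross_entropy` returns.

```python
import torch
import torch.nn.functional as F

# toy batch: true labels and predicted P(y=1) (values made up)
y = torch.tensor([1.0, 0.0, 1.0])
p = torch.tensor([0.9, 0.2, 0.7])

# Bernoulli likelihood of the whole batch: a product of per-sample terms
likelihood = torch.prod(p ** y * (1 - p) ** (1 - y))

# take logs: the product becomes a sum; average it and flip the sign
nll = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

print(likelihood.item())                    # 0.9 * 0.8 * 0.7 = 0.504
print(nll.item())                           # ≈ 0.228
print(F.binary_cross_entropy(p, y).item())  # same number
```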
Why the log?
- Over‑confidence tax: As $\hat{p} \to 0$ with $y = 1$, $-\log \hat{p} \to \infty$. The optimiser gets a huge gradient kick for being confidently wrong; a linear penalty like $|y - \hat{p}|$ barely nudges it (see the sketch after this list).
- Maximum‑likelihood heritage: Minimising NLL = maximising likelihood—70 yrs of statistical guarantees for free.
- Convexity (for logistic regression): Swap $-\log \hat{p}$ for $(y - \hat{p})^2$ and even simple models lose convexity, turning training into a swamp.
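A quick sketch of that gradient kick (the probe probabilities are arbitrary and $y = 1$ is assumed): the magnitude of $\partial(-\log\hat{p})/\partial\hat{p} = 1/\hat{p}$ explodes as the model gets confidently wrong, while the squared-error gradient $2(1 - \hat{p})$ never exceeds 2.

```python
import torch

# probe points from confident-and-right to confident-and-wrong (true label y = 1)
p = torch.tensor([0.99, 0.9, 0.5, 0.1, 0.01], requires_grad=True)

ce = -torch.log(p)       # cross-entropy per sample
se = (1 - p) ** 2        # squared error per sample

# gradient magnitudes w.r.t. p_hat
print(torch.autograd.grad(ce.sum(), p, retain_graph=True)[0].abs())
# ≈ [1.01, 1.11, 2.0, 10.0, 100.0]   -- 1/p, huge when confidently wrong
print(torch.autograd.grad(se.sum(), p)[0].abs())
# ≈ [0.02, 0.20, 1.0,  1.8,   1.98]  -- 2*(1-p), capped at 2
```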
Tiny sanity check
| $\hat{p}$ (with $y = 1$) | $-\log \hat{p}$ | $\lvert y - \hat{p} \rvert$ | $(y - \hat{p})^2$ |
|---|---|---|---|
| 0.5 | 0.693 | 0.5 | 0.25 |
| 0.1 | 2.303 | 0.9 | 0.81 |
| 0.01 | 4.605 | 0.99 | 0.98 |
Cross‑entropy rockets when the model is sure and wrong; the absolute and squared penalties saturate at 1, so they barely distinguish "wrong" from "catastrophically wrong".
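The table regenerates in a couple of lines (a quick sketch, taking $y = 1$ so the three columns are $-\log\hat{p}$, $1 - \hat{p}$, and $(1 - \hat{p})^2$):

```python
import math

# y = 1, so: cross-entropy = -log(p), absolute error = 1 - p, squared error = (1 - p)^2
for p in (0.5, 0.1, 0.01):
    print(p, round(-math.log(p), 3), round(1 - p, 2), round((1 - p) ** 2, 2))
# 0.5  0.693  0.5   0.25
# 0.1  2.303  0.9   0.81
# 0.01 4.605  0.99  0.98
```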
PyTorch snippet
```python
import torch, torch.nn.functional as F

# true labels and the model's predicted P(y=1) for three samples
y_true = torch.tensor([1, 0, 1], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.2, 0.7])

# mean of -[y*log(p) + (1-y)*log(1-p)] over the batch
loss = F.binary_cross_entropy(y_pred, y_true)
print(loss.item())  # ≈ 0.228
```
2. Categorical Cross‑Entropy (softmax generalisation)
For $K$ classes with one‑hot target $\mathbf{y}$ and softmax probs $\hat{\mathbf{p}}$:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$

Same story, just the likelihood of a Categorical distribution instead of a Bernoulli. When $K = 2$ and you drop the redundant term you're back at BCE.
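A minimal sketch of the categorical case (the logits and class indices are made up): PyTorch's `F.cross_entropy` takes raw logits plus integer class labels, fuses the softmax with the NLL, and agrees with the $-\sum_k y_k \log \hat{p}_k$ formula computed by hand.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],   # two samples, K = 3 classes (made-up logits)
                       [0.1, 0.2,  3.0]])
targets = torch.tensor([0, 2])             # true class indices

# built-in: log-softmax + NLL fused, averaged over the batch
loss = F.cross_entropy(logits, targets)

# by hand: the one-hot y picks out -log(p) of the true class
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(2), targets]).mean()

print(loss.item(), manual.item())          # same value (up to float error)
```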
- Information‑theoretic view: $-\log \hat{p}_k$ is the self‑information of class $k$ in nats. Cross‑entropy is the expected coding cost if the world follows $p$ but we code with $q = \hat{\mathbf{p}}$ (see the sketch after this list).
- Proper scoring rule: log loss is minimised in expectation exactly when the predicted probabilities equal the true distribution, and it's essentially the only strictly proper rule that depends solely on the probability assigned to the observed class. Squared‑error on logits is improper—the optimiser can game it by becoming over‑/under‑confident.
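To make the coding‑cost view concrete, here's a sketch (the distributions p and q are made up) of the identity $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$: the excess cost of coding with the model's $q$ instead of the true $p$ is exactly the KL divergence.

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])   # "true" distribution (made up)
q = torch.tensor([0.5, 0.3, 0.2])   # model's distribution (made up)

cross_entropy = -(p * torch.log(q)).sum()   # expected coding cost using q
entropy       = -(p * torch.log(p)).sum()   # best achievable cost (code with p)
kl            =  (p * torch.log(p / q)).sum()

print(cross_entropy.item(), (entropy + kl).item())   # equal: H(p,q) = H(p) + KL(p||q)
```

For a one‑hot target the entropy term is zero, so minimising cross‑entropy is exactly minimising $D_{\mathrm{KL}}(\mathbf{y} \,\|\, \hat{\mathbf{p}})$.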
3. Why $(y - \hat{p})^2$ (MSE) falls short for classification
- No probabilistic foundation: there's no Bernoulli/Categorical model whose NLL looks like $(y - \hat{p})^2$; it's the NLL of a Gaussian, which is the wrong noise model for 0/1 labels.
- Bad calibration: squared error on the logits is an improper scoring rule—your network can shave loss by outputting systematically biased probs. (Squared error on the probabilities, the Brier score, is technically proper but still suffers from the gradient problem below.)
- Dangerous gradients: for $\hat{p} = \sigma(z)$, $\partial (y - \hat{p})^2 / \partial z = -2\,(y - \hat{p})\,\hat{p}(1 - \hat{p})$. The $\hat{p}(1 - \hat{p})$ factor evaporates as $\hat{p} \to 0$ or $1$, so a confidently wrong model gets almost no corrective signal; cross‑entropy's gradient w.r.t. the logit is simply $\hat{p} - y$ (checked numerically in the sketch after this list).
- Convexity lost: Even logistic regression becomes non‑convex under MSE.
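A sketch of those gradients via autograd (the logit probe values are arbitrary and the sample is a positive one, $y = 1$): with sigmoid + MSE the gradient vanishes at both ends of the logit range, including when the model is confidently wrong, while BCE keeps it at roughly $\hat{p} - y \approx -1$ there.

```python
import torch
import torch.nn.functional as F

# logits from "confidently wrong" (-6) to "confidently right" (+6) for a y = 1 sample
z = torch.tensor([-6.0, -2.0, 0.0, 2.0, 6.0], requires_grad=True)
y = torch.ones_like(z)

mse = ((y - torch.sigmoid(z)) ** 2).sum()
bce = F.binary_cross_entropy_with_logits(z, y, reduction="sum")

g_mse, = torch.autograd.grad(mse, z, retain_graph=True)
g_bce, = torch.autograd.grad(bce, z)

print(g_mse)  # ≈ [-0.005, -0.185, -0.250, -0.025, -0.000]  vanishes when confidently wrong
print(g_bce)  # ≈ [-0.998, -0.881, -0.500, -0.119, -0.002]  stays ≈ -1 when confidently wrong
```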
TL;DR — cross‑entropy isn’t “just a historical quirk”; it’s the mathematically inevitable choice when your model should spit out well‑calibrated probabilities.
4. Mental model: log‑loss curves
- Imagine plotting loss vs $\hat{p}$ for a positive example ($y = 1$).
- The BCE curve $-\log \hat{p}$ dives to 0 at $\hat{p} = 1$ and skyrockets near $\hat{p} = 0$.
- The MSE curve $(1 - \hat{p})^2$ is a gentle parabola capped at 1, far too forgiving on confidently wrong guesses.
(quick plot in a notebook)
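That notebook plot is a few lines of matplotlib (a sketch, assuming $y = 1$ so the curves are $-\log\hat{p}$ and $(1 - \hat{p})^2$):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)               # predicted probability for the true class

plt.plot(p, -np.log(p), label="BCE: -log(p)")    # explodes as p -> 0
plt.plot(p, (1 - p) ** 2, label="MSE: (1-p)^2")  # bounded by 1
plt.xlabel("predicted probability for the true class")
plt.ylabel("loss")
plt.legend()
plt.show()
```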
5. Takeaways to remember
- Cross‑entropy = negative log‑likelihood of Bernoulli/Categorical → maximum‑likelihood estimation.
- Log transform turns nasty probability products into friendly sums.
- Heavy penalties for confident mistakes keep gradients healthy.
- Still convex for single‑layer logistic models, behaves well in practice for deep nets.
- Use BCE for binary tasks, CE for multi‑class; reserve MSE for regression.