Why do today’s giant language models still rely on something as “simple” as cross‑entropy loss? Why is minimizing the negative log‑likelihood (NLL) so much more powerful than, say, squaring the difference between the true label and the predicted probability?

Wanted to save this tidy refresher.


1.  Binary Cross‑Entropy (the Bernoulli case)

For a single sample with label $y \in \{0,1\}$ and model probability $p = \Pr(y=1)$:

$$\mathcal{L}_{\text{BCE}}(p;y) = -\bigl[y\,\log p + (1-y)\,\log(1-p)\bigr]$$

This drops straight out of the Bernoulli likelihood

$$\Pr(y \mid p) = p^{y}(1-p)^{1-y}$$

after taking $-\log$, so products become sums and tiny probabilities don't underflow.
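
Written out, the log simply turns the exponents into coefficients:

$$-\log \Pr(y \mid p) = -\log\bigl[p^{y}(1-p)^{1-y}\bigr] = -\bigl[y\log p + (1-y)\log(1-p)\bigr] = \mathcal{L}_{\text{BCE}}(p;y).$$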

Why the log?

  • Over-confidence tax: As $p \to 0$ with $y = 1$, $-\log p \to \infty$. The optimiser gets a huge gradient kick for being confidently wrong, while a linear penalty like $(1-p)$ barely nudges (see the sketch after this list).
  • Maximum‑likelihood heritage: Minimising NLL = maximising likelihood—70 yrs of statistical guarantees for free.
  • Convexity (for logistic regression): Swap $-\log p$ for $(1-p)^2$ and even simple models lose convexity, turning training into a swamp.
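
A minimal autograd sketch of that gradient kick at one hand-picked, confidently wrong prediction ($y = 1$, $p = 0.01$); the numbers are only illustrative:

import torch

# One confidently wrong prediction: true label y = 1, model says p = 0.01.
p = torch.tensor(0.01, requires_grad=True)

loss_ce = -torch.log(p)        # cross-entropy term for y = 1
loss_ce.backward()
print("-log p   :", loss_ce.item(), " d/dp:", p.grad.item())   # ≈ 4.605, grad ≈ -100

p.grad = None
loss_lin = 1 - p               # linear penalty
loss_lin.backward()
print("1 - p    :", loss_lin.item(), " d/dp:", p.grad.item())  # ≈ 0.99, grad = -1

p.grad = None
loss_sq = (1 - p) ** 2         # squared penalty (MSE for y = 1)
loss_sq.backward()
print("(1 - p)^2:", loss_sq.item(), " d/dp:", p.grad.item())   # ≈ 0.98, grad ≈ -1.98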

Tiny sanity check

p (y = 1)    -log p    1 - p    (1 - p)^2
0.5          0.693     0.5      0.25
0.1          2.303     0.9      0.81
0.01         4.605     0.99     0.98

Cross‑entropy rockets when the model is sure and wrong; the linear and squared penalties plateau at 1 no matter how confidently wrong the prediction is.

PyTorch snippet

import torch, torch.nn.functional as F

y_true = torch.tensor([1, 0, 1], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.2, 0.7])
loss = F.binary_cross_entropy(y_pred, y_true)  # expects probabilities, not logits
print(loss.item())  # ≈ 0.228
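
And the same value computed straight from the BCE formula, as a check that the library call and the maths agree (same tensors as above):

import torch

y_true = torch.tensor([1, 0, 1], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.2, 0.7])

# -[y log p + (1 - y) log(1 - p)], averaged over the batch
manual = -(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred)).mean()
print(manual.item())  # ≈ 0.228, matches F.binary_cross_entropy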

2.  Categorical Cross‑Entropy (softmax generalisation)

For $K$ classes with one-hot $\mathbf{y}$ and softmax probs $\mathbf{p}$:

$$\mathcal{L}_{\text{CE}}(\mathbf{p};\mathbf{y}) = -\sum_{i=1}^{K} y_i\,\log p_i$$

Same story, just the likelihood of a Categorical distribution instead of a Bernoulli. With $K=2$, writing $p_2 = 1-p_1$ and $y_2 = 1-y_1$ collapses the sum straight back to BCE.
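
A minimal sketch with made-up logits for $K = 3$, showing that F.cross_entropy is exactly the formula above evaluated at the true class:

import torch, torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # one sample, K = 3 classes (made-up numbers)
target = torch.tensor([0])                  # index of the true class

# Library call: softmax + negative log-likelihood in one step.
print(F.cross_entropy(logits, target).item())   # ≈ 0.241

# Same thing by hand: -log p_true with p = softmax(logits).
p = F.softmax(logits, dim=-1)
print(-torch.log(p[0, target[0]]).item())       # ≈ 0.241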

  • Information‑theoretic view: $-\log p_i$ is the self‑information in nats. Cross‑entropy is the expected coding cost if the world follows $\mathbf{y}$ but we code with $\mathbf{p}$.
  • Proper scoring rule: Log loss is minimised in expectation exactly when the predicted probabilities match the true distribution, so the model has no incentive to hedge or exaggerate. Squared error on the logits is improper: the optimiser can game it by drifting over- or under-confident (tiny check below).
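
A tiny numeric check of that "proper" claim, with an assumed true probability of q = 0.7: the expected log loss is lowest when the model reports exactly that value.

import numpy as np

q = 0.7                               # assumed true P(y = 1), just for this demo
p = np.linspace(0.01, 0.99, 99)       # candidate reported probabilities

# Expected log loss if outcomes follow q but we report p.
expected_log_loss = -(q * np.log(p) + (1 - q) * np.log(1 - p))
print(p[expected_log_loss.argmin()])  # ≈ 0.70, i.e. honesty minimises the expected loss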

3.  Why $(y-p)^2$ (MSE) falls short for classification

  1. No probabilistic foundation: There's no Bernoulli/Categorical model whose NLL looks like $(y-p)^2$; that form is a Gaussian NLL, and a Gaussian is the wrong noise model for a 0/1 label.
  2. Bad calibration pressure: Squared error on the logits is an improper scoring rule, so the network can shave loss with systematically biased probabilities; squared error on probabilities (the Brier score) is technically proper, but its penalty for a confident mistake caps at 1, so the pull toward calibration is weak.
  3. Dangerous gradients: For $y=1$, $\partial (y-p)^2/\partial p = -2(1-p)$, which never exceeds 2; worse, once you backpropagate through the sigmoid, the MSE gradient w.r.t. the logit is $-2p(1-p)^2$, which vanishes exactly when the model is confidently wrong, while cross-entropy's logit gradient stays at $p - y$ (see the sketch after this list).
  4. Convexity lost: Even logistic regression becomes non‑convex under MSE.
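
A sketch of that gradient gap at a single hand-picked, confidently wrong logit, backpropagating through the sigmoid with autograd:

import torch

# Confidently wrong prediction: true label 1, strongly negative logit (hand-picked).
z = torch.tensor(-4.0, requires_grad=True)   # logit, so p = sigmoid(z) ≈ 0.018
y = 1.0                                      # true label

p = torch.sigmoid(z)
bce = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
bce.backward()
print("BCE dL/dz:", z.grad.item())   # = p - y ≈ -0.98 (strong corrective pull)

z.grad = None
p = torch.sigmoid(z)                 # rebuild the graph for the second loss
mse = (y - p) ** 2
mse.backward()
print("MSE dL/dz:", z.grad.item())   # = -2*p*(1-p)^2 ≈ -0.035 (nearly vanished)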

TL;DR: cross‑entropy isn't "just a historical quirk"; it's the principled maximum‑likelihood choice when your model should spit out well‑calibrated probabilities.


4.  Mental model: log‑loss curves

  • Imagine plotting the $y=1$ loss vs $p$.
  • The BCE curve dives to 0 at $p=1$ and skyrockets near $p=0$.
  • The MSE curve $(1-p)^2$ is a gentle arc that tops out at 1: far too forgiving of confident mistakes, and nearly flat exactly where a strong corrective push is needed.

(quick plot in a notebook)
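
A minimal notebook sketch of that plot ($y = 1$ curves only; matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)          # predicted probability for the true class

plt.plot(p, -np.log(p), label="BCE: -log p")
plt.plot(p, (1 - p) ** 2, label="MSE: (1 - p)^2")
plt.xlabel("p (true label y = 1)")
plt.ylabel("loss")
plt.legend()
plt.show()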



5. Takeaways to remember

  • Cross‑entropy = negative log‑likelihood of Bernoulli/Categorical → maximum‑likelihood estimation.
  • Log transform turns nasty probability products into friendly sums.
  • Heavy penalties for confident mistakes keep gradients healthy.
  • Still convex for single‑layer logistic models, and well behaved in practice for deep nets.
  • Use BCE for binary tasks, CE for multi‑class; reserve MSE for regression.