So L1 regularization is not just "named after" the L1 norm by coincidence — it directly uses the L1 norm as its penalty term in cost functions (and similarly for L2).

Here's the core insight:

The L1 norm, denoted $\|w\|_1$, is simply the sum of absolute values of a parameter vector. L1 regularization, or lasso, adds a penalty proportional to that same sum:

$$
w = (3, -4), \qquad \|w\|_1 = |3| + |-4| = 7, \qquad \text{L1 penalty} = \lambda \cdot (|3| + |-4|) = 7\lambda
$$
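As a quick sanity check of the arithmetic above, here is a minimal NumPy sketch (the value of `lam` is just an illustrative placeholder for $\lambda$):

```python
import numpy as np

w = np.array([3.0, -4.0])
lam = 0.01                   # illustrative regularization strength

l1_norm = np.abs(w).sum()    # |3| + |-4| = 7
l1_penalty = lam * l1_norm   # 7 * lam

print(l1_norm, l1_penalty)   # 7.0 0.07
```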

The L2 norm, denoted $\|w\|_2$, is the square root of the sum of squares of the components. In practice, however, L2 regularization (ridge) commonly uses the sum of squares directly (often with a $\frac{1}{2}$ factor) because its derivative is much simpler:

$$
w = (3, -4), \qquad \|w\|_2 = \sqrt{3^2 + (-4)^2} = 5, \qquad \text{L2 penalty} = \lambda \cdot \frac{1}{2}\left(3^2 + (-4)^2\right) = \lambda \cdot \frac{25}{2} = 12.5\lambda
$$
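And the corresponding check for the ridge penalty, reusing the same illustrative `w` and `lam`:

```python
import numpy as np

w = np.array([3.0, -4.0])
lam = 0.01                                  # same illustrative value as above

l2_norm = np.sqrt((w ** 2).sum())           # sqrt(9 + 16) = 5
ridge_penalty = lam * 0.5 * (w ** 2).sum()  # 0.5 * 25 * lam = 12.5 * lam

print(l2_norm, ridge_penalty)               # 5.0 0.125
```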

Although both are called "L2," ridge drops the square root to simplify gradient computation.
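To see why, compare the gradients of the two forms with respect to a single weight $w_j$:

$$
\frac{\partial}{\partial w_j}\left(\frac{\lambda}{2}\sum_i w_i^2\right) = \lambda w_j,
\qquad
\frac{\partial}{\partial w_j}\left(\lambda \|w\|_2\right) = \lambda \frac{w_j}{\|w\|_2}
$$

The squared form contributes a gradient term that is simply proportional to the weight itself, whereas the unsquared norm introduces a division by $\|w\|_2$ and is not differentiable at $w = 0$.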

Next, in regression:

  • Lasso (L1-regularized) Regression

    $\text{cost} = \text{MSE} + \lambda \, \|w\|_1$

    Encourages sparsity by forcing some coefficients exactly to zero. Geometrically, the diamond-shaped L1 constraint favors axis-aligned solutions.

  • Ridge (L2-regularized) Regression

    $\text{cost} = \text{MSE} + \lambda \sum_j w_j^2$

    Shrinks coefficients toward zero but rarely drives them to exactly zero. The circular L2 constraint yields smooth shrinkage around the origin (both penalties are demonstrated in the sketch after this list).
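A minimal scikit-learn sketch of the two regressions above, assuming a small synthetic dataset in which only the first two of ten features matter; `alpha` plays the role of $\lambda$, and the exact coefficient values will vary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 2 of 10 features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1-penalized least squares
ridge = Ridge(alpha=1.0).fit(X, y)  # L2-penalized least squares

print(lasso.coef_)  # irrelevant coefficients are typically exactly 0
print(ridge.coef_)  # all coefficients shrunk, but generally nonzero
```

Note that scikit-learn's objectives differ from the formulas above only by constant scaling factors on the error term, so the fitted behavior (sparsity vs. smooth shrinkage) is the same.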

Both regularizers are literally named after the underlying norm they employ. In regression, adding these penalties trades off data-fit (low error) against model complexity (smaller coefficients), helping prevent overfitting. The same principles apply when training neural networks, where L1 or L2 penalties are added to the network's loss function to control weight sizes.

In practice, L2 regularization (often called "weight decay") is far more common for neural networks than L1, primarily because its smoother penalty integrates well with gradient-based optimizers and effectively discourages large weights without necessarily forcing them to zero.
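A minimal PyTorch sketch of both options on a toy linear model; `weight_decay` applies the L2 penalty inside the optimizer, while the L1 term is added to the loss by hand (`l1_lambda` is just an illustrative hyperparameter name):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                        # toy model
x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch

# L2 regularization ("weight decay") handled by the optimizer itself.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 regularization added manually to the loss.
l1_lambda = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = F.mse_loss(model(x), y) + l1_lambda * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice you would rarely combine both penalties; the point is just to show where each one enters the training loop.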

Key takeaways:

  • L1 regularization = $\lambda \cdot \|w\|_1$ (sum of absolute weights) → sparsity (lasso)
  • L2 regularization = $\lambda \cdot \|w\|_2^2$ (sum of squared weights) → smooth shrinkage (ridge)
  • Names derive directly from the mathematical norms they use, not just convention.