So L1 regularization is not just "named after" the L1 norm by coincidence — it directly uses the L1 norm as its penalty term in cost functions (and similarly for L2).

Here's the core insight:

The L1 norm, denoted $\|w\|_1$, is simply the sum of absolute values of a parameter vector. L1 regularization, or lasso, adds a penalty proportional to that same sum:

$$
w = (3, -4), \qquad \|w\|_1 = |3| + |-4| = 7, \qquad \text{L1 penalty} = \lambda \cdot (|3| + |-4|) = 7\lambda
$$
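As a quick sanity check of the arithmetic above, here is a minimal NumPy sketch (the value of `lam` is just an illustrative placeholder for $\lambda$):

```python
import numpy as np

w = np.array([3.0, -4.0])
lam = 0.01                   # illustrative regularization strength

l1_norm = np.abs(w).sum()    # |3| + |-4| = 7
l1_penalty = lam * l1_norm   # 7 * lam

print(l1_norm, l1_penalty)   # 7.0 0.07
```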

The L2 norm, denoted $\|w\|_2$, is the square root of the sum of squares of the components. In practice, however, L2 regularization (ridge) commonly uses the sum of squares directly (often with a $\frac{1}{2}$ factor) because its derivative is much simpler:

$$
w = (3, -4), \qquad \|w\|_2 = \sqrt{3^2 + (-4)^2} = 5, \qquad \text{L2 penalty} = \lambda \cdot \frac{1}{2}\left(3^2 + (-4)^2\right) = \lambda \cdot \frac{25}{2} = 12.5\lambda
$$
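And the corresponding check for the ridge penalty, reusing the same illustrative `w` and `lam`:

```python
import numpy as np

w = np.array([3.0, -4.0])
lam = 0.01                                  # same illustrative value as above

l2_norm = np.sqrt((w ** 2).sum())           # sqrt(9 + 16) = 5
ridge_penalty = lam * 0.5 * (w ** 2).sum()  # 0.5 * 25 * lam = 12.5 * lam

print(l2_norm, ridge_penalty)               # 5.0 0.125
```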

Although both are called "L2," ridge drops the square root to simplify gradient computation.
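To see why, compare the gradients of the two forms with respect to a single weight $w_j$:

$$
\frac{\partial}{\partial w_j}\left(\frac{\lambda}{2}\sum_i w_i^2\right) = \lambda w_j,
\qquad
\frac{\partial}{\partial w_j}\left(\lambda \|w\|_2\right) = \lambda \frac{w_j}{\|w\|_2}
$$

The squared form contributes a gradient term that is simply proportional to the weight itself, whereas the unsquared norm introduces a division by $\|w\|_2$ and is not differentiable at $w = 0$.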

Next, in regression:

  • Lasso (L1-regularized) Regression

    $\text{cost} = \text{MSE} + \lambda \, \|w\|_1$

    Encourages sparsity by forcing some coefficients exactly to zero. Geometrically, the diamond-shaped L1 constraint favors axis-aligned solutions.

  • Ridge (L2-regularized) Regression

    $\text{cost} = \text{MSE} + \lambda \sum_j w_j^2$

    Shrinks coefficients toward zero but rarely drives them to exactly zero. The circular L2 constraint yields smooth shrinkage around the origin (both penalties are demonstrated in the sketch after this list).
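A minimal scikit-learn sketch of the two regressions above, assuming a small synthetic dataset in which only the first two of ten features matter; `alpha` plays the role of $\lambda$, and the exact coefficient values will vary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 2 of 10 features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1-penalized least squares
ridge = Ridge(alpha=1.0).fit(X, y)  # L2-penalized least squares

print(lasso.coef_)  # irrelevant coefficients are typically exactly 0
print(ridge.coef_)  # all coefficients shrunk, but generally nonzero
```

Note that scikit-learn's objectives differ from the formulas above only by constant scaling factors on the error term, so the fitted behavior (sparsity vs. smooth shrinkage) is the same.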

Both regularizers are literally named after the underlying norm they employ. In regression, adding these penalties trades off data-fit (low error) against model complexity (smaller coefficients), helping prevent overfitting. The same principles apply when training neural networks, where L1 or L2 penalties are added to the network's loss function to control weight sizes.

In practice, L2 regularization (often called "weight decay") is far more common for neural networks than L1, primarily because its smoother penalty integrates well with gradient-based optimizers and effectively discourages large weights without necessarily forcing them to zero.
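A minimal PyTorch sketch of both options on a toy linear model; `weight_decay` applies the L2 penalty inside the optimizer, while the L1 term is added to the loss by hand (`l1_lambda` is just an illustrative hyperparameter name):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                        # toy model
x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch

# L2 regularization ("weight decay") handled by the optimizer itself.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 regularization added manually to the loss.
l1_lambda = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = F.mse_loss(model(x), y) + l1_lambda * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice you would rarely combine both penalties; the point is just to show where each one enters the training loop.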

Key takeaways:

  • L1 regularization = $\lambda \cdot \|w\|_1$ (sum of absolute weights) → sparsity (lasso)
  • L2 regularization = $\lambda \cdot \|w\|_2^2$ (sum of squared weights) → smooth shrinkage (ridge)
  • Names derive directly from the mathematical norms they use, not just convention.