Curse of Dimensionality
Today I revisited the "curse of dimensionality," a concept ML folks love to cite whenever algorithms have to deal with high-dimensional data. I'd always understood it at a surface level (more dimensions, more possible explanations, more data needed), but reading Anil Ananthaswamy's book, Why Machines Learn, gave me a fresh perspective on two points:
- Data sparsity: In thousands of dimensions, data points become extremely isolated. As Julie Delon memorably said, "In high-dimensional spaces, nobody can hear you scream."
- Corner drift: As dimensions grow, randomly placed points naturally drift toward the corners of the hypercube, making distances confusingly similar.
1. Why data gets sparse
As you add more dimensions, the amount of "space" grows exponentially. If your dataset size doesn't keep up (and usually, it can't), your points end up extremely far apart. A quick example:
Imagine 10 random points placed uniformly in a d-dimensional cube [0,1]^d. To compute the average nearest-neighbor distance:
- Randomly generate points uniformly across each dimension.
- Compute Euclidean distance between every pair of points.
- For each point, identify the closest neighbor and record this distance.
- Average these distances across all points.
Python code to simulate:
import numpy as np

def avg_nearest_neighbor_distance(dim, num_points=10, trials=1000):
    avg_distances = []
    for _ in range(trials):
        # generate random points in a d-dimensional unit cube
        points = np.random.rand(num_points, dim)
        # points[:, np.newaxis] - points broadcasts to the difference vectors
        # between all pairs of points; np.linalg.norm along the last axis turns
        # them into a (num_points, num_points) matrix of Euclidean distances,
        # including each point's distance to itself
        dist_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)
        # set self-distances to infinity so they are never the minimum
        np.fill_diagonal(dist_matrix, np.inf)
        # find the nearest-neighbor distance for each point
        min_distances = dist_matrix.min(axis=1)
        # average the nearest-neighbor distances for this trial
        avg_distances.append(min_distances.mean())
    return np.mean(avg_distances)

for d in [1, 2, 3, 10]:
    print(f"Dimension {d}: {avg_nearest_neighbor_distance(d):.2f}")
Here's what happens practically:
Dimension | Avg. distance to nearest neighbor |
---|---|
1 | ~0.05 (dense) |
2 | ~0.20 |
3 | ~0.30 |
10 | ~0.90 (roughly the length of a cube edge; no close neighbors left) |
In high dimensions, your points become so far apart they're essentially isolated. Algorithms like k-NN can't find meaningful neighbors because there aren't any close ones left.
2. Why points drift toward corners (with calculations)
Here's a clearer explanation with concrete examples:
- Place a hypercube around the origin, from [-1, 1] in every dimension.
- Inside this cube, place a sphere just touching each cube face, radius exactly 1.
- Compare volumes of spheres and cubes:
In 3 Dimensions:
- Sphere volume: (4/3)πr^3 = 4π/3 ≈ 4.19
- Cube volume: 2^3 = 8
- Volume ratio: 4.19 / 8 ≈ 0.52 (about 52%)
In 4 Dimensions:
- Sphere volume: (π^2/2)r^4 = π^2/2 ≈ 4.93
- Cube volume: 2^4 = 16
- Volume ratio: 4.93 / 16 ≈ 0.31 (about 31%)
As dimensions increase, the sphere occupies drastically less volume relative to the cube:
- At 10D, this ratio drops to approximately 0.25% (the unit ball's volume is π^5/120 ≈ 2.55, versus 2^10 = 1024 for the cube).
Since nearly all of the cube's volume lies outside this shrinking central sphere, points generated uniformly within the cube overwhelmingly land near the corners, far away from the center.
3. Practical takeaway: distances stop making sense
Because all points end up far apart and roughly equally distant from each other, distance-based algorithms (like k-NN and clustering) struggle badly. They rely on similar points being noticeably closer than dissimilar ones, and that contrast disappears in high dimensions.
4. How to beat the curse: dimensionality reduction
To mitigate this, reduce dimensionality early in your pipeline:
- PCA: Finds fewer dimensions that capture most of the data's variance (scikit-learn); a minimal sketch follows below.
- t-SNE/UMAP: Nonlinear methods that preserve local structure in fewer dimensions (t-SNE, UMAP).
- Autoencoders: Neural networks that compress data into lower-dimensional representations (Keras guide).
Key Takeaways
- High-dimensional spaces grow exponentially, leaving your points lonely.
- Points naturally drift toward hypercube corners, making distances meaningless.
- Distance-based ML algorithms struggle unless you actively reduce dimensionality.