
Working With Dot Product And Cosine Similarity For Unit Vectors

I was reviewing the OpenAI docs for text embeddings today and came across this section:

Which distance function should I use?

We recommend cosine similarity. The choice of distance function typically doesn’t matter much.

OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in the identical rankings

Hmm, it's been a while since I studied linear algebra, so I wanted to prove this out a bit.

Why Dot Product Equals Cosine Similarity for Unit Vectors

When two vectors are unit vectors, their dot product is the same as their cosine similarity. Here are some notes on why this is the case:

Magnitude of Vector

The magnitude (or norm) of a vector measures the "length" of the vector in Euclidean space. For a vector $\mathbf{v} = (v_1, v_2, \dots, v_n)$, the Euclidean norm is calculated as:

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}$$

Example Calculation

# vector v = [3, 4]
v = [3, 4]

squares = [x ** 2 for x in v]
# [9, 16]

sum_of_squares = sum(squares)
# 25

magnitude = sum_of_squares ** 0.5
# 5

Unit Vectors

To transform text embeddings into unit vectors, OpenAI would've needed to:

  1. Adjust the magnitude of a vector to be exactly 1,
  2. Preserve the direction of the vector in the vector space.

To achieve this, they would've:

  1. Calculated the magnitude of the vector as described above.
  2. Divided each element of the vector by the magnitude:

$$\mathbf{u} = \left( \frac{v_1}{\|\mathbf{v}\|}, \frac{v_2}{\|\mathbf{v}\|}, \dots, \frac{v_n}{\|\mathbf{v}\|} \right)$$

  3. The resulting vector $\mathbf{u}$ would then be a unit vector.

Example Calculation

# magnitude of v is 5, as calculated above
u = [x / magnitude for x in v]
# [0.6, 0.8]

# verify: the magnitude of u is now exactly 1
sum([x ** 2 for x in u]) ** 0.5
# 1.0
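
As a quick cross-check, NumPy's np.linalg.norm computes the same Euclidean norm (a minimal sketch, assuming NumPy is installed):

import numpy as np

v = np.array([3.0, 4.0])
u = v / np.linalg.norm(v)
# array([0.6, 0.8])

np.linalg.norm(u)
# 1.0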

We can also verify visually that this approach produces a vector of length 1 that points in the same direction as the original.

[Figure: unit vector visualization of v = [3, 4] and its normalized counterpart u = [0.6, 0.8]]

Dot Product

The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta)$$

where $\theta$ is the angle between the vectors, and $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ are their magnitudes.
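
In coordinate form, the same dot product works out to the sum of the element-wise products:

$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$

Here's a small sketch of that, using the unit vector from above plus a second made-up unit vector:

# a is the unit vector u from above; b is a hypothetical second unit vector
a = [0.6, 0.8]
b = [0.8, 0.6]

dot_product = sum(x * y for x, y in zip(a, b))
# 0.96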

Cosine Similarity

Cosine similarity is specifically the cosine of the angle $\theta$ between two vectors, which is calculated as:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$
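
To make that concrete, here's a sketch applying the full formula to two raw (non-unit) vectors; the second vector [4, 3] is made up for illustration, and normalizing it gives the [0.8, 0.6] used above:

a = [3, 4]
b = [4, 3]

dot_product = sum(x * y for x, y in zip(a, b))
# 24

mag_a = sum(x ** 2 for x in a) ** 0.5  # 5.0
mag_b = sum(x ** 2 for x in b) ** 0.5  # 5.0

cosine_similarity = dot_product / (mag_a * mag_b)
# 0.96

Note that this matches the 0.96 we got from the plain dot product of the unit versions.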

Simplifying for Unit Vectors

For unit vectors, the magnitude of each vector is 1. Therefore, the formulas simplify:

$$\|\mathbf{a}\| = 1, \quad \|\mathbf{b}\| = 1$$

This makes the dot product:

$$\mathbf{a} \cdot \mathbf{b} = 1 \cdot 1 \cdot \cos(\theta) = \cos(\theta)$$

And the cosine similarity formula simplifies to:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{1 \cdot 1} = \mathbf{a} \cdot \mathbf{b}$$

Thus, when the vectors are unit vectors, the dot product is exactly the cosine of the angle between them, which is the same as their cosine similarity.
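
As an end-to-end sanity check (a sketch reusing the example vectors from above), we can normalize two raw vectors and confirm that the plain dot product of the unit versions equals the cosine similarity of the originals:

def normalize(v):
    magnitude = sum(x ** 2 for x in v) ** 0.5
    return [x / magnitude for x in v]

a, b = [3, 4], [4, 3]
u_a, u_b = normalize(a), normalize(b)

dot_of_units = sum(x * y for x, y in zip(u_a, u_b))
# 0.96 -- same as the cosine similarity of the raw vectors

This also explains the Euclidean distance bullet from the OpenAI docs: for unit vectors, $\|\mathbf{a} - \mathbf{b}\|^2 = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - 2\,\mathbf{a} \cdot \mathbf{b} = 2 - 2\cos(\theta)$, so Euclidean distance is a monotonically decreasing function of cosine similarity, and ranking by either gives identical orderings.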

Helpful Resources