Today I bumped into FAISS (Facebook AI Similarity Search) and wondered how it stacks up against my default choice, pgvector.

Turns out they live in different zip codes:

|                  | FAISS                                                    | pgvector                                          |
|------------------|----------------------------------------------------------|---------------------------------------------------|
| What it is       | C++/Python library                                       | Postgres extension                                |
| Where data lives | RAM (you serialise manually)                             | Durable tables                                    |
| Query API        | Python / C++                                             | SQL                                               |
| Metadata filters | DIY                                                      | `WHERE …`                                         |
| Scale-out        | Shard yourself                                           | Standard Postgres tooling                         |
| Sweet spot       | Ultra-low-latency top-K look-ups inside an ML pipeline   | Product features that need storage + vector search |

Toy problem - modified MNIST

“Given a query digit, return the 5 nearest neighbours that share its label.”

Source data lives at
https://datasets.r2pi.co/mnist_train.json
Each record is:

{
  "image": [0, 0, …, 0],   // length-196 float/int
  "label": 7               // 0-9
}
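Before building anything on top of it, it's worth pinning down the two invariants that schema implies: 196 dimensions, labels 0–9. A minimal sanity check (with a stand-in record, since downloading is beside the point here):

```python
import json

# A stand-in record with the shape described above (real records come from the URL)
raw = json.dumps({"image": [0.0] * 196, "label": 7})
record = json.loads(raw)

# The two invariants the indexing code relies on
assert len(record["image"]) == 196, "expected a 196-dim vector"
assert 0 <= record["label"] <= 9, "labels are digits 0-9"
print("record ok:", len(record["image"]), "dims, label", record["label"])
```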

1. FAISS (Python)

# for colab
# %pip install faiss-cpu

import json
import requests
import numpy as np
import faiss
import pickle
from tabulate import tabulate

url   = "https://datasets.r2pi.co/mnist_train.json"
items = json.loads(requests.get(url).text)

vecs   = np.array([x["image"] for x in items], dtype="float32")
labels = np.array([x["label"] for x in items], dtype="int64")

vecs[123], labels[123]

# (array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 0.,
#         0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32),
#  np.int64(7))

# (optional) L2-normalise so inner-product ≈ cosine
faiss.normalize_L2(vecs)

# --- build / persist index ---------------------------------------------
index = faiss.IndexFlatIP(196)          # 196-dim, exact inner-product search
index.add(vecs)                         # the index lives only in RAM
faiss.write_index(index, "mnist.ip")    # FAISS's own serialiser; reload with faiss.read_index

# --- query --------------------------------------------------------------
q = vecs[123:124]                       # pretend row 123 is our query, label 7
dists, ids = index.search(q, 5)         # top-5 nearest neighbours

# --- check top-5 ----------------------------------------------------------
table_data = [{"id": int(i), "label": int(labels[i])} for i in ids[0]]

# Use tabulate to create a formatted table
print(tabulate(table_data, headers="keys", tablefmt="grid"))

+------+---------+
|   id |   label |
+======+=========+
|  123 |       7 |
+------+---------+
| 3788 |       7 |
+------+---------+
| 5253 |       7 |
+------+---------+
| 1513 |       7 |
+------+---------+
|  911 |       7 |
+------+---------+

This ran blazing fast in a Colab notebook (sub-ms on CPU / GPU), but storage and filtering are entirely on you.
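That last point is worth making concrete. FAISS has no `WHERE` clause; the usual workaround is to over-fetch `k × some factor` candidates and post-filter by label yourself. A minimal numpy sketch of the pattern (brute-force cosine scores stand in for `index.search`; the over-fetch factor 4 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs   = rng.random((100, 196), dtype=np.float32)      # toy corpus
vecs  /= np.linalg.norm(vecs, axis=1, keepdims=True)   # L2-normalise, as above
labels = rng.integers(0, 10, size=100)                 # toy digit labels

def top_k_with_label(q, want_label, k=5, overfetch=4):
    """Over-fetch k * overfetch candidates, then keep only the wanted label."""
    scores = vecs @ q                            # cosine scores against every row
    cand = np.argsort(-scores)[: k * overfetch]  # best candidates first
    return [int(i) for i in cand if labels[i] == want_label][:k]

hits = top_k_with_label(vecs[0], labels[0])
print(hits)   # the query row itself ranks first among its own label
```

When the filter is very selective (few rows share the label), over-fetching breaks down and you'd raise the factor or keep the metadata out of FAISS entirely, which is exactly where pgvector's `WHERE` shines.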

2. pgvector (SQL)

-- one-time setup
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE digits (
  id       BIGSERIAL PRIMARY KEY,
  label    INT,
  embedding VECTOR(196)
);
# load data from Python (snippet) ----------------------------------------
import psycopg2, json, requests

url  = "https://datasets.r2pi.co/mnist_train.json"
conn = psycopg2.connect("postgresql://…")
cur  = conn.cursor()

items = json.loads(requests.get(url).text)
for it in items:
    cur.execute(
        "INSERT INTO digits (label, embedding) VALUES (%s, %s)",
        (it["label"], str(it["image"]))  # pgvector parses the '[v1, v2, …]' text form
    )
conn.commit()
-- single-round-trip query
SELECT id, label, embedding <-> '[…query 196-vec…]' AS dist
FROM   digits
WHERE  label = 7
ORDER  BY dist
LIMIT  5;

What you get for free: durable storage, ACID transactions, WHERE label = … in the same call, standard backups & auth.
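One caveat on that single-round-trip query: as written it is an exact sequential scan. Once the table stops being toy-sized, pgvector can trade a little recall for speed with an approximate index (a sketch; availability and tuning knobs depend on your pgvector version):

```sql
-- HNSW index for the <-> (L2 distance) operator used in the query above
CREATE INDEX ON digits USING hnsw (embedding vector_l2_ops);
```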

When to reach for which

  • FAISS - lightning-fast, GPU-friendly look-ups inside an ML pipeline, when you're happy to own persistence + filtering.

  • pgvector - when your vectors are the data and need SQL joins, backups, migrations, or RBAC.

  • Hybrid - store embeddings in Postgres for reliability, dump them to FAISS for offline crunching when latency is king.
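That hybrid bullet hides one practical detail: FAISS returns row positions, while Postgres rows have their own primary keys, so the export has to carry the id mapping along. A minimal numpy sketch of that bookkeeping (the `rows` list stands in for a `SELECT id, label, embedding FROM digits` fetch; brute-force distance stands in for `index.search`):

```python
import numpy as np

# Pretend result of: SELECT id, label, embedding FROM digits (ids are not 0-based!)
rows = [
    (101, 7, [0.1] * 196),
    (205, 3, [0.9] * 196),
    (999, 7, [0.15] * 196),
]

pg_ids = np.array([r[0] for r in rows], dtype=np.int64)    # Postgres primary keys
mat    = np.array([r[2] for r in rows], dtype=np.float32)  # FAISS-ready matrix

# Brute-force nearest neighbour standing in for index.search; position -> pg id
q   = np.array([0.12] * 196, dtype=np.float32)
pos = int(np.argmin(np.linalg.norm(mat - q, axis=1)))      # nearest row position
print(int(pg_ids[pos]))   # -> 101, the Postgres id, not the matrix position 0
```

FAISS's `IndexIDMap` can store the ids inside the index instead, but a side array keeps the sketch dependency-free.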