1. Quantize the pretrained weight matrix W to 4-bit (NF4) → weight memory ↓ ~4× vs. 16-bit (8× vs. fp32)
  2. Freeze W (no gradients).
  3. Attach a LoRA patch: train two skinny matrices A ∈ ℝ^{d×r}, B ∈ ℝ^{r×k} with r ≪ d, k.
    W_{eff} = W + \alpha\,(A\,B)
  4. Back-prop only through A and B → VRAM stays tiny; training is fast (see the parameter-count sketch below).
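
To get a feel for why, here is a quick back-of-the-envelope count for a hypothetical 4096 × 4096 linear layer with rank r = 16 (the numbers are illustrative, not taken from any particular model):

d, k, r = 4096, 4096, 16       # hypothetical layer size and LoRA rank

frozen    = d * k              # 16,777,216 weights stay frozen (and 4-bit)
trainable = r * (d + k)        # 131,072 LoRA weights receive gradients

print(f"trainable fraction: {trainable / frozen:.4%}")   # ≈ 0.78 %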

What happens under the hood?

  • For each frozen Linear(in=d, out=k) layer, PEFT inserts two trainable tensors:
    A (d × r) and B (r × k)
  • During forward() it computes W x + α (A B) x.
  • Gradients flow only to A and B; the 4-bit W never changes (a hand-rolled sketch of this wiring follows).
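
Conceptually, the wiring looks like this minimal hand-rolled sketch (made-up sizes, plain float tensors instead of 4-bit, and not PEFT's actual code):

import torch

d, k, r, alpha = 64, 32, 4, 0.1        # made-up sizes, just for illustration

W = torch.randn(d, k)                              # frozen base weight (4-bit in real QLoRA)
A = (0.01 * torch.randn(d, r)).requires_grad_()    # trainable d×r factor
B = torch.zeros(r, k, requires_grad=True)          # trainable r×k factor (starts at zero)

def forward(x):                        # x: (batch, d) → (batch, k)
    # real code computes alpha * (x @ A) @ B instead of materialising A @ B
    return x @ (W + alpha * (A @ B))

loss = forward(torch.randn(8, d)).pow(2).mean()
loss.backward()
print(W.grad, A.grad.shape, B.grad.shape)   # None, torch.Size([64, 4]), torch.Size([4, 32])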

Full Colab-style script with SFTTrainer is in Google's Gemma QLoRA guide on Google AI for Developers.

1. Simple example

Say the original layer is 3 × 3:

W = \begin{bmatrix} 1 & 2 & 0 \\ 0 & -1 & 1 \\ 1 & 0 & 1 \end{bmatrix}

Pick rank r = 1 LoRA matrices (A is 3×1, B is 1×3):

A = \begin{bmatrix} 0.5 \\ -1 \\ 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & -1 & 0.5 \end{bmatrix}, \quad \alpha = 0.1

  1. Low-rank product (3×1 times 1×3 → 3×3)

    AB = \begin{bmatrix} 0.5 \\ -1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & -1 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.5 & -0.5 & 0.25 \\ -1 & 1 & -0.5 \\ 1 & -1 & 0.5 \end{bmatrix}

  2. Scaled adapter

    \alpha (AB) = 0.1 \begin{bmatrix} 0.5 & -0.5 & 0.25 \\ -1 & 1 & -0.5 \\ 1 & -1 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.05 & -0.05 & 0.025 \\ -0.1 & 0.1 & -0.05 \\ 0.1 & -0.1 & 0.05 \end{bmatrix}

  3. Effective weights during training/inference

    W_{eff} = W + \alpha (AB) = \begin{bmatrix} 1 & 2 & 0 \\ 0 & -1 & 1 \\ 1 & 0 & 1 \end{bmatrix} + \begin{bmatrix} 0.05 & -0.05 & 0.025 \\ -0.1 & 0.1 & -0.05 \\ 0.1 & -0.1 & 0.05 \end{bmatrix} = \begin{bmatrix} 1.05 & 1.95 & 0.025 \\ -0.1 & -0.9 & 0.95 \\ 1.1 & -0.1 & 1.05 \end{bmatrix}

That tiny rank-1 tweak (just 6 extra params vs. 9 frozen ones) nudges every forward pass, and the optimiser only ever adjusts those six scalars.
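
The same arithmetic in a few lines of PyTorch, if you want to check it yourself:

import torch

W = torch.tensor([[1., 2., 0.], [0., -1., 1.], [1., 0., 1.]])
A = torch.tensor([[0.5], [-1.], [1.]])     # 3×1
B = torch.tensor([[1., -1., 0.5]])         # 1×3
alpha = 0.1

W_eff = W + alpha * (A @ B)
print(W_eff)    # matches the W_eff matrix above (up to float rounding)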


2. Demo - Simple PyTorch example

Example code is in Google Colab here; I ran it on a free T4 GPU.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

model_id = "google/gemma-3-1b-pt"      # any HF-hosted model works
bnb_cfg  = BitsAndBytesConfig(
    load_in_4bit=True,                 # step 1: quantise
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_cfg        # ← frozen 4-bit weights
)
tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")   # the -pt and -it checkpoints share a tokenizer

# step 3: bolt a rank-r adapter onto every linear layer (in parallel with it)
peft_cfg = LoraConfig(
    r=16,              # the "rank" (try 4-64)
    lora_alpha=16,     # scaling factor α (PEFT applies α/r = 16/16 = 1)
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",  # patch every Dense/Linear
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)   # model is now train-ready
model.print_trainable_parameters()

# trainable params: 13,045,760 || all params: 1,012,931,712 || trainable%: 1.2879
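
As a quick sanity check (not part of any official notebook), you can push one batch through the PEFT-wrapped model and confirm that gradients land only on the LoRA tensors; the example sentence is arbitrary:

batch = tok("The quick brown fox", return_tensors="pt").to(model.device)
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable[:2])                          # ...lora_A.default.weight, ...lora_B.default.weight
print(all("lora" in n for n in trainable))    # True: the frozen 4-bit base never updates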

Another nice reference notebook on getting models set up on Google Colab is here.


Take-aways

  • 4-bit + LoRA = QLoRA - the cheapest way to nudge 7B-parameter beasts on a laptop GPU.

  • The adapter really is just a low-rank patch sitting alongside each frozen layer; its output is simply added on.

  • After training you can either:

    • keep the base + adapter pair (fastest for swapping tasks), or
    • merge_and_unload() to bake the patch into the 16-bit weights and re-quantise for inference (sketch below).
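
A minimal sketch of those two options, assuming the model, model_id and tok objects from the demo above (the output paths are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Option 1: save just the adapter (a few tens of MB); re-attach it to any
# freshly quantised copy of the base model later.
model.save_pretrained("gemma-lora-adapter")

# Option 2: merge the patch into full-precision weights for plain inference.
# Merging straight into 4-bit weights is lossy, so reload the base in bf16,
# attach the saved adapter, then merge_and_unload().
base   = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "gemma-lora-adapter").merge_and_unload()
merged.save_pretrained("gemma-merged")        # re-quantise this however you like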