1. Quantize the pretrained weight matrix W to 4-bit (NF4) → weight memory ↓ ~4× vs. 16-bit (8× vs. fp32)
  2. Freeze W (no gradients).
  3. Attach a LoRA patch: train two skinny matrices A ∈ ℝ^{d×r}, B ∈ ℝ^{r×k} with r ≪ d, k.
    W_{eff} = W + \alpha\,(A\,B)
  4. Back-prop only through A and B → VRAM stays tiny; training is fast (see the parameter-count sketch below).
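
To get a feel for why, here is a quick back-of-the-envelope count for a hypothetical 4096 × 4096 linear layer with rank r = 16 (the numbers are illustrative, not taken from any particular model):

d, k, r = 4096, 4096, 16       # hypothetical layer size and LoRA rank

frozen    = d * k              # 16,777,216 weights stay frozen (and 4-bit)
trainable = r * (d + k)        # 131,072 LoRA weights receive gradients

print(f"trainable fraction: {trainable / frozen:.4%}")   # ≈ 0.78 %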

What happens under the hood?

  • For each frozen Linear(in=d, out=k) layer, PEFT inserts two trainable tensors:
    A (d × r) and B (r × k)
  • During forward() it computes W x + α (A B) x.
  • Gradients flow only to A and B; the 4-bit W never changes (a hand-rolled sketch of this wiring follows).
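
Conceptually, the wiring looks like this minimal hand-rolled sketch (made-up sizes, plain float tensors instead of 4-bit, and not PEFT's actual code):

import torch

d, k, r, alpha = 64, 32, 4, 0.1        # made-up sizes, just for illustration

W = torch.randn(d, k)                              # frozen base weight (4-bit in real QLoRA)
A = (0.01 * torch.randn(d, r)).requires_grad_()    # trainable d×r factor
B = torch.zeros(r, k, requires_grad=True)          # trainable r×k factor (starts at zero)

def forward(x):                        # x: (batch, d) → (batch, k)
    # real code computes alpha * (x @ A) @ B instead of materialising A @ B
    return x @ (W + alpha * (A @ B))

loss = forward(torch.randn(8, d)).pow(2).mean()
loss.backward()
print(W.grad, A.grad.shape, B.grad.shape)   # None, torch.Size([64, 4]), torch.Size([4, 32])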

Full Colab-style script with SFTTrainer is in Google's Gemma QLoRA guide on Google AI for Developers.

1. Simple example

Say the original layer is 3 × 3:

W = \begin{bmatrix} 1 & 2 & 0 \\ 0 & -1 & 1 \\ 1 & 0 & 1 \end{bmatrix}

Pick rank r = 1 LoRA matrices (A is 3×1, B is 1×3):

A = \begin{bmatrix} 0.5 \\ -1 \\ 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & -1 & 0.5 \end{bmatrix}, \quad \alpha = 0.1

  1. Low-rank product (3×1 times 1×3 → 3×3)

    AB = \begin{bmatrix} 0.5 \\ -1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & -1 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.5 & -0.5 & 0.25 \\ -1 & 1 & -0.5 \\ 1 & -1 & 0.5 \end{bmatrix}

  2. Scaled adapter

    \alpha (AB) = 0.1 \begin{bmatrix} 0.5 & -0.5 & 0.25 \\ -1 & 1 & -0.5 \\ 1 & -1 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.05 & -0.05 & 0.025 \\ -0.1 & 0.1 & -0.05 \\ 0.1 & -0.1 & 0.05 \end{bmatrix}

  3. Effective weights during training/inference

    W_{eff} = W + \alpha (AB) = \begin{bmatrix} 1 & 2 & 0 \\ 0 & -1 & 1 \\ 1 & 0 & 1 \end{bmatrix} + \begin{bmatrix} 0.05 & -0.05 & 0.025 \\ -0.1 & 0.1 & -0.05 \\ 0.1 & -0.1 & 0.05 \end{bmatrix} = \begin{bmatrix} 1.05 & 1.95 & 0.025 \\ -0.1 & -0.9 & 0.95 \\ 1.1 & -0.1 & 1.05 \end{bmatrix}

That tiny rank-1 tweak (just 6 extra params vs. 9 frozen ones) nudges every forward pass, and the optimiser only ever adjusts those six scalars.
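
The same arithmetic in a few lines of PyTorch, if you want to check it yourself:

import torch

W = torch.tensor([[1., 2., 0.], [0., -1., 1.], [1., 0., 1.]])
A = torch.tensor([[0.5], [-1.], [1.]])     # 3×1
B = torch.tensor([[1., -1., 0.5]])         # 1×3
alpha = 0.1

W_eff = W + alpha * (A @ B)
print(W_eff)    # matches the W_eff matrix above (up to float rounding)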


2. Demo - Simple PyTorch example

Example code is in Google Colab here; I ran it on a free T4 GPU.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

model_id = "google/gemma-3-1b-pt"      # any HF-hosted model works
bnb_cfg  = BitsAndBytesConfig(
    load_in_4bit=True,                 # step 1: quantise
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_cfg        # ← frozen 4-bit weights
)
tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")   # the -pt and -it checkpoints share a tokenizer

# step 3: bolt a rank-r adapter onto every linear layer (in parallel with it)
peft_cfg = LoraConfig(
    r=16,              # the "rank" (try 4-64)
    lora_alpha=16,     # scaling factor α (PEFT applies α/r = 16/16 = 1)
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",  # patch every Dense/Linear
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)   # model is now train-ready
model.print_trainable_parameters()

# trainable params: 13,045,760 || all params: 1,012,931,712 || trainable%: 1.2879
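
As a quick sanity check (not part of any official notebook), you can push one batch through the PEFT-wrapped model and confirm that gradients land only on the LoRA tensors; the example sentence is arbitrary:

batch = tok("The quick brown fox", return_tensors="pt").to(model.device)
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable[:2])                          # ...lora_A.default.weight, ...lora_B.default.weight
print(all("lora" in n for n in trainable))    # True: the frozen 4-bit base never updates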

Another nice reference notebook on getting models set up on Google Colab is here.


Take-aways

  • 4-bit + LoRA = QLoRA - the cheapest way to nudge 7B-parameter beasts on a laptop GPU.

  • The adapter really is just a low-rank patch sitting alongside each frozen layer; its output is simply added on.

  • After training you can either:

    • keep the base + adapter pair (fastest for swapping tasks), or
    • merge_and_unload() to bake the patch into the 16-bit weights and re-quantise for inference (sketch below).
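
A minimal sketch of those two options, assuming the model, model_id and tok objects from the demo above (the output paths are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Option 1: save just the adapter (a few tens of MB); re-attach it to any
# freshly quantised copy of the base model later.
model.save_pretrained("gemma-lora-adapter")

# Option 2: merge the patch into full-precision weights for plain inference.
# Merging straight into 4-bit weights is lossy, so reload the base in bf16,
# attach the saved adapter, then merge_and_unload().
base   = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "gemma-lora-adapter").merge_and_unload()
merged.save_pretrained("gemma-merged")        # re-quantise this however you like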