Key Concepts Behind QLoRA Fine-Tuning
- Quantize the pretrained weight matrix W to 4-bit (NF4) → weight memory ↓ roughly 4× vs. fp16 (8× vs. fp32)
- Freeze W (no gradients).
- Attach a LoRA patch: train two skinny matrices A ∈ ℝ^{d×r}, B ∈ ℝ^{r×k} with r ≪ d, k.
- Back-prop only through A and B → VRAM stays tiny; training is fast.
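As a rough sanity check on that memory claim, here is a back-of-the-envelope sketch (assuming a hypothetical 1B-parameter model and ignoring quantisation constants, activations and optimiser state):
n_params = 1_000_000_000            # hypothetical 1B-parameter model
fp32_gb = n_params * 4 / 1e9        # 4 bytes/param -> ~4.0 GB
fp16_gb = n_params * 2 / 1e9        # 2 bytes/param -> ~2.0 GB
nf4_gb = n_params * 0.5 / 1e9       # 4 bits/param  -> ~0.5 GB
print(fp32_gb, fp16_gb, nf4_gb)     # 4.0 2.0 0.5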
What happens under the hood?
- For each frozen Linear(in=d, out=k) layer, PEFT inserts two trainable tensors: A (d × r) and B (r × k).
- During forward() it computes W x + α (A B) x (PEFT's actual scaling factor is α/r).
- Gradients flow only to A and B; the 4-bit W never changes.
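Here is a minimal hand-rolled sketch of that forward pass in plain PyTorch - not PEFT's actual code, just the shapes and α/r scaling described above, with made-up dimensions:
import torch

d, k, r, alpha = 8, 8, 2, 16

W = torch.randn(d, k)                              # stand-in for the frozen 4-bit base weight
W.requires_grad_(False)                            # explicitly frozen: no gradients reach W

A = (0.01 * torch.randn(d, r)).requires_grad_()    # trainable "down" matrix, small init
B = torch.zeros(r, k, requires_grad=True)          # trainable "up" matrix, zero init → adapter starts as a no-op

def lora_forward(x):                               # x: (batch, d)
    # y = x W + (α/r) · x (A B) - only A and B sit on the trainable path
    return x @ W + (alpha / r) * (x @ A @ B)

x = torch.randn(4, d)
lora_forward(x).sum().backward()
print(W.grad, A.grad.shape, B.grad.shape)          # None, torch.Size([8, 2]), torch.Size([2, 8])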
A full Colab-style script with SFTTrainer is in Google's Gemma QLoRA guide on Google AI for Developers.
1. Simple example
Say the original layer is 3 × 3 and pick rank r = 1 LoRA matrices (A is 3×1, B is 1×3). Then:
- Low-rank product: A B (3×1 times 1×3 → 3×3)
- Scaled adapter: α (A B)
- Effective weights during training/inference: W_eff = W + α (A B)
That tiny rank-1 tweak (just 6 extra params vs. 9 frozen params) nudges every forward pass, and the optimiser only adjusts those six scalars.
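The same rank-1 arithmetic in plain PyTorch, with made-up numbers purely for illustration:
import torch

W = torch.eye(3)                            # stand-in for the frozen 3×3 base weights
A = torch.tensor([[1.0], [0.0], [2.0]])     # 3×1, trainable
B = torch.tensor([[0.5, 0.0, -1.0]])        # 1×3, trainable
alpha = 1.0

delta = alpha * (A @ B)                     # rank-1 update, still 3×3
W_eff = W + delta                           # effective weights seen by every forward pass
print(W_eff)
print(A.numel() + B.numel(), "trainable vs", W.numel(), "frozen")   # 6 trainable vs 9 frozen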
2. Demo - Simple PyTorch example
Example code in Google Colab here. I ran it on a free T4 GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig
model_id = "google/gemma-3-1b-pt" # any HF-hosted model works
bnb_cfg = BitsAndBytesConfig(
load_in_4bit=True, # step 1: quantise
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
quantization_config=bnb_cfg # ← frozen 4-bit weights
)
tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
# step 2: the quantised base weights load frozen - no gradients will reach W
# step 3: bolt on a rank-r adapter in front of every linear layer
peft_cfg = LoraConfig(
r=16, # the "rank" (try 4-64)
lora_alpha=16, # scaling factor α
lora_dropout=0.05,
bias="none",
target_modules="all-linear", # patch every Dense/Linear
task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg) # model is now train-ready
model.print_trainable_parameters()
# trainable params: 13,045,760 || all params: 1,012,931,712 || trainable%: 1.2879
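To complete the picture, here is a minimal SFTTrainer sketch that continues the script above. Treat it as illustrative: the tiny in-memory dataset is made up, and argument names (e.g. processing_class vs. tokenizer) shift between trl releases.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

train_ds = Dataset.from_list([                      # toy in-memory dataset, purely illustrative
    {"text": "Q: What is QLoRA?\nA: 4-bit quantisation plus LoRA adapters."},
    {"text": "Q: What gets trained?\nA: Only the low-rank A and B matrices."},
])

trainer = SFTTrainer(
    model=model,                                    # the 4-bit + LoRA model built above
    processing_class=tok,                           # `tokenizer=tok` on older trl versions
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="gemma-qlora-demo",
        max_steps=20,
        per_device_train_batch_size=1,
        learning_rate=2e-4,
        dataset_text_field="text",
    ),
)
trainer.train()
trainer.save_model("gemma-qlora-demo")              # writes just the small adapter weights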
Another nice reference notebook on getting models set up in Google Colab is here.
Take-aways
- 4-bit + LoRA = QLoRA: the cheapest way to nudge 7B-parameter beasts on a laptop GPU.
- The adapter really is just a low-rank matrix sitting "in front of" each frozen layer.
- After training you can either:
  - keep the base + adapter pair (fastest for swapping tasks), or
  - call merge_and_unload() to bake the patch into W_float16 and re-quantise for inference (sketch below).
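For the merge option, the usual PEFT pattern looks roughly like this: reload the base model in 16-bit (merging into the quantised weights is version-dependent and can be lossy), attach the saved adapter, then fold it in. The "gemma-qlora-demo" directory refers to the hypothetical adapter saved in the demo above.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-pt", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "gemma-qlora-demo")   # attach the trained adapter
model = model.merge_and_unload()                # folds the scaled A·B update into each Linear weight
model.save_pretrained("gemma-qlora-merged")     # ready to re-quantise or serve in bf16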