LoRA Fine-Tuning — AI Glossary

LoRA (Low-Rank Adaptation) is the dominant parameter-efficient fine-tuning (PEFT) method for large language models. Instead of updating all billions of a model’s weights — which requires GPU memory proportional to the full model — LoRA freezes the original weights and adds small trainable matrices alongside them. The result: the model learns new behaviours with a fraction of the parameters, making fine-tuning accessible on consumer hardware and cloud instances that would otherwise be too small.

How LoRA Works

In a transformer, LoRA freezes each targeted weight matrix W (typically the attention projection matrices). Alongside each frozen W of shape d × k it adds two small trainable matrices, B (d × r) and A (r × k), whose product BA has the same shape as W and approximates the update that full fine-tuning would apply. The key constraint is that the shared inner dimension r, the rank, is small (e.g., 4, 8, or 16), so A and B together hold r(d + k) parameters instead of the d·k in W.
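To make the size difference concrete, here is a minimal sketch that counts parameters for one weight matrix. The 4096 × 4096 shape is an illustrative assumption, not a figure from any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (dense_update_params, lora_params) for one d_out x d_in weight matrix."""
    dense = d_out * d_in             # a full update has the same shape as W
    lora = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return dense, lora


# Assumed example shape: a square 4096 x 4096 projection with rank 8.
dense, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(dense, lora, f"{lora / dense:.2%}")
```

For this assumed shape the adapter holds 65,536 parameters against 16,777,216 in the frozen matrix, which is under half a percent.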

During training, only A and B are updated. At inference, the product BA can be merged back into W, so the adapted model runs with no added latency. Because only ~0.1–1% of the original parameter count is trainable, gradients and optimiser state are kept for that small fraction alone, which can cut fine-tuning memory requirements by roughly 10–20x.
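The merge step can be checked on a toy example. The sketch below uses plain Python lists and made-up 3 × 3 values; it omits the lora_alpha / r scaling that libraries such as PEFT apply to the adapter branch, so treat it as an illustration of the algebra only.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy shapes: W is 3x3, B is 3x1, A is 1x3 (rank r = 1). Values are arbitrary.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]
A = [[0.5, 0.0, 1.0]]
x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Training-time path: frozen W plus the parallel low-rank branch.
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)))

# Inference-time path: fold BA into W once, then do a single matmul.
W_merged = add(W, matmul(B, A))
merged_out = matmul(W_merged, x)

assert adapter_out == merged_out
```

Both paths give the same output, which is why a merged adapter costs nothing extra at inference time.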

The LoRA Family: Key Variants

The field has produced a rich ecosystem of LoRA extensions, each targeting a specific limitation:

QLoRA (Quantized LoRA) Combines LoRA with 4-bit quantisation of the frozen base model. The frozen weights are stored in NF4 (4-bit NormalFloat) format and dequantised on the fly for each forward pass, while the LoRA adapters train in bfloat16. QLoRA made it possible to fine-tune 33B-parameter models on a single 24GB GPU and 7B models on a free Colab T4, and its authors report quality on par with 16-bit fine-tuning on their benchmarks.
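A back-of-the-envelope calculation shows why 4-bit storage matters. This sketch counts the base weights only; it deliberately ignores quantisation constants, adapter weights, optimiser state, and activations, all of which add to the real footprint.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Compare 16-bit and 4-bit storage for the two model sizes mentioned above.
for name, n_params in [("7B", 7e9), ("33B", 33e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {nf4:.1f} GB at 4-bit")
```

At 4 bits the 33B weights take about 16.5 GB, which leaves headroom on a 24GB card, whereas 16-bit storage would need about 66 GB.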

DoRA (Weight-Decomposed Low-Rank Adaptation, ICML 2024) DoRA decomposes each pretrained weight matrix into a magnitude component and a direction component. LoRA updates are applied only to the directional component, while the magnitude is learned separately as a small vector with one entry per column. This decomposition more closely mirrors how full fine-tuning updates weights, adjusting scale and orientation separately, and the DoRA paper reports higher accuracy than standard LoRA at comparable parameter counts. The reported gains on commonsense reasoning benchmarks are a few points of average accuracy over comparable LoRA baselines (for example, +3.7 on LLaMA-7B).
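The decomposition can be sketched in a few lines. This follows the formulation in the DoRA paper, W' = m · (W0 + ΔW) / ‖W0 + ΔW‖, with one learned magnitude per column and the norm taken column-wise. Here `delta` stands in for the low-rank product BA, and all values are made up for illustration.

```python
import math


def column_norms(W):
    """L2 norm of each column of W (a list of rows)."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]


def dora_weight(W0, delta, magnitude):
    """Recompose W' = magnitude * (W0 + delta) / column_norm(W0 + delta)."""
    V = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    norms = column_norms(V)
    return [[m * v / n for v, m, n in zip(row, magnitude, norms)] for row in V]


W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)                 # magnitudes start at the column norms of W0
zero = [[0.0, 0.0], [0.0, 0.0]]
assert dora_weight(W0, zero, m) == W0    # no update: W0 is recovered

# A directional update (standing in for BA) rotates the columns,
# but each column keeps the magnitude stored in m.
delta = [[1.0, 0.0], [-1.0, 1.0]]
W_new = dora_weight(W0, delta, m)
print(W_new, column_norms(W_new))
```

Because the direction is renormalised, the low-rank branch can only change where each column points; changing its length is left to the separately trained magnitudes.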

VeRA (Vector-based Random Matrix Adaptation) VeRA goes further than LoRA in parameter reduction by sharing a single pair of frozen random low-rank matrices across all layers, then learning only small per-layer scaling vectors on top of them. This produces far fewer trainable parameters than LoRA while maintaining competitive accuracy, which is useful when many adapters must be stored or served in extremely memory-constrained environments.
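A rough count illustrates the saving. The sketch assumes the VeRA formulation in which each adapted matrix trains one scaling vector of length r and one of length d_out, while the shared random matrices stay frozen. The layer count, width, and ranks below are illustrative assumptions.

```python
def lora_trainable(n_matrices: int, d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when every adapted matrix gets its own A and B."""
    return n_matrices * rank * (d_out + d_in)


def vera_trainable(n_matrices: int, d_out: int, rank: int) -> int:
    """Trainable parameters when only two scaling vectors per matrix are learned."""
    return n_matrices * (rank + d_out)


# Assumed setup: 32 adapted matrices of size 4096 x 4096.
print(lora_trainable(32, 4096, 4096, rank=8))   # LoRA at rank 8
print(vera_trainable(32, 4096, rank=256))       # VeRA at a much higher rank
```

Even at rank 256 the VeRA count stays more than ten times smaller than LoRA at rank 8 in this assumed setup, because the rank only adds one vector entry per unit rather than a full row and column.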

rsLoRA (Rank-Stabilized LoRA) Standard LoRA scales the adapter output by α/r, so the effective update shrinks as the rank grows and higher ranks often bring little benefit without retuning the learning rate. rsLoRA replaces this factor with α/√r, which keeps the update scale stable across ranks and allows higher-rank adapters to be used reliably without extra hyperparameter search.
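The difference between the two scaling rules is easy to tabulate. This sketch compares the conventional factor alpha / r with the rank-stabilised factor alpha / sqrt(r); alpha = 16 is an arbitrary choice for illustration.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Conventional LoRA scaling applied to the BA branch."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilised scaling: divides by sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)


for rank in (4, 16, 64, 256):
    print(rank, lora_scale(16, rank), rslora_scale(16, rank))
```

Going from rank 4 to rank 256, the conventional factor shrinks by 64x while the rank-stabilised factor shrinks by only 8x, which is the effect that lets large ranks keep contributing.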

LoftQ (LoRA-Fine-Tuning-aware Quantization) LoftQ jointly optimises the quantisation of the base model and the initialisation of the LoRA adapters, minimising the approximation error introduced by quantisation from the start of training. This improves convergence and final quality when combining quantisation with LoRA compared to naive QLoRA.
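In HuggingFace PEFT, several of these variants are exposed as options on LoraConfig rather than as separate trainers. The fragment below is a sketch assuming a recent PEFT release; the rank, alpha, and target modules are placeholder values, and exact behaviour may differ between versions.

```python
from peft import LoftQConfig, LoraConfig

# Placeholder hyperparameters shared by all three configs (assumed values).
common = dict(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# DoRA: adds the learned magnitude on top of the LoRA update.
dora_config = LoraConfig(use_dora=True, **common)

# rsLoRA: scales the adapter by lora_alpha / sqrt(r) instead of lora_alpha / r.
rslora_config = LoraConfig(use_rslora=True, **common)

# LoftQ: quantisation-aware adapter initialisation (needs scipy installed;
# the base model is loaded unquantised and PEFT handles the quantisation).
loftq_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=LoftQConfig(loftq_bits=4),
    **common,
)
```

Any of these configs can be passed to get_peft_model in place of a plain LoraConfig. VeRA is configured through a separate VeraConfig class in PEFT.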

Hybrid Methods

DVoRA — applies VeRA’s shared-matrix strategy to DoRA’s directional component, combining extreme parameter efficiency with DoRA’s decomposition benefits. Achieves notable accuracy gains with only 0.02% additional trainable parameters.
QDoRA — applies 4-bit quantisation to the base model and replaces LoRA with DoRA in the adapter, combining QLoRA’s memory savings with DoRA’s quality improvements.

2025–2026: What’s Changed

Research published in early 2026 (arXiv:2602.04998) revisited a foundational assumption: that LoRA variants consistently outperform vanilla LoRA. The paper found that carefully tuned learning rates for standard LoRA can close most of the gap to more complex variants on many benchmarks, suggesting that the learning rate is often the dominant factor and that practitioners should tune it before reaching for architectural extensions.

This has shifted practitioner guidance: start with vanilla LoRA or QLoRA, tune the learning rate, and only adopt DoRA or rsLoRA if the quality gap on your specific task justifies the added complexity.
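A learning-rate sweep is usually run over log-spaced values. The helper below is a minimal sketch; the range of 1e-5 to 1e-3 and the five candidates are assumptions chosen for illustration, not values from the cited paper.

```python
def log_spaced(low: float, high: float, n: int) -> list[float]:
    """Return n values from low to high, evenly spaced on a log scale."""
    ratio = (high / low) ** (1 / (n - 1))
    return [low * ratio**i for i in range(n)]


# Assumed sweep range; adjust to the model and dataset at hand.
candidates = log_spaced(1e-5, 1e-3, 5)
for lr in candidates:
    print(f"{lr:.2e}")
```

Each candidate would then get one short training run with everything else held fixed, and the runs are compared on a held-out set before any variant is considered.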

Practical Impact in 2026

PEFT methods are commonly credited with reducing fine-tuning memory by 10–20x while retaining 90–95% of full fine-tuning quality in most domains. LoRA and its variants have made LLM specialisation genuinely accessible: a mid-range GPU, a few hundred curated examples, and an afternoon are sufficient to fine-tune a capable open-weight model for a specific domain. This has democratised model specialisation across industries, from legal document analysis to medical coding to customer-specific chatbots, without requiring the infrastructure of the original pre-training run.

How to Use — QLoRA fine-tuning with HuggingFace PEFT + TRL

python

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer
import torch

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# 4-bit quantisation (QLoRA) — fits a 3B model on a single 16 GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; well under 1% of parameters are trainable
lora_config = LoraConfig(
    r=16,                       # rank
    lora_alpha=32,              # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset (replace with your own instruction-tuning data)
dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train[:500]")

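trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,  # use the tokenizer configured above; older TRL releases name this `tokenizer=`
    args=SFTConfig(
        output_dir="./lora-adapter",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size of 8 per device
        learning_rate=2e-4,
        bf16=True,                       # requires a GPU with bfloat16 support
        logging_steps=10,
        # "prompt" is assumed to hold the full Alpaca-formatted instruction plus response
        # in this dataset; "output" alone would train on answers without their instructions
        dataset_text_field="prompt",
    ),
)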
trainer.train()
model.save_pretrained("./lora-adapter")
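Once training finishes, the saved adapter can be folded back into the base weights for deployment. The helper below is a sketch assuming the PEFT and Transformers APIs as of recent releases; `merge_adapter` and the output path are hypothetical names introduced here for illustration.

```python
def merge_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Load a base model, apply a saved LoRA adapter, and save the merged weights."""
    # Imported lazily so that defining this helper does not load the heavy libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the base model unquantised so the merge happens in bfloat16.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold BA into W and drop the adapter wrappers, leaving a plain Transformers model.
    merged = model.merge_and_unload()
    merged.save_pretrained(out_dir)


# Example call (the output directory is a hypothetical path):
# merge_adapter("meta-llama/Llama-3.2-3B-Instruct", "./lora-adapter", "./merged-model")
```

Merging into an unquantised copy of the base model avoids the extra rounding error that comes from merging into 4-bit weights, at the cost of needing enough memory to hold the model in bfloat16.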
