The Parameter Explosion Problem
A 7B parameter model has 7 billion parameters. Fine-tuning all of them requires storing 7B gradients plus optimizer states — Adam typically requires 3 copies of each parameter. A full fine-tuning run on a 70B model in 16-bit precision requires ~420GB of GPU memory. Most researchers and companies don't have 8 H100s dedicated to a single training job.
Parameter-efficient fine-tuning (PEFT) methods adapt large models to new tasks by training only a small fraction of their parameters. LoRA and QLoRA are the dominant PEFT methods, responsible for the explosion of task-specific open-source model adaptations available on HuggingFace.
LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2021) is based on a key observation: fine-tuning updates to a large weight matrix W tend to be low-rank. If the update ΔW has rank r ≪ min(d, k), we can decompose it:
ΔW = BA where B ∈ R^(d×r), A ∈ R^(r×k)
Instead of updating the full W (d×k parameters), we train only B and A (d·r + r·k parameters). For a 4096×4096 weight matrix with r=8, LoRA reduces the trainable parameters from 16.7M to 65K — a 256× reduction.
During training, the original weights W are frozen; only A and B are trained. During inference, the adapted weight is W + BA (or equivalently, BA can be absorbed into W). At typical ranks (r=4 to r=64), LoRA recovers 90-99% of full fine-tuning quality on most tasks.
Practical LoRA Configuration
Key hyperparameters:
- Rank (r): Lower rank = fewer parameters = less capacity. r=8 works for most tasks; use r=32 or r=64 for complex reasoning. Start with r=16.
- Alpha (α): Scaling factor for the LoRA update. The effective scaling is α/r. Common convention: set α = 2r (so effective scale = 2).
- Target modules: Which matrices to apply LoRA to. Typically query and value projections in attention layers; sometimes all linear layers including MLP.
- Dropout: Applied between A and B for regularization. Default 0.05.
QLoRA: Adding Quantization
QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization of the base model. The insight: during fine-tuning, we only need the base model to compute activations and gradients. If we store it in 4-bit (NF4) format, we dramatically reduce memory — and LoRA adapters are still trained in 16-bit, so gradient quality is preserved.
QLoRA memory requirements: a 7B model in 4-bit requires ~4GB. A 70B model in 4-bit requires ~40GB. With LoRA adapters (small, in 16-bit), the total typically adds 1-4GB. This means a 70B model can be fine-tuned on 2 A100s rather than 16.
QLoRA quality is usually within 1-2% of full-precision LoRA. For most applications, this is acceptable. It has made 70B-scale fine-tuning accessible to individual researchers and small companies.
Practical Recommendations
- Use QLoRA for 7B–70B models when you have limited GPU memory.
- Use full LoRA (no quantization) for 1B–7B models where memory permits.
- Use full fine-tuning only when you need to modify all model behavior (large domain adaptation, structural changes).
- The peft library (HuggingFace) provides production-ready LoRA/QLoRA implementations.
- Monitor training loss curves: LoRA training should converge similarly to full fine-tuning; significantly worse convergence suggests the rank is too low or the learning rate too high.