The Linear Algebra Every ML Engineer Actually Needs

What to Learn vs. What to Skip

A typical linear algebra course covers abstract vector spaces, determinants, and theoretical proofs. Machine learning uses a specific subset of linear algebra: mostly operations on matrices and vectors that show up repeatedly. This article identifies what actually matters for ML practitioners, explains why, and points to where you'll encounter each concept.

Matrix Multiplication: The Foundation

Almost everything in deep learning is matrix multiplication. A fully connected layer: Y = XW + b, where X is the input matrix (batch_size × features), W is the weight matrix (features × outputs). An attention mechanism: Q·K^T / sqrt(d_k). The forward pass of a transformer is a sequence of matrix multiplications with nonlinearities between them.

You need to be fluent with: matrix dimensions and why they have to be compatible, the transpose operation, batch matrix multiplication (processing many matrices simultaneously), and the computational cost O(m×n×k) for multiplying an (m×k) matrix by a (k×n) matrix.

Eigenvalues and Eigenvectors

An eigenvector of matrix A is a vector v such that Av = λv — the matrix transforms it only by scaling (by eigenvalue λ), not by rotation. This shows up in ML through: PCA (principal components are eigenvectors of the covariance matrix), analyzing gradient flow (eigenvalues of the Hessian determine optimization curvature), and understanding why some layers have better gradient propagation than others.

Singular Value Decomposition (SVD)

SVD decomposes any matrix A = UΣV^T where U and V are orthogonal and Σ is diagonal. This is the factorization that underlies: data compression (keep only the top-k singular vectors), understanding matrix rank, LoRA fine-tuning (which approximates weight update matrices as low-rank = low-k SVD), and noise filtering. The singular values in Σ tell you the "intrinsic dimensionality" of the information in a matrix.

Norms and Distances

The L2 norm ||v||₂ = sqrt(Σ vᵢ²) is the standard Euclidean distance. It appears everywhere: gradient clipping (||∇||₂), regularization (L2/weight decay adds ||W||₂² to the loss), cosine similarity (measuring semantic similarity in embedding spaces), and matrix norms (measuring how much a transformation stretches vectors).

The L1 norm ||v||₁ = Σ |vᵢ| produces sparsity when used as a regularizer (Lasso regression). Understanding when to use L1 vs. L2 regularization requires understanding how these norms penalize different parameter configurations.

Gradients and the Jacobian

Backpropagation computes gradients: for a scalar loss L and a parameter vector θ, ∇L is the vector of partial derivatives ∂L/∂θᵢ. For vector-valued functions, the Jacobian J[i,j] = ∂fᵢ/∂xⱼ is the matrix of all partial derivatives. Chain rule for matrices (Jacobian chain rule) is how backprop flows through multiple layers.

You don't need to compute Jacobians by hand, but understanding what they are — and that PyTorch's autograd is computing them efficiently — is essential for debugging gradient issues and understanding why certain architectures train better than others.