Interpretability: Opening the Black Box

The Opacity Problem

Modern neural networks are "black boxes": they map inputs to outputs through billions of learned parameters in ways that resist human interpretation. A neural network that denies a loan application doesn't "explain" its decision — it produces a number. This opacity creates problems for trust, accountability, debugging, and safety: we can measure a model's average accuracy but don't understand when and why it will fail.

Interpretability research seeks to understand what AI systems are doing internally — not just what outputs they produce, but what computations they perform, what concepts they represent, and why they make specific predictions.

Post-Hoc Explanations

Post-hoc explanation methods explain individual predictions without modifying the model:

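  • LIME: Fits an interpretable surrogate (typically a sparse linear model) around a specific prediction. It perturbs the input, records how the output changes, weights the perturbed samples by their proximity to the original input, and fits the surrogate to them.
  • SHAP: Distributes a prediction's value across features based on Shapley values from cooperative game theory. More principled than LIME, but exact computation scales exponentially with the number of features, so practical use relies on approximations.
  • Gradient-based attribution: The gradient of the output with respect to each input feature measures how sensitive the prediction is to small changes in that feature, which serves as a proxy for the feature's importance. For image models, Grad-CAM uses gradients flowing into a convolutional layer to produce coarse heatmaps highlighting the image regions most relevant to a given class.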
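To make the Shapley idea concrete, here is a minimal sketch that computes exact Shapley values by brute force for a three-feature toy function. The function `model`, the all-zeros baseline, and the helper name `shapley_values` are assumptions chosen for illustration; this is not the API of the SHAP library, which relies on approximations rather than full enumeration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def model(x):
    """Hypothetical scoring model: two additive terms plus one interaction."""
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[1] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition keep baseline values."""
    n = len(x)
    phi = np.zeros(n)

    def value(coalition):
        z = baseline.copy()
        for j in coalition:
            z[j] = x[j]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that a coalition of this size precedes feature i
            # in a uniformly random ordering of the features.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
phi = shapley_values(model, x, baseline)

print("attributions:", phi)
print("sum of attributions:", phi.sum())
print("f(x) - f(baseline):", model(x) - model(baseline))
```

The attributions sum to the difference between the prediction and the baseline output, and the interaction term's contribution is split evenly between the two features involved. The nested loops visit every coalition of the other features, which is why exact computation becomes infeasible as the number of features grows.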

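Caveat: most post-hoc explanations describe a simplified approximation of the model's behavior (a local surrogate, a baseline-dependent attribution, or a first-order sensitivity), not the computation the network actually performs. They can be misleading, particularly when features are correlated or when the model relies on non-linear feature interactions.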
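The interaction case can be illustrated with a deliberately simplified, LIME-style sketch: an unweighted linear surrogate fitted on a symmetric grid of perturbations. The toy `model`, the grid, and the lack of a proximity kernel are all assumptions made for brevity; the real LIME procedure samples perturbations and weights them by distance.

```python
import numpy as np


def model(X):
    """Hypothetical model whose output is a pure interaction of two features."""
    return X[:, 0] * X[:, 1]


# Perturb the input of interest on a small symmetric grid around it.
x0 = np.array([0.0, 0.0])
offsets = np.linspace(-1.0, 1.0, 5)
grid = np.array([[a, b] for a in offsets for b in offsets])
X = x0 + grid
y = model(X)

# Fit the local linear surrogate y ~ b + w0 * x_0 + w1 * x_1 by least squares.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, weights = coef[0], coef[1:]

# Fraction of the model's local variation that the surrogate captures.
residual = y - design @ coef
r_squared = 1.0 - residual.var() / y.var()

print("surrogate weights:", weights)
print("surrogate R^2:", r_squared)
print("model output range:", y.min(), y.max())
```

In this setup the surrogate assigns essentially zero weight to both features and explains none of the local variation, even though the model's output swings across its full range on the same neighborhood. A reader who saw only the surrogate's weights would conclude that neither feature matters.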

Mechanistic Interpretability

Mechanistic interpretability attempts to reverse-engineer neural networks: identify specific circuits (groups of neurons and weights) that implement specific algorithms. This is harder than post-hoc explanation but, when it succeeds, more faithful: the explanation describes the model's actual internal computation rather than an approximation of its input-output behavior.

Key results: Elhage et al. identified "induction heads" in transformer models, attention-head circuits that support in-context learning by finding an earlier occurrence of the current token and copying what followed it. Multiple groups have identified circuits for addition, indirect object identification, and simple factual recall in language models. The field has developed tools like activation patching (replace selected internal activations in one run with the activations from a run on a different input, and see how the output changes), ablation (zero out or otherwise remove specific components and see which predictions change), probing classifiers (train a linear probe to decode specific features from activations), and logit attribution (identify which attention heads most influence specific output tokens).
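As a rough illustration of activation patching, the sketch below uses a hand-built two-layer network with three hidden units. The weights, the two inputs, and the `forward` helper are hypothetical; real experiments patch activations inside a trained transformer, usually through a hooking library, but the logic is the same: run a "clean" and a "corrupted" input, copy one clean activation into the corrupted run, and measure how much of the clean output is restored.

```python
import numpy as np

# Hand-picked weights for a toy network: 2 inputs -> 3 ReLU units -> 1 output.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, -1.0]])
w2 = np.array([2.0, 0.0, 1.0])


def forward(x, patch=None):
    """Run the toy network; optionally overwrite hidden units given as {index: value}."""
    hidden = np.maximum(W1 @ x, 0.0)
    if patch:
        for unit, value in patch.items():
            hidden[unit] = value
    return float(w2 @ hidden), hidden


clean_x = np.array([1.0, 0.0])
corrupt_x = np.array([0.0, 1.0])

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch each hidden unit in turn and record the fraction of the
# clean-vs-corrupted output gap that the patch restores.
recovery = []
for unit in range(len(w2)):
    patched_out, _ = forward(corrupt_x, patch={unit: clean_hidden[unit]})
    recovery.append((patched_out - corrupt_out) / (clean_out - corrupt_out))

for unit, frac in enumerate(recovery):
    print(f"hidden unit {unit}: restores {frac:.2f} of the output gap")
```

Here patching unit 0 restores most of the gap, unit 2 restores the rest, and unit 1 restores nothing because the output layer ignores it. In a real model the same comparison is used to localize which components carry the information that distinguishes the two inputs.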

Superposition and the Limits of Current Methods

A major challenge is polysemanticity: many individual neurons respond to multiple unrelated concepts (a single neuron might respond to both "European royalty" and "curry"). A leading explanation is superposition, the hypothesis that a network represents more features than it has neurons by encoding them as overlapping directions in activation space rather than as individual neurons, which makes per-neuron analysis incomplete. Sparse autoencoders (SAEs) have recently shown promise for disentangling superposed features into more interpretable "dictionary elements." This is an active research area with significant implications for AI safety.
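A small geometric sketch shows why superposition produces polysemantic neurons. It packs five hypothetical features into a two-neuron activation space by giving each feature its own direction; the evenly spaced directions and the ReLU readout are assumptions chosen for simplicity. This is not a sparse autoencoder, only an illustration of the representation that such a method tries to disentangle.

```python
import numpy as np

n_features, n_neurons = 5, 2

# One unit-length direction per feature, evenly spaced around the circle.
angles = 2.0 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])  # shape: (n_neurons, n_features)


def encode(features):
    """Map a feature vector to neuron activations."""
    return W @ features


def readout(activations):
    """Project activations back onto each feature direction, then apply ReLU."""
    return np.maximum(W.T @ activations, 0.0)


# Each neuron (row of W) has a sizeable weight on several features: polysemanticity.
features_per_neuron = (np.abs(W) > 0.1).sum(axis=1)
print("features each neuron responds to:", features_per_neuron)

# With one feature active at a time, the readout still identifies it,
# at the cost of some interference leaking into neighboring features.
recovered = [int(np.argmax(readout(encode(np.eye(n_features)[k]))))
             for k in range(n_features)]
print("recovered feature indices:", recovered)

# Largest overlap between two distinct feature directions.
gram = W.T @ W
interference = np.max(gram - np.eye(n_features))
print("max interference between distinct features:", round(float(interference), 3))
```

No single neuron corresponds to one feature, yet each feature remains recoverable as a direction as long as few features are active at once. That sparsity assumption is what dictionary-learning approaches such as sparse autoencoders exploit.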