The LLM Shadow
Since ChatGPT launched in late 2022, large language models have dominated AI discussions. Budgets, talent, and media attention have concentrated on text and image generation. But reinforcement learning (RL), the framework for training agents through trial and reward, has quietly been going through one of its most productive periods. Understanding why requires stepping back from the current hype cycle.
What RL Is That LLMs Aren't
Language models are pretrained on static data: text that was written before training began. They're optimized to predict the next token, which is a powerful proxy for many tasks but doesn't directly optimize for taking actions in the world. Reinforcement learning trains on experience: an agent takes actions, observes outcomes, receives rewards, and adjusts its behavior to maximize cumulative reward over time.
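To make that loop concrete, here is a minimal sketch of tabular Q-learning, one classic RL algorithm, on a made-up five-cell corridor. The environment, the `step` and `train` functions, and every hyperparameter are assumptions chosen for illustration, not something taken from a production system.

```python
import random

N_STATES = 5          # cells 0..4; cell 4 is the goal (terminal)
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Hypothetical toy environment: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (or on ties), otherwise act greedily.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: nudge the estimate toward
            # reward + discounted value of the best next action.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

def greedy_policy(q):
    """Best-known action for each non-terminal cell."""
    return [ACTIONS[0] if row[0] > row[1] else ACTIONS[1] for row in q[:-1]]

if __name__ == "__main__":
    print(greedy_policy(train()))
```

Nothing in the code tells the agent that "right" is the correct direction. It only ever sees a reward at the goal, and the preference for moving right propagates backward through the value estimates as it gathers experience.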
This distinction matters for any task that requires: sequential decision-making (where current actions affect future options), optimization against sparse rewards (the right answer isn't evident until many steps later), physical interaction (robots don't get "next token" feedback), or discovering strategies humans haven't thought of.
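The sparse-reward case is easier to see with numbers. The sketch below computes discounted returns, the standard way RL assigns credit to earlier actions for a reward that arrives later. The function name and the example episode are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

if __name__ == "__main__":
    # A three-step episode where only the final step is rewarded.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Even though the first two steps earn nothing directly, they receive a nonzero return, which is what lets an agent learn that early actions contributed to a late payoff.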
RL Inside LLMs
Ironically, one of the most important recent advances in LLMs, alignment via reinforcement learning from human feedback (RLHF), is itself reinforcement learning: a reward signal derived from human preference judgments is used to tune language models to be more helpful and less harmful. DeepSeek R1's long chain-of-thought reasoning emerged largely from reinforcement learning on outcome-based rewards. OpenAI describes its o1 and o3 reasoning models as trained with large-scale RL to think before answering, with accuracy that improves as more compute is spent at inference time. RL is a core technology behind many of the most capable aspects of modern language models.
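As a rough illustration of what learning from an outcome-based reward means, the sketch below runs REINFORCE, a basic policy-gradient method, on a toy policy that chooses among four candidate answers and is rewarded only when the answer is correct. It is a deliberately tiny stand-in, not how RLHF or any lab's training pipeline is actually implemented; the function names, the number of answers, and the learning rate are all assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(n_answers=4, correct=2, steps=2000, lr=0.5, seed=0):
    """Toy REINFORCE loop: the only feedback is whether the final answer was right."""
    rng = random.Random(seed)
    logits = [0.0] * n_answers  # policy parameters, one per candidate answer
    for _ in range(steps):
        probs = softmax(logits)
        answer = rng.choices(range(n_answers), weights=probs)[0]
        reward = 1.0 if answer == correct else 0.0  # outcome-based reward
        # Policy-gradient step: d/d(logit_k) of log pi(answer) = 1[k == answer] - probs[k]
        for k in range(n_answers):
            indicator = 1.0 if k == answer else 0.0
            logits[k] += lr * reward * (indicator - probs[k])
    return softmax(logits)

if __name__ == "__main__":
    print([round(p, 3) for p in reinforce()])
```

The policy is never shown the correct answer. It samples, gets a score, and shifts probability toward whatever scored well, which is the same basic principle, at a vastly smaller scale, as rewarding a model for reaching a verifiably correct result.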
Physical AI: The Next Frontier
Physical AI, meaning robots that can perform complex manipulation and locomotion in unstructured environments, leans heavily on RL, usually alongside imitation learning and classical control. Companies such as Boston Dynamics, Figure, Physical Intelligence (Pi), and 1X Technologies are investing in learning-based control, with RL as a significant ingredient. The sim-to-real pipeline (train in simulation, deploy in the real world) has matured considerably: policies trained largely or entirely in simulation can now control real robots in real time and reach human-competitive dexterity on specific, narrow tasks.
AlphaFold 2 and Scientific Discovery
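AlphaFold 2, widely credited with a breakthrough on the 50-year-old protein structure prediction problem, combined a transformer-style architecture with evolutionary data. It is worth being precise here: AlphaFold 2 was trained with supervised learning, and its iterative refinement ("recycling") and confidence estimates are learned the same way, not through RL. Clearer examples of RL in scientific AI are systems such as DeepMind's AlphaTensor, which used RL to search for faster matrix multiplication algorithms. That is characteristic of RL's role in scientific AI: enabling optimization toward complex, hard-to-specify objectives where the search space is too large to enumerate.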
Recommendation and Control Systems
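Many of the systems that people interact with most often (content recommendation, ad placement, logistics routing, trading algorithms) are built on RL or bandit methods, or can naturally be framed that way. YouTube's recommendation system optimizes engagement signals such as watch time, which play the role of a reward, and Google researchers have published work applying RL to it. Dynamic pricing of the kind Uber runs can be framed as a contextual bandit, and route optimization at carriers like FedEx is a natural target for deep RL. These systems are less glamorous than image generation but arguably more economically significant.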
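For readers who have not met the term, a bandit is the simplest RL setting: one decision, repeated many times, with no long-term state. Here is a minimal epsilon-greedy sketch in which a recommender picks among three items with unknown click rates. The click rates, function name, and parameters are invented for illustration and do not describe any real company's system.

```python
import random

def epsilon_greedy_bandit(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Learn which item to show by trial and reward (hypothetical click rates)."""
    rng = random.Random(seed)
    k = len(click_rates)
    pulls = [0] * k          # how often each item was shown
    estimates = [0.0] * k    # running average reward per item
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda i: estimates[i])  # exploit
        # Simulated user: clicks with the item's true (hidden) probability.
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean update of the reward estimate.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls

if __name__ == "__main__":
    estimates, pulls = epsilon_greedy_bandit([0.2, 0.5, 0.8])
    print("estimated click rates:", [round(e, 2) for e in estimates])
    print("times shown:", pulls)
```

A contextual bandit extends this by conditioning the choice on features of the user or situation, and full RL extends it further by letting today's choice change tomorrow's state.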
Anyone building AI applications beyond text generation will likely encounter reinforcement learning sooner or later. Meridian AI's MS in Reinforcement Learning & Autonomous Systems prepares students for exactly this landscape.