Chain-of-Thought Prompting: Theory and Practice

Why Scratchpads Help

When you ask an LLM a math problem, it produces a token sequence, with each token predicted from the preceding context. If the answer requires several computation steps but the model must output the final answer immediately, all of that computation has to happen inside the forward passes that produce the answer tokens. The model has nowhere to write down intermediate results, so it tends to predict whatever answer looks most plausible after the question.

Chain-of-thought (CoT) prompting changes this by instructing the model to show its reasoning before giving its final answer. The reasoning steps become part of the context that the final answer is conditioned on, so each intermediate result is available as input when the model predicts the next step and, ultimately, the answer.
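One way to make the conditioning argument concrete is to write it as a factorization. The notation here is an assumption introduced for illustration and does not come from the original text: q is the question, r the generated reasoning tokens, and a the final answer.

```latex
\begin{align*}
\text{Direct answer:} \quad & a \sim p(a \mid q) \\
\text{Chain of thought:} \quad & r \sim p(r \mid q), \qquad a \sim p(a \mid q, r) \\
\text{Joint:} \quad & p(r, a \mid q) = p(r \mid q)\, p(a \mid q, r)
\end{align*}
```

In the chain-of-thought case the answer is drawn conditioned on r, which is where the intermediate results live.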

Zero-Shot Chain-of-Thought

The simplest form, introduced by Kojima et al. (2022), appends "Let's think step by step." to the prompt, typically as the opening of the answer. This single phrase improves performance on many arithmetic, logical reasoning, and multi-step inference tasks. The improvement can be dramatic: in the original paper, accuracy with InstructGPT (text-davinci-002) rose from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K.
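A minimal sketch of how such a prompt might be assembled in two stages: one prompt to elicit the reasoning, and a second that feeds the reasoning back and asks for the final answer only. The function names, the "Q:/A:" layout, and the answer-extraction phrase are assumptions chosen for illustration; no model is called here.

```python
COT_TRIGGER = "Let's think step by step."


def build_reasoning_prompt(question: str) -> str:
    """Stage 1: end the prompt with the trigger so the model continues with reasoning."""
    return f"Q: {question.strip()}\nA: {COT_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: append the generated reasoning, then ask for the final answer only."""
    return (
        f"{build_reasoning_prompt(question)} {reasoning.strip()}\n"
        "Therefore, the answer (arabic numerals) is"
    )


# Hypothetical question and model output, used only to show the prompt shapes.
question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"
reasoning = "12 pens is 4 groups of 3 pens. Each group costs $2, so 4 * 2 = 8."
print(build_reasoning_prompt(question))
print("---")
print(build_answer_prompt(question, reasoning))
```

Splitting the call in two keeps the final answer easy to parse, at the price of a second request.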

Few-Shot Chain-of-Thought

Few-shot CoT (Wei et al., 2022) provides complete reasoning examples in the prompt. Each example shows the full step-by-step reasoning process, teaching the model the desired format and level of detail. This typically outperforms zero-shot CoT by roughly 5-15% on challenging benchmarks, depending on the model and task, but requires more prompt engineering effort.
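The prompt structure can be sketched as a small builder that renders worked examples ahead of the new question. The `Exemplar` type, the function name, the "The answer is X." convention, and the sample problems are all hypothetical, chosen for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Exemplar:
    question: str
    reasoning: str  # the worked steps shown to the model
    answer: str


def build_few_shot_cot_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Render each exemplar with its reasoning, then leave the new answer open."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplars; in practice they should resemble the target task.
EXEMPLARS = [
    Exemplar(
        question="Tom has 4 boxes with 6 eggs each. He uses 5 eggs. How many are left?",
        reasoning="4 boxes of 6 eggs is 4 * 6 = 24 eggs. After using 5, 24 - 5 = 19.",
        answer="19",
    ),
    Exemplar(
        question="A train travels 60 km per hour for 3 hours. How far does it go?",
        reasoning="Distance is speed times time, so 60 * 3 = 180 km.",
        answer="180",
    ),
]

print(build_few_shot_cot_prompt(EXEMPLARS, "A book costs $7. How much do 6 books cost?"))
```

Ending every exemplar with the same fixed phrase gives the model a consistent pattern to imitate and makes the final answer easier to extract afterwards.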

Self-Consistency CoT

Rather than taking a single reasoning chain, self-consistency (Wang et al., 2023) generates multiple diverse reasoning paths (by sampling at temperature > 0) and takes a majority vote on the final answer. This works because different valid reasoning paths through a problem often arrive at the same correct answer, while incorrect paths tend to scatter across different wrong answers. Self-consistency typically improves CoT accuracy by 5-10%. Cost scales linearly with the number of samples: 10-20 chains means roughly 10-20× the compute of a single chain.
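The voting step can be sketched as follows. This is a minimal illustration under stated assumptions: `sample` stands in for a model call at temperature > 0, completions are assumed to end with "The answer is X.", and the function names are hypothetical. A canned list of completions replaces the model so the example runs offline.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the number after 'The answer is', or None if the pattern is absent."""
    match = re.search(r"[Tt]he answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None


def self_consistent_answer(sample: Callable[[], str], n_samples: int = 10) -> Optional[str]:
    """Draw n_samples reasoning chains and return the most common final answer."""
    answers = [extract_answer(sample()) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)  # unparseable chains do not vote
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Canned completions standing in for sampled model outputs (hypothetical data).
canned = iter([
    "4 groups of 3 pens at $2 each is 4 * 2 = 8. The answer is 8.",
    "12 / 3 = 4 and 4 * 2 = 8. The answer is 8.",
    "12 pens at $0.50 each is 6. The answer is 6.",
    "I am not sure how to price this.",
    "Each pen costs 2/3 of a dollar, and 12 * 2/3 = 8. The answer is 8.",
])
print(self_consistent_answer(lambda: next(canned), n_samples=5))
```

Chains whose answer cannot be parsed are dropped rather than counted, which is one possible design choice among several.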

Tree of Thoughts

Tree of Thoughts (Yao et al., 2023) extends CoT from linear chains to tree-structured exploration. The model generates multiple partial reasoning paths, evaluates their promise (using another LLM call), and expands the most promising paths. This enables backtracking and exploration that linear CoT cannot do. It is particularly effective for problems where intermediate states can be evaluated, such as puzzles, planning, and code debugging.
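The generate, evaluate, and expand loop can be sketched as a breadth-first search that keeps only the best few partial paths at each level. This is an illustrative simplification, not the paper's implementation: `propose` and `score` would be LLM calls in a real system, but here they are plain functions over a toy puzzle (pick numbers from a fixed set that sum to a target), and all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # a partial path: the numbers chosen so far


def tree_of_thoughts_bfs(
    root: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    is_solution: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Optional[State]:
    """Expand every kept state, return the first solution, keep the top-scoring rest."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        if not candidates:
            return None
        # Prune: keep only the most promising partial paths for the next level.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None


# Toy stand-ins for the LLM-backed generator and evaluator.
TARGET = 10
STEPS = (2, 3, 7)


def propose(state: State) -> List[State]:
    """Extend the path by one number, dropping extensions that overshoot the target."""
    return [state + (s,) for s in STEPS if sum(state) + s <= TARGET]


def score(state: State) -> float:
    """Higher is better: paths whose sum is closer to the target score higher."""
    return -float(TARGET - sum(state))


result = tree_of_thoughts_bfs((), propose, score, lambda s: sum(s) == TARGET)
print(result)
```

Keeping several paths alive at each level is what lets the search recover when the top-ranked path turns out to be a dead end. A depth-first variant with explicit backtracking is another option.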

When to Use CoT

CoT helps when the task requires multiple sequential reasoning steps, involves computation (arithmetic, logic), or depends on synthesizing several pieces of information. It does not help much for simple factual questions, classification tasks, or short text generation. It can hurt when the desired answer is brief and direct: the extra reasoning adds overhead without improving the result.

Modern frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro) largely do CoT automatically when the task requires it — the "think step by step" instruction is less necessary than it was with GPT-3. But for smaller models and specific complex reasoning tasks, explicit CoT prompting remains valuable.