Techniques & Methods

Chain-of-Thought

Chain-of-thought (CoT) is a prompting technique that asks a language model to produce step-by-step intermediate reasoning before giving a final answer. Wei et al. (Google Brain, 2022) showed that few-shot prompts containing worked reasoning examples substantially improve accuracy on math, logic, and multi-step reasoning tasks; the canonical zero-shot trigger phrase, "Let's think step by step", comes from Kojima et al. (2022). By giving the model space to externalize its reasoning, CoT trades latency and token count for accuracy on tasks where direct answers frequently fail.

CoT works because language models generate tokens autoregressively: each token is conditioned on the ones before it. When the model articulates intermediate steps, those tokens become context for later decisions, effectively giving the model "working memory" that would not exist if it tried to output a final answer directly. This is why CoT helps most on problems requiring multiple inference steps and offers little benefit on problems solvable in a single step.

CoT has scaled beyond simple prompting into multiple variants: zero-shot CoT (just add "think step by step"); few-shot CoT (provide worked examples showing the desired reasoning pattern); self-consistency CoT (sample multiple reasoning chains and pick the majority answer); Tree of Thoughts (explore multiple reasoning branches and prune); and Graph of Thoughts (allow non-linear reasoning structures). Each adds complexity for incremental accuracy gains on harder problems.

In 2024-2026, "reasoning models" — OpenAI o1/o3, DeepSeek R1, Claude with extended thinking, Gemini Flash Thinking — productized CoT by training models to do extended internal reasoning before outputting a response. The user no longer needs to prompt for CoT explicitly; the model decides when to think and how deeply. Reasoning models substantially outperform non-reasoning models on math, code, and complex analysis tasks at the cost of higher latency and per-query cost.

CoT has well-known limitations: it adds latency and token cost (longer outputs are slower and more expensive); it can be misleading (the reasoning trace may look plausible but reach a wrong answer); and faithfulness is imperfect (the model's stated reasoning may not reflect the actual computation that produced the answer). For high-stakes use, CoT outputs should be verified, not trusted as authoritative explanations.
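As a small illustration of verifying a trace rather than trusting it, the sketch below re-computes simple "a op b = c" steps found in a reasoning trace and reports any that do not hold. The function name `find_arithmetic_errors` and the sample trace are hypothetical. The check only covers two-operand arithmetic (chained expressions such as "1 + 2 + 3 = 6" would be misread) and says nothing about whether the stated reasoning is faithful to the model's actual computation.

```python
import operator
import re

# Supported binary operators for the re-computation.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

# Matches simple steps such as "7 * 12 = 84" (two operands only).
NUM = r"-?\d+(?:\.\d+)?"
STEP = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")


def find_arithmetic_errors(trace: str) -> list[str]:
    """Return a description of every arithmetic step in the trace that is wrong."""
    errors = []
    for a, op, b, claimed in STEP.findall(trace):
        if op == "/" and float(b) == 0:
            errors.append(f"{a} / {b} = {claimed} (division by zero)")
            continue
        actual = OPS[op](float(a), float(b))
        if abs(actual - float(claimed)) > 1e-9:
            errors.append(f"{a} {op} {b} = {claimed} (expected {actual:g})")
    return errors


# Hypothetical trace: fluent and plausible, but the second step is wrong.
trace = (
    "There are 7 packs of 12 pens: 7 * 12 = 84. "
    "Remove the broken ones: 84 - 5 = 78. The answer is 78."
)
print(find_arithmetic_errors(trace))
```

A check like this catches only one narrow class of error; for high-stakes use it would sit alongside independent verification of the final answer.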

Why it matters in GEO / AI search

For content publishers, CoT is mostly relevant indirectly, as one mechanism behind why AI search engines produce more substantive, reasoned answers in 2026 than they did in 2023. ChatGPT Search, Perplexity Pro, and Claude with extended thinking can all reason internally before generating citations. The likely downstream effect is that they are better at evaluating source quality, less likely to cite low-quality pages, and more likely to weight pages whose claims are clearly substantiated.

This raises the bar for citation. Pages that get cited in 2026 are pages the model evaluates favorably during its reasoning step — fact-dense, attributable, internally consistent, and structurally clear. Pages with vague claims, unverified statistics, or contradictory statements get filtered out at the reasoning stage even if they're structurally well-optimized. The implication for GEO is that on-page accuracy and substantiation matter more than ever; surface-level optimization without substance underperforms.

For internal AI-driven workflows (research, analysis, content production), CoT is the practical reason to use reasoning models for important tasks and fast non-reasoning models for routine ones. A research synthesis benefits dramatically from o3 or Claude with extended thinking; a quick autocomplete or classification benefits from GPT-4o or Haiku. Knowing which to deploy when is operational maturity.
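The deployment choice described above could be sketched as a simple routing rule. Everything here is an assumption for illustration: the task categories, the `high_stakes` flag, and the placeholder model labels are hypothetical and would be replaced by whatever models and task taxonomy a team actually uses.

```python
# Placeholder labels, not real API model identifiers (assumption).
REASONING_MODEL = "reasoning-model"  # e.g. a model with extended thinking
FAST_MODEL = "fast-model"            # e.g. a small, low-latency model

# Hypothetical task categories that tend to need multi-step reasoning.
MULTI_STEP_TASKS = {"research_synthesis", "analysis", "math", "code"}


def choose_model(task_type: str, high_stakes: bool = False) -> str:
    """Route multi-step or high-stakes tasks to the reasoning model."""
    if high_stakes or task_type in MULTI_STEP_TASKS:
        return REASONING_MODEL
    return FAST_MODEL


for task in ("research_synthesis", "classification", "autocomplete"):
    print(task, "->", choose_model(task))
```

A static rule like this is the simplest option; it pays the reasoning model's latency and cost only where multi-step accuracy is likely to matter.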

Examples

Zero-shot CoT

Append "Let's think step by step" to a math word problem. On older or smaller models, accuracy on multi-step problems can improve substantially (gains of roughly 20-40% are commonly cited, though results vary by model and benchmark). Frontier models often reason this way implicitly even when not prompted.
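A minimal sketch of zero-shot CoT prompt construction follows. It only builds the prompt string; sending it to a model is left out because that depends on the provider. The helper name `zero_shot_cot_prompt`, the "Q:/A:" layout, and the sample question are illustrative assumptions.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str, trigger: str = COT_TRIGGER) -> str:
    """Return the question followed by the reasoning trigger phrase.

    Placing the trigger at the start of the answer slot nudges the model
    to write out intermediate steps before committing to a final answer.
    """
    return f"Q: {question.strip()}\nA: {trigger}"


prompt = zero_shot_cot_prompt(
    "A shop sells pens in packs of 12. "
    "How many pens are in 7 packs if 5 pens are broken and removed?"
)
print(prompt)
```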

Few-shot CoT

Provide 2-3 worked examples showing the desired step-by-step reasoning pattern, then ask the model to apply the same pattern to a new problem. More effective than zero-shot for specialized formats.
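One way to assemble such a prompt is sketched below. The `WorkedExample` type, the "The answer is X." convention, and the sample problems are assumptions chosen for illustration; any consistent format that shows the reasoning before the answer would serve the same purpose.

```python
from typing import NamedTuple


class WorkedExample(NamedTuple):
    question: str
    reasoning: str  # the step-by-step working shown to the model
    answer: str


def few_shot_cot_prompt(examples: list[WorkedExample], question: str) -> str:
    """Build a prompt of worked examples followed by the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # Leave the final answer slot open so the model continues the pattern.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    WorkedExample(
        "Tom has 3 boxes with 4 apples each. How many apples does he have?",
        "Each box has 4 apples and there are 3 boxes, so 3 * 4 = 12.",
        "12",
    ),
    WorkedExample(
        "A train travels 60 km per hour for 2 hours. How far does it go?",
        "Distance is speed times time, so 60 * 2 = 120.",
        "120 km",
    ),
]
print(few_shot_cot_prompt(examples, "A pack holds 12 pens. How many pens are in 7 packs?"))
```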

Self-consistency CoT

Sample the same CoT prompt 5-20 times with temperature > 0, then take the majority answer. Boosts accuracy on hard reasoning tasks at 5-20x the inference cost.
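The voting step can be sketched as follows. To keep the example runnable without a model, `make_stub_sampler` is a hypothetical stand-in that cycles through canned reasoning chains; in practice `sample_chain` would call a model with temperature > 0. The "The answer is N" extraction pattern is also an assumption about the output format.

```python
import re
from collections import Counter
from itertools import cycle
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Pull the final numeric answer out of a reasoning chain, if present."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def self_consistent_answer(
    sample_chain: Callable[[], str], n_samples: int = 10
) -> Optional[str]:
    """Sample n reasoning chains and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain())
        if answer is not None:  # skip chains with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def make_stub_sampler() -> Callable[[], str]:
    """Stand-in for a model call: cycles through canned chains (assumption)."""
    chains = cycle([
        "7 * 12 = 84, then 84 - 5 = 79. The answer is 79.",
        "12 * 7 = 84. Removing 5 leaves 79. The answer is 79.",
        "7 * 12 = 84. The answer is 84.",  # forgot to subtract
        "84 pens minus 5 broken is 79. The answer is 79.",
        "I am not sure how to solve this.",  # no parseable answer
    ])
    return lambda: next(chains)


print(self_consistent_answer(make_stub_sampler(), n_samples=10))
```

Each sample is a full generation, which is why the cost scales linearly with the number of samples.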

Productized reasoning (o1 / o3 / extended thinking)

Modern reasoning models do extended internal CoT before responding, sometimes thinking for minutes on complex problems. Depending on the product, the raw reasoning trace may be hidden, summarized, or shown in full; either way, the main benefit to the user is a more accurate final answer.

Authority Links

Chain-of-Thought Paper — arXiv 2201.11903

Wei et al., the original Google Brain paper introducing chain-of-thought prompting.

IBM — CoT Prompting

How chain-of-thought prompting improves LLM reasoning.

Self-Consistency — arXiv 2203.11171

Wang et al., self-consistency sampling as an extension of CoT.

Related Terms

Prompt Engineering

The discipline of designing input text — instructions, examples, constraints, and context — to reliably steer a language model toward accurate, well-formatted, and intent-aligned outputs without modifying model weights.

Few-Shot Learning

A model's ability to generalize from only a handful of labeled examples.

Prompt

Text input provided to an AI model to guide the content and format of its response.

Evaluation Metrics

Quantitative measures used to assess how well an AI model performs on a task.
