Chain-of-Thought Prompting — AI Glossary

When a language model jumps directly from a question to an answer, it collapses all intermediate reasoning into a single prediction step. For simple tasks this works fine. For anything requiring multi-step logic — arithmetic, commonsense reasoning, symbolic manipulation, planning — the error rate climbs because every intermediate inference is made implicitly and cannot be checked or corrected.

Chain-of-Thought (CoT) prompting addresses this by making the model externalise its reasoning. Instead of predicting the answer directly, the model first generates the reasoning steps, then arrives at a conclusion supported by that visible chain. The result: higher accuracy, more interpretable outputs, and errors that are catchable mid-stream rather than baked into a confident wrong answer.

The Original Finding

Introduced by Wei et al. at Google Brain in 2022, the core experiment was straightforward: provide a few-shot prompt where each example includes not just the question and answer, but the reasoning steps between them. On arithmetic word problems and commonsense benchmarks, this single change produced dramatic accuracy gains on models at sufficient scale (roughly 100B+ parameters). Smaller models showed minimal benefit — CoT is an emergent capability of large enough models.

Example — standard prompting:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now? A: 11.

Example — chain-of-thought prompting:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now? A: Roger starts with 5 balls. 2 cans × 3 balls = 6 balls purchased. 5 + 6 = 11. The answer is 11.

The answer is the same, but the model that produced the second response is far less likely to make arithmetic errors on harder variants of the same problem.

Diagram comparing direct prompting (Question → Answer) vs chain-of-thought prompting (Question → Step 1 → Step 2 → Step 3 → Answer)

Zero-Shot Chain-of-Thought

A follow-up paper (Kojima et al., 2022) discovered that simply appending “Let’s think step by step” to a prompt — with no examples at all — triggers similar chain-of-thought reasoning in capable models. This eliminated the need to write demonstrations, making CoT trivially accessible for any task.

Zero-shot CoT works because the phrase “let’s think step by step” activates patterns from the model’s training data where careful, sequential reasoning appears — academic writing, worked mathematical solutions, structured analysis. It’s a shortcut to the same behavioural mode that few-shot examples induce.

Self-Consistency: Ensembling CoT

A significant extension of CoT (Wang et al., 2022) noted that generating a single chain of thought is still a single sample from a distribution. The self-consistency technique:

Generates multiple independent chains of thought for the same question (e.g., 10–40 samples)
Extracts the final answer from each
Returns the answer that appears most frequently (majority vote)

Self-consistency dramatically improves accuracy on hard reasoning benchmarks — often by 10–20 percentage points over single-sample CoT — at the cost of proportionally higher inference compute. It’s the go-to technique when accuracy matters more than latency or cost.

Tree of Thoughts (ToT)

Tree of Thoughts (Yao et al., 2023) generalises CoT from a linear chain into a search tree. The model:

Generates several candidate next steps (branches)
Evaluates each branch for promise (using the model itself as an evaluator)
Expands the most promising branches, pruning the rest
Repeats until a final answer is reached

ToT is particularly effective for tasks where early decisions determine the solution space — creative writing with constraints, multi-step planning, mathematical proofs. The tradeoff is significantly higher inference cost and latency; ToT is impractical for high-throughput applications.

CoT in Modern Frontier Models

By 2025–2026, chain-of-thought reasoning has moved from a prompting technique to a built-in training objective for frontier reasoning models. OpenAI’s o-series, Google’s Gemini Thinking, Anthropic’s extended thinking mode, and DeepSeek-R1 all train models to generate long reasoning chains (often called “thinking tokens” or “scratchpads”) before producing final outputs.

These test-time compute approaches differ from prompted CoT in a key way: the reasoning chain is part of the model’s learned behaviour, not injected by the user. The model has been trained with RL to produce chains that maximise answer correctness, making the reasoning process itself an optimised artifact rather than a human-specified format.

When to Use Chain-of-Thought

CoT helps most when:

The task involves multiple steps where order matters (arithmetic, algebra, logical deduction)
The task requires combining several facts from context into a conclusion
You need interpretable reasoning — to catch errors or explain outputs to stakeholders
The model is making mistakes on direct-answer prompting and you suspect it’s “jumping to conclusions”

CoT adds less value for:

Classification or extraction tasks where the answer is a direct lookup
Simple factual retrieval from well-established knowledge
High-latency-sensitive applications where generating reasoning tokens is too slow

Practical Implementation

For prompted CoT (without built-in reasoning models):

Few-shot: Include 3–5 examples with explicit reasoning steps matching your task domain
Zero-shot: Append “Let’s think step by step” or “Think through this carefully before answering”
Format control: Ask for reasoning inside <thinking> tags and the final answer separately — helps downstream parsing
Self-consistency: Use when accuracy is critical; route to majority vote across 10+ samples

For built-in reasoning models (o-series, extended thinking):

Enable extended thinking / reasoning mode via the model parameter
Set an appropriate thinking token budget — more tokens generally yield higher accuracy on hard problems but increase latency and cost
Don’t instruct the model on how to reason — it’s already trained to do this; just describe the task clearly

2025–2026: The Changing Value of CoT

CoT prompting is becoming less necessary as reasoning models mature. A June 2025 Wharton report (“The Decreasing Value of Chain of Thought in Prompting”, arXiv:2506.07142) found that CoT instructions provide only marginal accuracy gains for advanced reasoning models (o3, Gemini 2.5 Pro, Claude 4) on most tasks — and can increase latency 20–80% with negligible quality improvement. For these models, the reasoning chain is already built into training via RLVR; adding explicit CoT instructions in the prompt is largely redundant.

The practical implication: use prompted CoT for non-reasoning models; trust built-in reasoning for models trained with extended thinking. Mixing explicit CoT instructions into prompts for o-series or Claude’s extended thinking mode can actually interfere with the model’s learned reasoning strategy.

Extended CoT structures. The research frontier has moved beyond linear chains to richer structures:

Graph-of-Thought (GoT): The model generates a directed graph of reasoning steps rather than a linear chain, enabling parallel exploration of independent sub-problems before synthesis. More powerful than ToT for problems with independent sub-questions; higher token cost.
Program-of-Thought (PoT): For quantitative reasoning, the model generates executable code (Python) as its reasoning trace rather than natural language. The code is run, and the output is the answer — eliminating arithmetic errors that even CoT-equipped models make.
Mixture-of-Thoughts: Dynamically selects the most appropriate reasoning structure (chain, tree, or graph) for each input based on problem type.
Contrastive CoT: Includes both correct and incorrect reasoning examples in few-shot prompts, teaching the model to recognise and avoid faulty reasoning patterns.

Verification is the active frontier. A persistent reliability concern: CoT chains can be internally consistent but factually wrong — the model reasons confidently to a bad conclusion. Research into automated CoT verification (checking the reasoning graph for logical consistency and factual grounding) has intensified through 2025–2026, with approaches including computational graph verification and critic models trained to evaluate reasoning quality step-by-step.

How to Use — Zero-shot and few-shot CoT with extended thinking

python

from anthropic import Anthropic

client = Anthropic()

# Zero-shot CoT: "think step by step" instruction in the prompt
zero_shot = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "A store sells apples for $0.75 each and oranges for $1.20 each. "
            "Alice buys 4 apples and 3 oranges, pays with a $10 bill. "
            "How much change does she receive?\n\n"
            "Think step by step before giving your final answer."
        ),
    }],
)
print(zero_shot.content[0].text)

# Extended thinking: Claude reasons privately, then answers (Claude 3.7+)
thinking = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{
        "role": "user",
        "content": (
            "A train leaves City A at 08:00 travelling at 90 km/h. "
            "Another train leaves City B (300 km away) at 09:30 travelling at 120 km/h toward City A. "
            "At what time do they meet?"
        ),
    }],
)
for block in thinking.content:
    if block.type == "thinking":
        print("=== Reasoning ===\n", block.thinking)
    elif block.type == "text":
        print("=== Answer ===\n", block.text)

Ready to build?

Leverage AI technologies to build your product stack

Superteams can help you build, deploy and launch AI application stacks using open source technologies — from architecture through to production.

Talk to Superteams