R19 — Step-Back Prompting
Before answering a specific question, ask a more abstract version of it, derive the underlying principle or concept, and then specialise that principle back to the original — so reasoning starts from a level the model handles more reliably than the specific case.
Also Known As: Abstraction Prompting, Take-a-Step-Back, Principle-First Reasoning. (Step-Back as a retrieval-key transformation is the Step-Back variant of K2 Query Transformation; the same abstraction move, applied at a different layer.)
Classification: Category III — Reasoning · Band III-B Structured single-shot reasoning · a two-call pattern that lifts the question one level of abstraction before answering it.
Intent
Improve reasoning on specific, detail-heavy questions by first answering a strictly more abstract version of them — extracting the underlying principle, concept, or class of fact — and then applying that principle back to the specific case.
Motivation
A capable model often fails on a specific question while it can answer the general one underneath it. Ask "What happens to the pressure of an ideal gas if its temperature triples and its volume halves?" and a model may compute confidently and wrongly. Ask "What law relates the pressure, volume, and temperature of an ideal gas?" and it returns PV = nRT without hesitation. The specific question buries the principle in particulars; the general question surfaces it.
R1 Zero-Shot CoT and R2 Few-Shot CoT add intermediate reasoning steps but stay at the original level of abstraction — they reason within the specific problem (mechanism 7 — each step is a forward-only stochastic sampling conditioned on the specific tokens in the prompt). R3 Plan-and-Solve generates a plan: an ordered list of concrete steps. Neither of these lifts the problem. The recurring failure mode they leave unaddressed is one in which the model's relevant knowledge is stored at a more general level than the question asks about, and chain-of-thought reasoning over the specific surface produces fluent-but-wrong intermediate steps because the relevant principle was never made explicit. The mechanistic account is attention geometry (mechanism 1): the learned Q-K bilinear form (W_Q W_K^T applied to question embeddings) associates specific-detail tokens (temperatures, pressures, numeric values) with different K vectors than principle tokens (law names, conceptual categories). A highly specific question generates Q vectors that may not have high inner product with the K vectors encoding the relevant law. The step-back question generates Q vectors over the principle domain directly, yielding strong Q·K contractions with the stored law representations. Abstraction is a Q-vector repositioning operation (mechanism 1).
Zheng et al. (2023) — Take a Step Back — formalise the fix. Insert one preliminary call whose only job is to produce a more abstract question: the underlying concept, the relevant law, the general case. Answer that. Then answer the original question with the abstract answer in context. Empirically this is worth +7 points on MMLU Physics, +11 on Chemistry, +27 on TimeQA, +7 on MuSiQue (PaLM-2L). The defining claim of the pattern: a question one level of abstraction up is easier to answer correctly, and its answer is the principle the specific question needed all along.
The same abstraction move is fundamental enough to appear at a different layer of the system: K2 Query Transformation has a Step-Back variant in which the retrieval key is abstracted, so the retriever can fetch the underlying-principle passage even when the user asked a very specific question. R19 is the reasoning-chain application; K2's variant is the retrieval-key application. Same move, different participant being lifted.
Applicability
Use Step-Back Prompting when:
- the question is specific and detail-heavy but rests on a generalisable concept, law, or class the model knows;
- a single CoT pass produces confident-but-wrong intermediate steps that ignore the relevant principle;
- the task domain has named principles or concepts (physics laws, legal doctrines, biological mechanisms, accounting standards) and a successful answer reduces to "apply principle X";
- the system has a retrieval layer and the abstract answer is more likely to be in the corpus than the specific answer.
Do not use Step-Back when:
- the question is already at the right level of abstraction — lifting further produces a uselessly general answer (use R1 Zero-Shot CoT);
- the task is procedural and the model needs a plan, not a principle (use R3 Plan-and-Solve);
- the failure mode is search over a solution space rather than missing the right principle (use R9 Tree of Thoughts or R10 LATS);
- correctness depends on computation rather than concept retrieval (use R14 Program of Thoughts);
- the latency budget cannot absorb a second LLM call on every query.
Decision Criteria
R19 is right when CoT keeps producing fluent-but-wrong reasoning that elides the very principle the model knows under a different name.
1. Diagnose the failure mode. On a labelled set, take the model's CoT trace on each failed case. Ask: did the model know the relevant principle, or not know it? If the principle is missing — the trace never names it, but the model would name it instantly when asked the general question — R19 fits. If the principle is named in the trace but applied wrongly, the failure is computational; use R14 Program of Thoughts or step the model up.
2. Test the abstract-answer hit rate. Manually rewrite 20 failing queries into their step-back form and measure: can the model answer the abstract version correctly? If yes for $\geq$70% of them, R19 will lift the specific-case accuracy too. If the abstract version also fails, the model lacks the underlying knowledge and K1 Vanilla RAG or K5 Adaptive RAG is the better intervention.
3. Cost the second call. R19 doubles the LLM call count per question (the Abstractor + the Specialiser). Confirm the latency budget tolerates it; confirm the per-query cost increase is acceptable. If only some queries need it, O3 Routing (route by question type) keeps R19 off the path for the rest.
4. Pick where the abstraction lives. Reasoning-chain (R19) or retrieval-key (K2 Step-Back variant)? Use R19 when the model already has the principle in its weights and you need it surfaced explicitly. Use the K2 variant when the principle lives in a corpus and you need to retrieve the right passage. Both, when the principle lives in the corpus and the reasoning step is non-trivial.
5. Few-shot the Abstractor. The single largest tuning lever is the prompt that generates the step-back question. Without 3–5 worked examples ("specific: … $\to$ step-back: …"), the Abstractor under-abstracts or over-abstracts. Treat the Abstractor's few-shot bank as the pattern's main artefact.
Quick test — R19 is the right pattern when:
- the model's CoT trace on failing queries omits a principle it can recite when asked directly, and
- abstract-form versions of those queries succeed on the same model, and
- the latency budget tolerates two LLM calls per question, and
- the domain has named principles or concepts the abstraction can land on.
If the abstract version also fails, the model lacks the knowledge — use K1 Vanilla RAG or K5 Adaptive RAG. If the failure is computation rather than concept missing, use R14 Program of Thoughts. If you need a step-by-step plan rather than an underlying principle, use R3 Plan-and-Solve. If you need to search a space of approaches, use R9 Tree of Thoughts.
Structure
Specific question
│
▼
Abstractor (LLM) ──▶ Step-back question (one level more general)
│
▼
Principle Reasoner (LLM, often same model) ──▶ Principle / general answer
│
▼
Specialiser (LLM, often same model) ──▶ Specific answer
│ ▲
│ │
└── original question + principle as context ──┘
The shape is an inverted pyramid: lift, derive, descend. Two LLM calls minimum (one if Principle and Specialiser are fused into a single grounded-generation step), with the original question carried through the lift so the specialisation step has both the principle and the case to apply it to.
Participants
| Participant | Owns | Input $\to$ Output | Must not |
|---|---|---|---|
| Specific question | the case to be answered | — $\to$ the user's question | be skipped or paraphrased mid-flow — the Specialiser must answer the original question, not a paraphrase of it. |
| Abstractor (LLM) | producing the step-back question | specific question $\to$ more-abstract question | answer the question. Its only job is to name the underlying concept / law / class. An Abstractor that also answers degenerates the pattern into CoT. |
| Step-back question | the lifted version | — $\to$ general question | be so abstract that the answer cannot be specialised back, or so close to the specific that no abstraction has happened. The few-shot examples are what calibrate this. |
| Principle Reasoner (LLM) | answering the abstract question | step-back question (+ optional retrieved context) $\to$ principle | apply the principle to the specific — that is the Specialiser's job. Keep this answer general; specifics here cause confusion. |
| Specialiser (LLM) | applying the principle to the specific | original question + principle $\to$ specific answer | re-derive the principle, or ignore it. Both are common failure modes: the model can re-justify a wrong specific answer despite the principle being in context. |
| Few-shot examples | calibrating the Abstractor | — $\to$ 3–5 (specific, step-back) pairs | be generic — the examples must come from the same domain as the queries. Cross-domain examples teach the wrong level of abstraction. |
Five participants with one prompt artefact. The Abstractor / Reasoner / Specialiser are typically the same model in three different configured sessions — what differs is the role and the prompt, not the weights.
Collaborations
A specific question arrives. The Abstractor — primed with 3–5 worked (specific $\to$ step-back) examples from the task domain — produces one more-abstract question that names the underlying concept, law, or class the specific case belongs to. The Principle Reasoner answers that step-back question, optionally with retrieved context if the system has a retrieval layer. The Specialiser then receives the original question and the principle as context, and produces the specific answer by applying the principle to the case. In retrieval-augmented systems the principle is often retrieved (K1) rather than reasoned out, and the Specialiser becomes a grounded-generation step over both retrieval pools (original question + step-back question). Bounded recovery is rarely needed because the pattern is two-shot, not iterative — if either the abstract answer or the specialisation is wrong, R19 fails clean.
Consequences
Benefits
- Surfaces principles the model knows but does not spontaneously deploy under chain-of-thought.
- Composes cleanly with retrieval — the step-back question often retrieves the canonical principle passage where the specific question does not.
- Cheap: two LLM calls, no search, no iteration.
- Inspectable: the principle is a separate intermediate output the operator can audit.
Costs
- Doubles per-query LLM calls. On a typical reasoning task, +1 latency unit and ~2$\times$ token cost vs Zero-Shot CoT.
- Demands a few-shot bank per domain. Without it the Abstractor either under-abstracts (rephrasing) or over-abstracts (uselessly general).
- The Specialiser is non-trivial: the model must apply a principle, not just recite it. Some failures persist after the principle is in context.
Risks and failure modes
- Wrong abstraction. The Abstractor lifts the question along the wrong axis — abstracting time when the relevant principle is geometric. The Reasoner then answers an irrelevant general question. Mitigation: domain-matched few-shot examples.
- Specialisation collapse. The Specialiser receives the principle but ignores it, re-deriving a wrong specific answer from scratch. Mitigation: explicit prompt instruction ("apply the principle in the context to the question; do not re-derive it").
- Trivial abstraction. The step-back question is a near-paraphrase of the original; no lift has happened. Mitigation: in the few-shot examples, choose pairs whose abstraction is clearly two or more levels up.
- Confident wrong principle. The Reasoner asserts a principle that does not hold; the Specialiser dutifully applies it. The final answer is fluent and structurally correct but factually false. R19 has no internal mechanism to catch this — pair with K1 Vanilla RAG so the principle is retrieved rather than asserted, or with V15 LLM-as-Judge to grade the principle before specialisation.
Implementation Notes
- The Abstractor and Specialiser can be the same model in two sessions; the setup is what differs. Both can run on the system's main generalist — Step-Back's value is in the structure, not the model.
- The few-shot examples are the pattern's centre of gravity. Treat them as a Signal-layer artefact: version them, evaluate them, regenerate them when the task domain shifts. The Abstractor's few-shot bank is static across all calls for a given domain — a cacheable prefix (mechanism 5 — provider caches key on stable prefixes; a 1024+ token stable setup qualifies for a ~5-min TTL cache read at ~10% of normal input cost). Place the domain-specific few-shot examples in the Abstractor's setup; under Anthropic caching rules a 1024+ token stable setup reads at ~10% of normal input token cost on a cache hit.
- Pair with retrieval (K1 or K2's Step-Back variant) on knowledge-intensive tasks: retrieve once on the original query, once on the step-back query, concatenate, generate. The paper does exactly this for TimeQA and MuSiQue; the +27 / +7 gains are with retrieval, not weights alone.
- The Specialiser prompt should explicitly say "apply the principle from the context; do not re-derive it from the original question". Without that instruction the model often duplicates work and reaches different conclusions.
- A degenerate failure to watch for: the Abstractor asks a step-back question whose answer is already the specific answer ("What was X's height in 1995?" $\to$ "What was X's height history?"). The lift must move to a concept, not a broader fact.
- For systems already running CoT, R19 is a one-prompt upgrade — frame it as "the model's first call generates a step-back question; the second call answers the original with that question's answer in context."
Implementation Sketch
LLM= configured session (model + setup + per-call prompt);code= wiring.
Composition: R19 chains an Abstractor and a Specialiser — same model, different sessions. In retrieval-augmented use it composes with K1 Vanilla RAG (retrieve on both queries) and optionally K2 Query Transformation (the Step-Back variant inside the retriever). The Abstractor's calibration is a Signal-layer concern (S2 Few-Shot, S6 Output Template). For accuracy-critical use, V15 LLM-as-Judge can grade the principle before specialisation.
The chain:
| # | Step | Kind | Draws on |
|---|---|---|---|
| 1 | Abstract — generate the step-back question | LLM | Abstractor session |
| 2 | (optional) Retrieve on the original and on the step-back question | code | K1 (twice) |
| 3 | Reason — answer the step-back question (using retrieved context if present) | LLM | Principle Reasoner session |
| 4 | (optional) Judge — grade the principle before applying it | LLM | V15 LLM-as-Judge |
| 5 | Specialise — answer the original question with the principle in context | LLM | Specialiser session |
In the retrieval-augmented form (the paper's strongest result), steps 2 and 3 can collapse: retrieve on both queries, concatenate the chunks, and the Specialiser does grounded generation over the lot. Two LLM calls (Abstractor + Specialiser) plus two retrieval calls.
Skeleton:
step_back(question):
sb_q = Abstractor(question) # LLM
# optional retrieval — K1 invoked twice
ctx_orig = K1.retrieve(question) # code
ctx_sb = K1.retrieve(sb_q) # code
principle = PrincipleReasoner(sb_q, ctx_sb) # LLM
return Specialiser(question, principle, ctx_orig) # LLM
The LLM sessions:
| Session | Model | Setup — loaded once, before first call | Per-call prompt wraps |
|---|---|---|---|
| Abstractor | the system's main generalist (or a small fast model — abstraction is cheap) | role ("you generate a more-abstract version of the user's question that names the underlying concept, law, or class — you do not answer it"); 3–5 few-shot pairs (specific: … → step-back: …) from the same domain as the queries; output contract (one line, no preamble) | the specific question |
| Principle Reasoner | the system's main generalist | role ("you answer general questions about concepts, laws, or classes — keep your answer principled and not tied to specific cases"); if retrieval present, the grounding rules (S6 citation contract) | the step-back question + (optional) retrieved context |
| Specialiser | the system's main generalist | role ("you apply a stated principle to a specific case — use the principle in the context; do not re-derive it"); answer format (S6) | original question + principle (+ optional retrieved context on the original) |
Concretely, the Abstractor's setup carries something like: "Your job is to paraphrase the question into a more generic step-back question, easier to answer. Examples: Specific: Could the members of The Police perform lawful arrests? → Step-back: What can the members of The Police do? …" That few-shot block is the entire calibration mechanism.
Specialist-model note. None — a capable generalist suffices for all three sessions, and they are typically the same generalist in three configured sessions. The pattern's leverage is in the prompt artefact (the Abstractor's few-shot bank), not in a fine-tuned model. A single weaker model can serve the Abstractor (the abstraction step is undemanding) while the Specialiser uses the stronger model — a cost optimisation, not a build dependency.
Open-Source Implementations
- LangChain
stepback-qa-promptingtemplate —github.com/langchain-ai/langchain/tree/v0.1/templates/stepback-qa-prompting— the canonical reference implementation: few-shot Abstractor, dual retrieval on original + step-back queries, single-call Specialiser. Lives under the v0.1 templates tree (LangChain templates were not carried into v0.2+; the v0.1 ref is stable). - Original paper code — no official Google / DeepMind release accompanies the Zheng et al. paper; the technique is prompt-only, so the paper itself is the reference. Multiple independent reproductions exist on GitHub (e.g. small reproduction studies on physics QA, advanced-RAG repos integrating step-back as a query-rewriting stage), but none is canonical.
learnprompting.org/vocabulary/step-back_prompting— the clearest public walkthrough with worked examples (not a library, but the closest thing to a normative spec outside the paper).- General note — Step-Back is a prompt pattern, not a runtime architecture. There is no "Step-Back library" the way LangChain is a library; the LangChain template and its many derivatives are wrappers around the two-call structure plus a few-shot bank. If you want the pattern, build the two sessions and the few-shot bank — that is the implementation.
Known Uses
- LangChain-based RAG assistants ship the step-back template as an option for knowledge-intensive QA — the dual-retrieval (original + step-back) recipe is standard practice on enterprise RAG stacks where the corpus contains both specific facts and the principles those facts instantiate.
- Advanced-RAG pipelines combine HyDE (K2 variant) and Step-Back (K2 variant or R19) for complex query rewriting; community repos demonstrate the combination for legal, medical, and financial QA.
- Reasoning benchmarks — TimeQA and MuSiQue reproductions consistently use Step-Back as a baseline for multi-hop and temporal reasoning, where the gains over Zero-Shot CoT are largest.
- Educational and tutoring agents use the pattern as a pedagogical scaffold: the step-back question is itself surfaced to the learner ("first, what general principle applies here?") before the specific answer is given.
Related Patterns
- Shares mechanism with K2 Query Transformation (Step-Back variant) — the same abstraction move applied to the retrieval key rather than the reasoning chain. R19 lifts the question the LLM is reasoning about; K2's variant lifts the question the retriever is searching for. The two compose naturally in a RAG stack: K2 lifts the search, R19 lifts the reasoning, both can run in the same query.
- Distinct from R1 Zero-Shot CoT — R1 reasons step-by-step at the original level of abstraction; R19 lifts the level once, then reasons. R19 uses CoT internally inside the Reasoner / Specialiser sessions, but the abstract-then-specialise move is what makes it a distinct pattern.
- Distinct from R3 Plan-and-Solve — R3 generates a step-by-step plan (sequence of concrete sub-actions); R19 generates a more-abstract question (one principle to apply). Plans are procedural; step-backs are conceptual. They can compose: a plan whose first step is "identify the relevant principle" is essentially R3 wrapping R19.
- Composes with K1 Vanilla RAG — retrieving on both the original and the step-back query is the paper's strongest configuration; the step-back retrieval often finds the principle passage that the specific query misses.
- Composes with K5 Adaptive RAG — when the abstract answer is also out-of-corpus, K5's fallback path handles it; R19 raises the floor, K5 catches the residual misses.
- Pairs with S2 Few-Shot and S6 Output Template — the Abstractor's few-shot bank and the Specialiser's "apply, do not re-derive" output contract are Signal-layer artefacts that carry most of the pattern's quality.
- Pairs with V15 LLM-as-Judge — in accuracy-critical use, grading the principle before specialisation catches the confident wrong principle failure mode that R19 alone cannot.
- Note on fundamentality — R19 is fundamental despite using CoT internally because the abstract-then-specialise move introduces a distinct participant (the Abstractor) and a distinct Structure (inverted pyramid) absent from R1/R2. That the same move also appears as a K2 variant — applied to a different participant in a different category — is confirming evidence of its fundamentality, not a reason to merge: the two applications cannot be collapsed into one pattern because the participant being lifted is different (reasoning chain vs retrieval key).
Sources
- Zheng, H. S., Mishra, S., Chen, X., Cheng, H.-T., Chi, E. H., Le, Q. V., & Zhou, D. (2023) — "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models" (arXiv 2310.06117; ICLR 2024).
- LangChain
stepback-qa-promptingtemplate (v0.1 branch) — reference implementation in code. - Learn Prompting — Step-Back Prompting vocabulary entry (clearest public walkthrough; non-paper).
- Unite.AI — "Analogical & Step-Back Prompting: A Dive into Recent Advancements by Google DeepMind" (independent review of the technique).