R6 — Self-Ask

Decompose a compositional question into explicit follow-up sub-questions, answer each one (optionally via a tool or retriever), then compose the final answer from the intermediate answers.

Also Known As: Follow-Up Question Decomposition, Compositional Decomposition, Self-Ask Prompting. (Self-Ask-with-Search noted in Variants.)

Classification: Category III — Reasoning · Band III-A Linear chains · the question-decomposition pattern — sibling of R1/R2 CoT (unstructured chain) and R3 Plan-and-Solve (action plan); R6 is structured by sub-questions rather than by reasoning steps or action steps.


Intent

Close the compositionality gap — the failure mode in which a model can answer each sub-fact of a multi-hop question individually but cannot combine them — by forcing the model to ask and answer its own follow-up questions before composing the final answer.

Motivation

Press et al. (2022) named and measured a specific failure: models that know fact A and know fact B nonetheless get the question "A combined with B" wrong. They called the fraction of compositional questions a model answers incorrectly, out of those whose sub-questions it answers correctly, the compositionality gap, and found that it does not narrow as model scale grows: bigger models recall facts better but do not compose them correspondingly better. Scale alone does not fix this.

Why not? Because a single decode of a compositional question commits to producing the final answer in one shot. The model never explicitly names the sub-facts it needs; it tries to weave them into the answer in one pass, and any missed hop becomes a fluent-sounding hallucination. The compositionality gap is a consequence of autoregressive stochastic sampling (mechanism 7): each token is generated forward-only, so once the answer token is committed the model cannot revise it, even if later reasoning steps contradict it. Naming the sub-questions before answering them defers the answer token until all the conditioning context is present. Chain-of-Thought (R1, R2) helps because emitting reasoning tokens creates room to surface intermediate facts, but CoT is unstructured prose, and the model can still skip the hop, restate the question, or rationalise the wrong answer.

Self-Ask's contribution is structural: it imposes a rigid Q/A scaffold — Follow up: … / Intermediate answer: … — that the model fills in turn by turn before emitting So the final answer is: …. The structure forces the decomposition to be named and checkable, and turns each sub-question into a clean point where an external tool (search, retriever, calculator) can substitute for the model's own recall. Press et al. report that this structured decomposition, with or without a tool, measurably narrows the gap where CoT alone does not.

This is distinct from R1/R2 CoT, R3 Plan-and-Solve, and R4 ReAct on three different axes. CoT emits free-form reasoning prose with no enforced structure; Self-Ask emits a Q/A chain the operator can parse. Plan-and-Solve plans an upfront sequence of actions and then executes them; Self-Ask grows its chain of questions incrementally, where each next sub-question depends on the answer to the previous one. ReAct interleaves Thought / Action / Observation around a tool, and the loop is action-shaped; Self-Ask's loop is question-shaped: sub-questions are the unit, tools are optional, and many Self-Ask runs are pure model recall.

Variants

The pattern has two named members differing in whether sub-questions are answered by the model alone or by an external tool:

  • Vanilla Self-Ask (Press et al., 2022). The same model that produces follow-up questions also produces intermediate answers from its own parametric knowledge. Pure prompting; no external dependencies. Works when the sub-facts are within the model's training data.
  • Self-Ask with Search. Each Intermediate answer: slot is filled by a search-engine call (Google, Bing, Tavily) keyed on the follow-up text. The original paper shows that this lifts accuracy substantially on time-sensitive and long-tail multi-hop questions. LangChain ships this as create_self_ask_with_search_agent, which expects a single tool named Intermediate Answer.

Both share the structural move — Q/A scaffold, named follow-ups, composition step. They differ only in who fills the intermediate-answer slots. A third common configuration — Self-Ask with retrieval — substitutes a K1 Vanilla RAG call for the search engine; treat that as a composition of R6 + K1 rather than a separate variant.
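
Because the variants differ only in who fills the intermediate-answer slot, that slot can be modelled as one interchangeable callable. The sketch below is a minimal illustration under that assumption; the names Answerer, make_vanilla_answerer, make_search_answerer, and the fake_llm / fake_search stand-ins are hypothetical and not part of any library.

```python
from typing import Callable

# An answerer maps one follow-up question to one short intermediate answer.
Answerer = Callable[[str], str]


def make_vanilla_answerer(llm: Callable[[str], str]) -> Answerer:
    """Vanilla Self-Ask: the model answers from its own parametric knowledge."""
    def answer(followup: str) -> str:
        prompt = f"Answer in one short factual sentence.\nQuestion: {followup}\nAnswer:"
        return llm(prompt).strip()
    return answer


def make_search_answerer(search: Callable[[str], str]) -> Answerer:
    """Self-Ask with Search: an answer-returning tool fills the slot instead."""
    def answer(followup: str) -> str:
        return search(followup).strip()
    return answer


# Hypothetical stand-ins so the sketch runs without a model or a network call.
def fake_llm(prompt: str) -> str:
    return " Paris is the capital of France. "


FAKE_INDEX = {"What is the capital of France?": "Paris"}


def fake_search(query: str) -> str:
    return FAKE_INDEX.get(query, "unknown")


vanilla = make_vanilla_answerer(fake_llm)
searched = make_search_answerer(fake_search)
```

A K1 retrieval call would plug in the same way as the search tool, provided it is wrapped so that it returns a short answer rather than a document.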

Applicability

Use Self-Ask when:

  • the question is compositional — two to four hops requiring distinct sub-facts;
  • the model can plausibly know each sub-fact in isolation but consistently misses the combination;
  • you want the decomposition to be visible for audit, debug, or operator inspection;
  • the sub-questions are answerable by clean recall or a single tool call each (search, RAG, calculator), not by exploratory action.

Do not use it when:

  • the question is single-hop — Self-Ask's scaffolding adds tokens with no compositional payoff; use R1 Zero-Shot CoT or even direct prompting;
  • the task is action-shaped (must touch the world: write a file, send a message, query an API in a stateful way) — use R4 ReAct, whose loop is built for tool-driven exploration;
  • the full set of sub-tasks is knowable upfront and they are largely independent — use R3 Plan-and-Solve (or R5 ReWOO for parallelism and token efficiency);
  • the task is open-ended creative work without a "correct" composed answer — use R8 Self-Refine;
  • the sub-question structure cannot be predicted at all and exploration drives the path — use R9 Tree of Thoughts.

Decision Criteria

R6 is right when the question is compositional, the sub-facts are individually retrievable, and you need the decomposition to be visible.

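1. Measure the compositionality gap on your task. Take a labelled sample of multi-hop questions together with their sub-questions. Measure (a) how often the model answers the sub-questions in isolation and (b) how often it answers the compound question end-to-end under direct prompting. As a working estimate, gap = (% of sub-facts the model can answer in isolation) − (% of compound questions it can answer end-to-end). If the gap exceeds ~10 percentage points, Self-Ask's structural move is worth its tokens; confirm by running the same sample through Self-Ask. If the gap is already small, the model is composing fine, so keep R1 CoT.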

2. Count the hops. Self-Ask shines at 2–4 hops. At 1 hop, the scaffold is overhead. Above ~5 hops the Q/A chain bloats and intermediate-answer errors compound; switch to R4 ReAct with explicit state, or R9 Tree of Thoughts if the path branches.

3. Pick a variant by where the sub-facts live. Sub-facts inside the model's training data → Vanilla Self-Ask (no tool). Sub-facts that are time-sensitive, long-tail, or proprietary → Self-Ask with Search (or compose with K1 Vanilla RAG against your corpus). The tool choice is the main lever; the scaffold itself is the same.

4. Cost the chain. Each hop adds one round-trip: a follow-up + an intermediate answer + (optional) a tool call. Plan-and-Solve and ReWOO can be cheaper when the sub-questions are independent and parallelisable; Self-Ask is inherently sequential because hop N+1 depends on hop N's answer. If the hops are genuinely independent, prefer R5 ReWOO for the roughly 5× token efficiency its authors report.

5. Bound the recursion. Self-Ask is a loop disguised as a Q/A scaffold — Are there follow-up questions? Yes / No. A miscalibrated model can say Yes indefinitely. Cap the number of follow-ups (typical: 4–6) via V9 Bounded Execution; force a final answer when the cap is hit.
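
The measurement in step 1 can be sketched as follows. This is a minimal illustration, assuming each labelled record stores per-sub-question correctness (asked in isolation) and end-to-end correctness under direct prompting; the Record type and both function names are hypothetical. It computes the gap two ways: the conditional fraction Press et al. report, and the percentage-point difference used as the working estimate in step 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    sub_correct: List[bool]  # each sub-question, asked in isolation
    compound_correct: bool   # the compound question, direct prompting


def conditional_gap(records: List[Record]) -> float:
    """Fraction of compound questions answered wrong among those whose
    sub-questions were all answered right."""
    eligible = [r for r in records if all(r.sub_correct)]
    if not eligible:
        return 0.0
    return sum(not r.compound_correct for r in eligible) / len(eligible)


def difference_gap_points(records: List[Record]) -> float:
    """Sub-fact accuracy minus compound accuracy, in percentage points."""
    subs = [ok for r in records for ok in r.sub_correct]
    sub_acc = 100 * sum(subs) / len(subs)
    compound_acc = 100 * sum(r.compound_correct for r in records) / len(records)
    return sub_acc - compound_acc


# Made-up labels for illustration only, not measured results.
sample = [
    Record([True, True], True),
    Record([True, True], False),
    Record([True, False], False),
    Record([True, True], False),
]

print(conditional_gap(sample))
print(difference_gap_points(sample))
```

On this toy sample three records have every sub-question right and two of those fail end-to-end, so the conditional gap is 2/3; the difference estimate is 87.5 − 25.0 = 62.5 points, well above the ~10-point threshold.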

Quick test — R6 is the right pattern when:

  • the question is compositional and the hop count is 2–4, and
  • the measured compositionality gap on your task exceeds the scaffold's token cost, and
  • each sub-question can be answered by clean recall or one tool call (not by exploratory action), and
  • you want the decomposition visible for audit.

If the hops are independent and parallelisable, choose R5 ReWOO. If the task is action-shaped or the path is genuinely unknown, choose R4 ReAct. If the question is single-hop, R1 Zero-Shot CoT is enough. If the sub-questions need retrieval against your own corpus rather than the web, compose Self-Ask with K1 Vanilla RAG instead of with a search engine.
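
The routing rules above can be written down as a small function, which is handy for making the choice explicit in design reviews. This is only a sketch of the rules as stated in this section: the function name and parameters are hypothetical, and the cut-off of 5 hops stands in for the approximate "above ~5 hops" guidance.

```python
def choose_pattern(
    hops: int,
    independent_hops: bool = False,
    action_shaped: bool = False,
    path_unknown: bool = False,
    private_corpus: bool = False,
) -> str:
    """Return the pattern suggested by the quick test for a given task shape."""
    if hops <= 1:
        return "R1 Zero-Shot CoT"      # single-hop: the scaffold is overhead
    if action_shaped or path_unknown:
        return "R4 ReAct"              # exploratory, tool-driven loop
    if independent_hops:
        return "R5 ReWOO"              # parallelisable sub-questions
    if hops > 5:
        return "R4 ReAct"              # long chains: errors compound
    if private_corpus:
        return "R6 Self-Ask + K1 Vanilla RAG"
    return "R6 Self-Ask"


print(choose_pattern(hops=3))
print(choose_pattern(hops=3, private_corpus=True))
print(choose_pattern(hops=3, independent_hops=True))
```

The function deliberately omits the measured-gap condition, which has to come from data rather than from the shape of the task.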

Structure

  Compositional question Q
         │
         ▼
  ┌──────────────────────────────────────────────┐
  │ Decomposer (LLM)                              │
  │   "Are follow-up questions needed? Yes."      │
  │   "Follow up: <sub-question 1>"               │
  └──────────────────────────────────────────────┘
         │
         ▼
  ┌──────────────────────────────────────────────┐
  │ Sub-question answerer                         │
  │   model recall    (Vanilla)                   │
  │   search engine   (Self-Ask with Search)      │
  │   K1 retriever    (Self-Ask + RAG)            │
  │   → "Intermediate answer: <a₁>"               │
  └──────────────────────────────────────────────┘
         │
         ▼
  ┌──── more follow-ups? ────┐
  │  yes → loop (bounded V9) │
  │  no  ↓                   │
  └──────────────────────────┘
         │
         ▼
  Composer (LLM) ──▶ "So the final answer is: <A>"

Participants

Participant | Owns | Input → Output | Must not
Decomposer (LLM) | producing the next follow-up question given the original question and the intermediate answers so far | Q + (Q₁, a₁) … (Qₖ, aₖ) → next sub-question Qₖ₊₁ or terminate signal | answer its own follow-up in the same step; the structural value is naming the sub-question before answering it. Conflating the two collapses Self-Ask back into CoT.
Sub-question answerer | producing the intermediate answer to one sub-question | Qₖ → aₖ | be the same call as the Decomposer; even when the same model serves both roles, the prompt must shift so the model is only answering Qₖ, not extending the chain.
Tool (search / retriever / calculator) (optional) | sourcing the sub-fact from outside the model | Qₖ → factual span | be invoked when the answer is already in the model's parametric knowledge with high confidence; calling out for every hop on a single-hop-knowable question wastes budget.
Termination check | deciding when no more follow-ups are needed | full Q/A history → continue / stop | hand control back to the Decomposer indefinitely; this is where V9 Bounded Execution caps the loop.
Composer (LLM) | producing the final answer from the intermediate answers | Q + all (Qᵢ, aᵢ) → A | reopen sub-questions or add unsupported claims; its job is composition, not re-decomposition.

Five narrow responsibilities. The pattern's reliability comes from the Decomposer / answerer separation: when the same call both grows the chain and fills it in, the model takes shortcuts — guessing the composed answer before all sub-facts are surfaced. Self-Ask's scaffold (Follow up: / Intermediate answer:) is the mechanism that enforces the separation even when one model plays both roles.

Collaborations

The Decomposer receives the compositional question Q and emits the first follow-up Q₁ under the scaffold Are follow-up questions needed? Yes. Follow up: …. The Sub-question answerer fills the corresponding Intermediate answer: slot, either by the model's own recall (Vanilla variant), by an external search engine (Self-Ask with Search), or by a K1 retrieval call (Self-Ask + RAG). Control returns to the Decomposer, which inspects Q together with the accumulated (Qᵢ, aᵢ) pairs and emits the next follow-up or signals termination by switching to So the final answer is:. The Termination check enforces a hard cap (typically 4–6 hops, via V9) so a miscalibrated Decomposer cannot loop forever. When termination fires, the Composer reads Q and the full sub-Q/A trace and produces the final answer A. The trace itself is the audit artefact: every hop is named, inspectable, and individually re-runnable.

Consequences

Benefits

  • Measurably narrows the compositionality gap that scale and CoT alone do not close (Press et al., 2022).
  • Sub-questions and intermediate answers are visible — operators can inspect, audit, and re-run any single hop.
  • Each sub-question is a clean injection point for a tool, a retriever, or a fact-checker; the scaffold is the canonical pattern for adding search to a multi-hop chain.
  • The structure is model-agnostic and tool-agnostic — works with any capable generalist and any "give me the fact for this question" tool.

Costs

  • Token cost grows with the number of hops — each hop appends to the accumulated context, growing the KV cache (mechanism 3) so each subsequent LLM call attends over a longer prefix at O(seq_len²) cost (mechanism 2). The growth is super-linear, not linear, once context is substantial. Self-Ask with Search partially mitigates this: the tool returns a compact answer that replaces a long retrieved document.
  • Inherently sequential — hop N+1 depends on hop N's answer; cannot be parallelised the way R5 ReWOO can.
  • Adds output structure the consumer must parse; downstream code must extract the final answer from the scaffold reliably.
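
A quick piece of arithmetic makes the token-cost growth concrete. The sketch below assumes every call re-sends the whole running prompt (a fixed exemplar prefix plus the hops accumulated so far) and ignores any prefix-cache discount; the function names and the token counts are illustrative assumptions, not measurements.

```python
from typing import List


def prompt_tokens_per_call(prefix_tokens: int, tokens_per_hop: int, hops: int) -> List[int]:
    """Input tokens for each call: `hops` Decomposer calls plus the final
    call that emits the answer, each seeing all earlier hops."""
    return [prefix_tokens + k * tokens_per_hop for k in range(hops + 1)]


def total_input_tokens(prefix_tokens: int, tokens_per_hop: int, hops: int) -> int:
    """Sum over all calls; equals (hops + 1) * prefix + per_hop * hops * (hops + 1) / 2."""
    return sum(prompt_tokens_per_call(prefix_tokens, tokens_per_hop, hops))


# Illustrative numbers: a 1200-token exemplar prefix and ~60 tokens per hop.
print(prompt_tokens_per_call(1200, 60, 4))  # [1200, 1260, 1320, 1380, 1440]
print(total_input_tokens(1200, 60, 4))      # 6600
```

The hop-dependent part of the total is 60 × (0 + 1 + 2 + 3 + 4) = 600 tokens at four hops and 60 × 36 = 2160 at eight, so doubling the hop count more than triples it; the fixed prefix dominates only while the chain is short.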

Risks and failure modes

  • Wrong decomposition. If the first follow-up names the wrong sub-fact, every later hop inherits the error. The Composer then produces a fluent answer to the wrong question.
  • Intermediate-answer hallucination. In the Vanilla variant, the same model that decomposed the question also fills in its own intermediate answers — and may hallucinate them with the same confidence as the original wrong answer. Self-Ask narrows the gap; it does not eliminate it.
  • Unbounded recursion. A miscalibrated Decomposer can keep saying Yes and growing the chain. Without V9 Bounded Execution, easy questions can spin out into ten-hop traces.
  • Format drift. The scaffold depends on exact tokens (Follow up:, Intermediate answer:, So the final answer is:). Stronger models sometimes paraphrase; the parser must tolerate small variation or the pipeline silently breaks.
  • Tool mismatch. Self-Ask with Search assumes the search engine returns short factual answers. Routing the follow-up to a tool that returns documents (rather than answers) requires an extra extraction step or the scaffold collapses.
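
One way to guard against format drift is a parser that accepts small variations in case, hyphenation, and spacing instead of matching the scaffold tokens byte for byte. The sketch below is one possible approach using Python's re module; parse_step and the exact set of tolerated variations are assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# Accepts "Follow up:", "Follow-up:", "follow up question:", etc.
_FOLLOW_UP = re.compile(
    r"^\s*follow[\s-]*up(?:\s+question)?\s*:[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)
# Accepts "So the final answer is:" with or without the leading "So".
_FINAL = re.compile(
    r"^\s*(?:so\s+)?the\s+final\s+answer\s+is\s*:[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_step(text: str) -> Tuple[str, Optional[str]]:
    """Classify one model step as ('final', answer), ('follow_up', question),
    or ('unparsed', None) so the caller can fail loudly instead of silently."""
    final = _FINAL.search(text)
    if final:
        return "final", final.group(1).strip()
    follow = _FOLLOW_UP.search(text)
    if follow:
        return "follow_up", follow.group(1).strip()
    return "unparsed", None


print(parse_step("Yes.\nFollow-up: Who directed Jaws?"))
print(parse_step("so the final answer is:  Steven Spielberg "))
print(parse_step("Let me think about this."))
```

Returning an explicit "unparsed" state matters more than the exact regexes: it turns a silent pipeline break into a countable event that can be logged or retried.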

Implementation Notes

  • The exemplars in the prompt do the heavy lifting. Use Press et al.'s original four-exemplar template as a starting point; the scaffolding tokens must appear literally in the exemplars or the model will paraphrase them. The exemplar block is static across all queries in a domain, which makes it the canonical case for provider prefix caching (mechanism 5): place it at the top of the setup, above the variable question, so the stable prefix qualifies for a cache hit. Under Anthropic's caching rules, for example, a stable prefix of 1024+ tokens is read at ~10% of the normal input-token cost.
  • Use Few-Shot CoT (R2) style exemplars showing the full Q/A scaffold including the Are follow-up questions needed? opener — Zero-Shot Self-Ask exists but is noticeably less reliable than the few-shot version.
  • For the Self-Ask with Search variant, choose a tool that returns answers, not documents: Tavily's TavilyAnswer, a SERP API that exposes Google's answer box, or a small wrapper that summarises top results. LangChain's create_self_ask_with_search_agent requires the tool to be named exactly Intermediate Answer.
  • Cap the number of follow-ups (typical 4–6) via V9 Bounded Execution; when the cap is hit, force the model into the Composer role with an explicit So the final answer is: continuation.
  • The Composer can be the same model and session as the Decomposer; the scaffold itself enforces the role switch. There is no need for a separate model unless the Composer needs domain knowledge the Decomposer lacks.
  • When sub-facts live in your own corpus rather than on the web, compose with K1 Vanilla RAG at each hop — Self-Ask becomes the outer control loop around per-hop retrieval.
  • Log the (Qᵢ, aᵢ) trace via V14 Trajectory Logging. The structured trace is far more useful than CoT prose for debugging compositional failures.
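
The first two notes can be sketched as a prompt builder that keeps the static exemplar block as a byte-identical prefix and appends only the variable part. The exemplar below is an illustration written in the style of the Self-Ask scaffold, not a verbatim copy of the Press et al. template, and the function names are hypothetical.

```python
from typing import List, Tuple

INSTRUCTION = "Continue the same format for the new question below."

# Illustrative exemplar; a real setup would carry the full exemplar set.
EXEMPLARS = [
    "Question: Who lived longer, Muhammad Ali or Alan Turing?\n"
    "Are follow-up questions needed? Yes.\n"
    "Follow up: How old was Muhammad Ali when he died?\n"
    "Intermediate answer: Muhammad Ali was 74 years old when he died.\n"
    "Follow up: How old was Alan Turing when he died?\n"
    "Intermediate answer: Alan Turing was 41 years old when he died.\n"
    "So the final answer is: Muhammad Ali",
]


def build_static_prefix(exemplars: List[str]) -> str:
    """Everything that is identical across queries; this is the cacheable part."""
    return "\n\n".join(exemplars) + "\n\n" + INSTRUCTION + "\n\n"


def build_prompt(prefix: str, question: str, trace: List[Tuple[str, str]]) -> str:
    """Static prefix first, then the question and the accumulated Q/A history."""
    opener = "Are follow-up questions needed?" + (" Yes." if trace else "")
    lines = [f"Question: {question}", opener]
    for followup, answer in trace:
        lines.append(f"Follow up: {followup}")
        lines.append(f"Intermediate answer: {answer}")
    return prefix + "\n".join(lines)


PREFIX = build_static_prefix(EXEMPLARS)
first = build_prompt(PREFIX, "Who directed the film that won Best Picture in 1998?", [])
later = build_prompt(
    PREFIX,
    "Who directed the film that won Best Picture in 1998?",
    [("Which film won Best Picture in 1998?", "Titanic")],
)
```

Both prompts start with exactly the same PREFIX string, which is the property a provider-side prefix cache keys on; anything query-specific placed above the exemplars would defeat it.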

Implementation Sketch

LLM = configured session (model + setup + per-call prompt); code = wiring.

Composition: R6 chains a single Self-Ask session over a bounded loop. It composes with R2 Few-Shot CoT for the exemplar scaffold, with K1 Vanilla RAG or an external search tool to fill Intermediate answer: slots (the Self-Ask-with-Search variant), with V9 Bounded Execution to cap the follow-up loop, and with V14 Trajectory Logging to capture the per-hop trace. Signal-layer setup is S6 Output Template — the scaffold tokens are an output contract.

The chain:

# | Step | Kind | Draws on
1 | Build prompt P with Self-Ask exemplars + the question Q | code | R2, S6
2 | Decomposer emits Follow up: Qₖ or So the final answer is: … | LLM | Self-Ask session
3 | Branch: if final-answer prefix detected, jump to step 6 | code |
4 | Answer Qₖ by model recall, tool call, or K1 retrieval | LLM (or code) | K1 / search tool
5 | Append Intermediate answer: aₖ to the running prompt; check bound; loop to 2 | code | V9
6 | Extract the final answer from the So the final answer is: line | code |
7 | Log the full (Qᵢ, aᵢ) trace | code | V14

Skeleton — the wiring only; each # LLM line is a configured session:

def self_ask(question, max_hops=6):
    prompt = build_with_exemplars(question)                  # code  - R2 exemplars, S6 scaffold
    trace = []                                               # code  - (Qₖ, aₖ) pairs for V14
    for hop in range(max_hops):                              # code  - V9 bound
        step = SelfAskSession(prompt)                        # LLM   - Decomposer or Composer
        if "So the final answer is:" in step:
            return extract_final(step), log_trace(trace)     # code
        followup = parse_followup(step)                      # code
        answer = tool(followup) if use_search else SelfAskSession(answer_only_prompt(followup))
                                                             # LLM or code - sub-question answerer
        trace.append((followup, answer))                     # code
        # Append the follow-up as well as its answer, so the next call sees the whole chain.
        prompt += f"\nFollow up: {followup}\nIntermediate answer: {answer}\n"   # code
    return force_compose(prompt), log_trace(trace)           # LLM (forced Composer call)
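
For readers who want to run the loop end to end, here is a self-contained sketch with a scripted stand-in for the model and a dictionary stand-in for the tool. It is an illustration under stated assumptions: the exemplar block is omitted for brevity, scripted_llm and FACTS are hypothetical stubs, and a real Self-Ask session would replace them.

```python
import re
from typing import Callable, List, Tuple

FINAL = "So the final answer is:"
FOLLOW_UP = re.compile(r"Follow up:\s*(.+)")


def self_ask(
    question: str,
    llm: Callable[[str], str],
    answerer: Callable[[str], str],
    max_hops: int = 6,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the bounded follow-up loop; return (final answer, Q/A trace)."""
    prompt = f"Question: {question}\n"
    trace: List[Tuple[str, str]] = []
    for _ in range(max_hops):                      # V9 bound
        step = llm(prompt)                         # Decomposer or Composer
        if FINAL in step:
            return step.split(FINAL, 1)[1].strip(), trace
        match = FOLLOW_UP.search(step)
        if match is None:
            break                                  # scaffold not followed
        followup = match.group(1).strip()
        answer = answerer(followup)                # sub-question answerer
        trace.append((followup, answer))
        prompt += f"Follow up: {followup}\nIntermediate answer: {answer}\n"
    # Cap hit or parse failure: force the Composer role.
    return llm(prompt + FINAL).strip(), trace


# Scripted stand-in: picks its reply from how many hops are already answered.
def scripted_llm(prompt: str) -> str:
    if prompt.rstrip().endswith(FINAL):
        return " Muhammad Ali"
    script = [
        "Follow up: How old was Muhammad Ali when he died?",
        "Follow up: How old was Alan Turing when he died?",
        "So the final answer is: Muhammad Ali",
    ]
    return script[min(prompt.count("Intermediate answer:"), len(script) - 1)]


FACTS = {
    "How old was Muhammad Ali when he died?": "74",
    "How old was Alan Turing when he died?": "41",
}

final, trace = self_ask(
    "Who lived longer, Muhammad Ali or Alan Turing?",
    scripted_llm,
    lambda q: FACTS.get(q, "unknown"),
)
print(final, trace)
```

Passing max_hops=1 exercises the forced-compose path: the loop stops after one hop and the final call is made with the So the final answer is: prefix already appended.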

The LLM sessions. Each LLM step must be set up before its first call.

Session | Model | Setup (loaded once, before first call) | Per-call prompt wraps
Self-Ask | capable generalist; same model serves Decomposer, Sub-question answerer (Vanilla variant), and Composer, with the scaffold enforcing the role switch | role ("you answer compositional questions by asking follow-ups"); the four canonical exemplars from Press et al. showing the full Are follow-up questions needed? / Follow up: / Intermediate answer: / So the final answer is: scaffold; output contract (S6): must emit one of those four prefix tokens | the question Q, then progressively the accumulated Follow up: / Intermediate answer: history
Sub-question answerer (only if separated from Self-Ask session) | small fast generalist, or a search/retrieval tool (not an LLM at all in the Self-Ask-with-Search variant) | role: "answer the following short question with one factual sentence"; output contract: one sentence, no scaffolding | the single sub-question Qₖ

Concretely, for the Self-Ask session the setup loaded once is: the four Press et al. exemplars (each showing a compositional question worked through 2–3 follow-ups to a So the final answer is: line), plus the instruction "Continue the same format for the new question below." The per-call prompt then carries the question Q and any accumulated (Qᵢ, aᵢ) pairs.

Specialist-model note. None — Self-Ask is pure prompting; any capable generalist suffices. The build dependency is the exemplar set, not a fine-tuned model: the four canonical exemplars from Press et al. (or domain-specific replacements) are the prompt artifact that does the heavy lifting. The Self-Ask-with-Search variant adds a build dependency on an answer-returning search tool (e.g., Tavily, Bing answer box, Google CSE with answer extraction) — not a documents-returning retriever. If your tool returns documents, wrap it with a one-line summariser or compose with K1 Vanilla RAG instead.

Open-Source Implementations

Known Uses

  • Multi-hop QA benchmarks: Self-Ask is a standard baseline alongside CoT and ReAct on HotpotQA, 2WikiMultiHopQA, MuSiQue, and the two benchmarks Press et al. introduced with the paper, Bamboogle and Compositional Celebrities.
  • Search-augmented assistants — Self-Ask with Search is one of the canonical architectures behind early answer-engine prototypes; the Follow up: / Intermediate answer: scaffold is visible (sometimes literally) in trace logs from systems that decompose a user query into web lookups before composing.
  • Enterprise RAG over compositional questions — Self-Ask + K1 is a common pattern when a single retrieval call cannot return all the sub-facts a compound question needs, but each sub-question retrieves cleanly on its own.
  • LangChain production agents — the create_self_ask_with_search_agent constructor is widely used as the default scaffold for multi-hop factual QA with a single search tool.

Related Patterns

  • Distinct from R1 Zero-Shot CoT and R2 Few-Shot CoT — CoT emits free-form reasoning prose; Self-Ask emits a structured Q/A scaffold (Follow up: / Intermediate answer:) that names each sub-question explicitly. Self-Ask narrows the compositionality gap CoT alone leaves open.
  • Distinct from R3 Plan-and-Solve: R3 plans a sequence of actions upfront before executing any of them; R6 grows a chain of questions incrementally, where each next sub-question depends on the answer to the previous one. R3 is action-shaped; R6 is question-shaped.
  • Distinct from R4 ReAct — R4's loop is Thought / Action / Observation around a tool, with the loop structure built for exploratory action; R6's loop is Follow up / Intermediate answer around a sub-question, with tools optional. Many Self-Ask runs are pure recall with no tool at all; ReAct without tools is not ReAct.
  • Distinct from R5 ReWOO: R5 plans all sub-tool-calls upfront with placeholder variables and executes them in parallel; R6 is inherently sequential because hop N+1 depends on hop N's answer. If the sub-questions are independent, R5 wins on token efficiency (5×) and latency.
  • Composes with K1 Vanilla RAG — each Intermediate answer: slot is a clean injection point for a retrieval call against the operator's corpus. Self-Ask + K1 is the canonical pattern for compositional questions over a private knowledge base.
  • Composes with R2 Few-Shot CoT — the Self-Ask exemplars are a Few-Shot CoT prompt with a stricter output contract. Zero-Shot Self-Ask exists but is noticeably less reliable than the few-shot version.
  • Pairs with R4 ReAct at scale — when each sub-question itself requires multi-step tool use rather than a single lookup, the sub-question slot becomes a small ReAct sub-loop. The outer pattern is still R6 (question decomposition); the inner pattern is R4 (action loop).
  • Pairs with V9 Bounded Execution — the follow-up loop must be capped or a miscalibrated Decomposer will recurse on easy questions indefinitely.
  • Pairs with V14 Trajectory Logging — the structured (Qᵢ, aᵢ) trace is a high-value audit artefact; log it.
  • Pairs with S6 Output Template — the Follow up: / Intermediate answer: / So the final answer is: scaffold is a Signal-layer output contract that the Decomposer must honour exactly for the parser to work.

Sources

  • Press et al. (2022) — "Measuring and Narrowing the Compositionality Gap in Language Models" (arXiv 2210.03350; Findings of EMNLP 2023). The canonical reference; introduces both the compositionality-gap measurement and the Self-Ask method.
  • ofirpress/self-ask GitHub repository — code, data, prompts, and the Compositional Celebrities + Bamboogle benchmarks (github.com/ofirpress/self-ask).
  • LangChain documentation — "Self-ask with search" agent type and the create_self_ask_with_search_agent constructor (the production reference implementation).
  • Wei et al. (2022) — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv 2201.11903). The CoT baseline against which Self-Ask is measured.
  • Yao et al. (2022) — "ReAct: Synergizing Reasoning and Acting in Language Models" (arXiv 2210.03629). The sibling action-loop pattern.