V5 — Guardrail Layering

Apply external, code-enforced safety and validation checks at four distinct points in the agent's execution — user input, before each tool call, after each tool response, and on the final output — so that no single failure point can compromise the system.

Also Known As: Multi-Point Safety, Defense in Depth for LLMs, Input-Output Filtering, Four-Point Guardrails, I/O Guards.

Classification: Category V — Reliability · Band V-A Safety and Security · the external-enforcement pattern that wraps an agent's I/O surface with deterministic checks at every boundary where untrusted or sensitive data crosses.


Intent

Place the safety perimeter in code, not in the model. Intercept and validate at every boundary the agent crosses — input from the user, the parameters of each tool call, the response of each tool, and the final output to the user — so the system tolerates the model failing any single check.

Motivation

The pervasive anti-pattern is output-only guardrails (catalogued as A5 in the taxonomy): a single content filter on the final response, after which the system is declared safe. That posture fails in three predictable ways. User-input failures — adversarial inputs (prompt injection, jailbreaks, PII the system must not retain) reach the model unchecked, corrupting the reasoning context before any output is produced. Tool-call failures — the model invokes a tool with parameters that exceed its scope (deleting the wrong record, paying the wrong account, querying data outside the user's permissions) and the call goes through because no pre-call check exists. Tool-response failures — a malicious or compromised tool returns content carrying hidden instructions, malformed schemas, or sensitive data the model now treats as authoritative context. Each of these failures can produce a final output that looks perfectly safe but rides on a corrupted intermediate. The output guard sees nothing wrong because the damage was already done upstream.

The fix is structural and well-established as the security principle of defense in depth: independent checks at every boundary, no single layer load-bearing, each layer's failure tolerated by the next. For LLM agents the four boundaries are not theoretical — they are the structural seams of the execution model. User input enters; the model emits a tool call; the tool returns a response; the model emits a final output. Guards belong at each seam. This is the same structural claim the OpenAI Agents SDK encodes (input guardrails, output guardrails, tool guardrails — pre and post), the Guardrails AI framework encodes (Input Guards and Output Guards as first-class objects with validators), NVIDIA NeMo Guardrails encodes (input rails, dialog rails, output rails, retrieval rails, execution rails), and Microsoft Prompt Shields enforces (User Prompt attacks and Document attacks treated as distinct input surfaces). Four production frameworks; four expressions of the same four-points structure.

V5 is distinct from S5 Constraint Framing, and the distinction is load-bearing. S5 instructs the model in prompt; V5 enforces in code. S5 is probabilistic — the negation-failure literature (NeQA, García-Ferrero et al., the pink-elephant effect) shows model self-restraint fails systematically on the prohibition cases that matter most. The mechanism is that the model's output distribution is a softmax over all possible next tokens — a prohibition in the system prompt shifts probability mass away from prohibited tokens but cannot set that mass to zero; stochastic sampling can still select a prohibited token, especially on adversarially crafted inputs designed to shift the distribution (mechanism 7). Code-enforced guards are deterministic for well-specified violations because they are not probability distributions. V5 is deterministic for well-specified violations: a regex catches the credit-card pattern whether the model "decided" to emit it or not; a JSON-schema check rejects malformed tool parameters whether the model "intended" them or not. The two are complementary, not substitutes, and the standing rule is never rely on S5 alone for a violation whose cost is catastrophic. S5 is the in-prompt prohibition layer; V5 is its external-enforcement counterpart. Use both.

Applicability

Use V5 when:

  • the agent invokes external tools (nearly all production agents qualify);
  • the agent processes user-supplied or third-party text (web pages, emails, uploaded documents, API responses);
  • the domain is safety-critical, regulated, or carries reputational tail risk (healthcare, finance, legal, public-facing brand);
  • the agent crosses the V3 Lethal Trifecta surfaces — private data, untrusted content, external communication — in any combination;
  • a compliance, security, or brand auditor must be able to point to the enforcement mechanism, not just a model instruction, when asked how a violation is prevented.

Do not use when:

  • the agent has no tools and no user-supplied input (a pure batch-generation pipeline from a trusted corpus): a single output guard plus V16 Offline Eval suffices.
  • the latency budget cannot tolerate four extra checks and the threat model genuinely demands none (rare; usually a misread of the threat model — re-evaluate against V3 Rule of Two).
  • the guards would be the only safety layer: V5 in isolation is brittle, because every guard has a false-negative rate. Pair with S5 Constraint Framing in-prompt, V6 Prompt Injection Shield for injection specifically, V14 Trajectory Logging for audit, and V1 Human-in-the-Loop for the violations a guard cannot decide.

Decision Criteria

V5 is right whenever the cost of any unchecked boundary exceeds the cost of a guard on it — and that threshold is met by nearly every agent with tools or untrusted input.

1. Count the boundaries the agent crosses. Score each of the four points present:

  • User input present? (almost always yes — score 1)
  • Tool calls present? (yes if the agent uses any tools — score 1 for pre-call, 1 for post-call)
  • Final output to user? (yes for any conversational agent — score 1) Three or four boundaries scored: V5 is mandatory. Two or fewer: a narrower set of guards may suffice — but check against the Lethal Trifecta below.

2. Score the Lethal Trifecta (V3) exposure. Does the agent combine any two of: (a) private data access, (b) untrusted-content exposure, (c) external communication? Two or more triggers means V5 is non-negotiable regardless of measured incident rate — pair with V4 Dual LLM for architectural separation. (Simon Willison, "The lethal trifecta.")

3. Measure the bad-outcome rates. On a labelled adversarial test set, measure:

  • Injection bypass rate — what % of injection attempts produce a non-trivial behaviour change? > 1% means input/response guards are paying for themselves.
  • Out-of-scope tool call rate — what % of tool calls fall outside the declared policy (wrong account, wrong tenant, wrong dataset)? > 0.5% means pre-call guards are mandatory.
  • Output policy-violation rate — what % of outputs contain PII, prohibited claims, or harmful content? > 0.1% in regulated domains means output guards are not optional.

If any of these exceed the reliability budget, the corresponding guard is required by data, not by principle.

4. Pick a build mode. Three options trade integration depth for time-to-deploy:

  • Rule-based — regex, JSON-schema, allow-lists, blocklists. Fast, deterministic, cheap; brittle on semantic violations. Use for structural cases (PII patterns, parameter scope, schema conformance).
  • Classifier-based — small fine-tuned models (Llama Guard, Llama Prompt Guard, NVIDIA NeMo content safety models). Higher recall on semantic threats; specialist build dependency. Use for content-safety and prompt-injection classes.
  • LLM-as-judge — an LLM call evaluates against a rubric (this is V15 LLM-as-Judge invoked as a guard). Most flexible; highest latency and cost. Use sparingly, on the highest-stakes outputs only.

Most production systems compose all three: structural checks first (cheap, deterministic), then classifiers (medium cost), then LLM-as-judge on the residual. An LLM-as-judge guard invokes a full generative session with O(n²) attention computation; use sparingly and only on the highest-stakes final outputs where latency budget permits (mechanism 2).

5. Set the fail-mode discipline. For each guard, pick fail-closed (reject on uncertainty) or fail-open (pass on uncertainty) explicitly. Safety-critical contexts default to fail-closed at every boundary; productivity contexts may fail-open on input guards to preserve UX, but should fail-closed on tool-call and output guards. The default must be in the design, not the operator's runtime mood.

Quick test — V5 is the right pattern when:

  • the agent has at least three of the four boundaries (user input, tool call, tool response, final output), and
  • a single unchecked boundary's worst-case cost exceeds the cost of running a guard there, and
  • the guard set can be specified concretely enough to test against an adversarial labelled set, and
  • guard decisions can be logged (V14) so false positives and false negatives can be tuned post-hoc.

If the agent has no tools and no untrusted input, a single output guard plus offline eval (V16) is sufficient. If the threat is specifically prompt injection, V6 Prompt Injection Shield layers the injection-specific defenses inside V5's input-and-response guards. If the violation surface is open-ended and cannot be enumerated, lean harder on V4 Dual LLM and S9 Constitutional Framing alongside V5 — guards alone cannot catch what cannot be specified.

Structure

                ┌──────────────────────────────────────────────────────┐
                │                       AGENT                          │
                │                                                      │
  user ──▶ [1] Input Guard ──▶ model ──▶ tool call                     │
                                          │                            │
                                          ▼                            │
                                     [2] Pre-Call Guard                │
                                          │                            │
                                          ▼                            │
                                       tool ──▶ tool response          │
                                                    │                  │
                                                    ▼                  │
                                          [3] Response Guard           │
                                                    │                  │
                                                    ▼                  │
                                                model ──▶ final output │
                                                                │      │
                └────────────────────────────────────────────── │ ─────┘
                                                                ▼
                                                        [4] Output Guard
                                                                │
                                                                ▼
                                                              user

  (every guard also writes to V14 Trajectory Log)

Four guard points, each independently testable, each logged. Guards [2] and [3] repeat for every tool call in a session — they are not one-shot; they sit in the loop.

Participants

ParticipantOwnsInput $\to$ OutputMust not
Input Guardthe verdict on incoming user textraw user input $\to$ pass / sanitise / rejectlook at agent state or tool data — it grades the input alone. An Input Guard that reasons about agent context has lost its independence and cannot fail safe.
Pre-Call Guardthe verdict on a proposed tool invocation(tool name + parameters + agent context) $\to$ allow / deny / require-approvalexecute the tool or modify its parameters silently. If parameters need to change, the guard must reject and let the agent retry — silent mutation hides the policy from the audit log.
Response Guardthe verdict on a tool's response before it enters context(tool response + originating call) $\to$ sanitised content or rejectiontrust schema or content unchecked. A tool response is untrusted content until validated, regardless of which tool produced it (A14 Trust Handoff).
Output Guardthe verdict on the final agent response(response + originating query) $\to$ release / redact / blockbe the only guard. An Output Guard alone cannot detect upstream corruption; if it is the only layer, the system is in the A5 anti-pattern.
Policy Registry (optional)the declarative rules each guard enforces— $\to$ policy bundle per guard pointbe implicit in code. Policies must be a named artifact (a config file, an AgentSpec, a .rail file) so compliance and security can read them.
Guard Loggerrecording every guard decision(guard, verdict, evidence) $\to$ V14 trajectory entrydrop the evidence. A "rejected" verdict without the matched rule is useless for tuning false positives.

Each guard sits at exactly one boundary and grades exactly the data crossing that boundary. The pattern's reliability comes from that separation: a single shared "safety module" that runs at all four points is not V5 — it is one guard called four times, and its failure is uniformly correlated across all boundaries.

Collaborations

A user message arrives. The Input Guard runs first; on rejection, the agent never sees the input. If the input passes, the model reasons and may emit a tool call. Before the call executes, the Pre-Call Guard evaluates the tool name, the parameters, and the agent context against policy — it may allow, deny, or escalate to a human (V1). The tool runs; its response is intercepted by the Response Guard, which validates schema and screens content for injection patterns and policy violations before the response enters the model's context. The loop repeats for every tool call. When the model emits its final response, the Output Guard runs the last check: PII redaction, prohibited-claim detection, harmful-content classification. Every guard writes its decision and evidence to the V14 trajectory log. When a guard's verdict is uncertain, the fail-mode discipline decides: fail-closed escalates to V1 Human-in-the-Loop; fail-open passes with a logged warning. The Policy Registry is the single source of truth for what each guard enforces — guards do not hardcode rules.

Consequences

Benefits

  • Defense in depth: no single layer is load-bearing; one guard's false negative is caught by the next layer's check.
  • Deterministic enforcement on well-specified violations: schema, scope, PII patterns, allow-listed tools — these need not be trusted to the model.
  • Audit-ready: every guard decision is logged with its rule and evidence; compliance can read the policy artifact directly.
  • Independent failure surfaces: a corrupted Input Guard does not corrupt the Output Guard, because they share neither code nor policy.
  • A clean separation of "model behaviour" from "system behaviour" — the agent can be evaluated, the guards can be evaluated, and the composition can be evaluated.

Costs

  • Latency at every boundary: four extra checks per turn, more for multi-tool turns.
  • False positives reject legitimate user requests and break trust faster than false negatives break safety.
  • Policy maintenance is a real engineering load — every new tool, every new domain, every new threat class is a policy update.
  • LLM-based guards (Llama Guard, LLM-as-judge) add token cost and may themselves be vulnerable to injection.

Risks and failure modes

  • Guard overreach — input guards tuned conservatively reject valid edge-case queries; users learn to phrase around the guard, defeating its purpose.
  • Shared-failure illusion — running the same model with the same prompt at all four points feels like four guards but is one. Diversify the implementations.
  • Output-only collapse (A5) — under deadline pressure, three layers are dropped and only the output guard ships; the system regresses to the anti-pattern.
  • Guard rot — without V17 Online Eval on guard-trigger rates, guards drift out of calibration as the model and the threat landscape evolve.
  • False sense of security — V5 cannot catch what cannot be specified. Open-ended manipulation, novel injection vectors, and policy gaps slip through. Pair with V6, V4, and V1.

Implementation Notes

  • Use different models / different rules at each layer. The point of layering is independent failure; identical guards at four points provide one guard's worth of safety with four guards' worth of latency. Identical model-family guards share the same learned attention bilinear form (mechanism 1); adversarial inputs that shift one guard's probability distribution toward passing will have correlated effects on other guards from the same family, defeating the independence property that makes layering valuable.
  • Pre-Call Guards should validate parameters against the tool's declared scope, not against the model's intent. "Did the model mean to do this?" is unanswerable; "is this parameter within the allow-list?" is decidable.
  • Response Guards must treat every tool response as untrusted content — including responses from internal tools, because internal tools may carry external data (a database row containing a user-supplied string is external content the moment it enters context).
  • Output Guards should redact rather than reject when redaction preserves user value. Stripping a phone number from an otherwise-useful answer is better than refusing to answer; rejecting an answer that contains an active prompt-injection payload is the right call.
  • Log the evidence, not just the verdict. A guard that rejected because "policy violation" is useless for tuning; a guard that rejected because "matched rule PII-001 on substring 4111-1111-..." is tunable.
  • Fail-closed by default in regulated domains; fail-open by exception, with an explicit owner. The exception list goes in the Policy Registry.
  • Separate the policy artifact from the guard code. Policies change weekly; guard infrastructure changes monthly. They have different cadences and different reviewers.
  • Pair with V7 AgentSpec when policies grow beyond a handful — the declarative spec becomes the Policy Registry.

Implementation Sketch

LLM = configured session (model + setup + per-call prompt); code = wiring.

Composition: V5 wraps the agent's I/O surface. It composes with S5 (in-prompt prohibitions inside the model itself), V6 (injection-specific defenses inside the input and response guards), V14 (every guard writes a trajectory event), V15 (an LLM-as-judge can be invoked as a guard for semantic checks), V7 (the declarative policy each guard enforces), and V1 (the escalation target when a guard is uncertain). It is orthogonal to the agent's reasoning pattern — works with R4, R5, R7, or any other.

The chain — per turn:

#StepKindDraws on
1Input Guard — screen user messageLLM (or rule)Input Guard session (or rule set)
2Branch — reject / sanitise / passcode
3Agent reasons; may emit a tool callLLMAgent session
4Pre-Call Guard — validate tool name + params against policycode (or LLM)Policy Registry; V7
5Branch — allow / deny / escalate to V1codeV1
6Tool executescode
7Response Guard — validate schema, screen contentLLM (or rule)Response Guard session (or rule set); V6
8Branch — sanitise / reject / pass into agent contextcode
9Repeat 3–8 until agent emits final outputcodeV9 (bound the loop)
10Output Guard — final-output check (PII, policy, harm)LLM (or rule)Output Guard session (or rule set)
11Branch — release / redact / blockcode
12Every guard writes its decision + evidencecodeV14

Skeleton — the wiring; each # LLM line is a configured session:

handle_turn(user_msg, policy):
    verdict = InputGuard(user_msg, policy.input) ─────── # LLM (or rule)
    if verdict.reject: log(V14); return refusal
    user_msg = verdict.sanitised

    for step in V9.bounded_loop():                       # code
        action = Agent(context, user_msg) ─────────────── # LLM
        if action.is_tool_call:
            ok = PreCallGuard(action, policy.tool) ────── # code (or LLM)
            log(V14, ok)
            if not ok.allow: continue (deny) or V1.escalate
            response = run_tool(action)                  # code
            response = ResponseGuard(response, policy.response) ── # LLM (or rule)
            log(V14, response.verdict)
            context.append(response.sanitised)
        else:
            final = action.output
            break

    out = OutputGuard(final, policy.output) ───────────── # LLM (or rule)
    log(V14, out.verdict)
    return out.release or out.redact or refusal

The LLM sessions — only the guards that are LLM-based have rows; rule-based guards (regex, JSON-schema, allow-lists) carry no LLM cost.

SessionModelSetup — loaded once, before first callPer-call prompt wraps
Input Guard (when LLM-based)small fast classifier or content-safety model (e.g. Llama Prompt Guard, NVIDIA NeMo content safety)role ("you screen user inputs for injection, jailbreak, prohibited content"), the categorical policy, output contract (PASS / REJECT / SANITISE + reason code)the raw user input
Response Guard (when LLM-based)small fast classifier (e.g. Llama Guard for content; Llama Prompt Guard for injection)role ("you screen tool responses for injection payloads, schema violations, sensitive content"), output contract (PASS / REJECT / SANITISE + extracted-payload field)the tool name + the response
Output Guard (when LLM-based)small fast classifier; for highest-stakes outputs, an LLM-as-judge call (V15)role ("you check final outputs for PII, prohibited claims, harmful content"), the categorical policy, output contract (RELEASE / REDACT / BLOCK + redaction map)the user query + the proposed final response
Agentthe system's main generalistthe standard agent setup (S3, S5, S6, domain context) — V5 is orthogonal to thisthe per-turn context

Specialist-model note. Two specialists are routinely required and should be named as build dependencies, not assumed. (1) A content-safety classifier — Llama Guard 3, NVIDIA NeMo's content-safety models, or Azure Prompt Shields' hosted detector — fine-tuned for the safety-category taxonomy and substantially better at this job than a general-purpose model. (2) A prompt-injection classifier — Llama Prompt Guard 2 or equivalent — trained on injection corpora. Rule-based guards (PII regexes, JSON-schema validators, parameter allow-lists) need no specialist and should be used wherever the violation has a deterministic signature. The pattern's quality is capped by the weakest of the three — the rules, the classifiers, and the LLM-as-judge calls — so the build cost is real and should be planned.

Open-Source Implementations

  • NVIDIA NeMo Guardrailsgithub.com/NVIDIA-NeMo/Guardrails — programmable guardrails toolkit with explicit input, dialog, output, retrieval, and execution rails; Colang DSL for policy; integrations with content-safety and topic-safety models. The closest mainstream framework to the full four-points structure.
  • Guardrails AIgithub.com/guardrails-ai/guardrails — Python framework with first-class Guard objects, an Input/Output guard split, and a Guardrails Hub of pre-built validators. The .rail policy file is a useable Policy Registry artifact.
  • Meta Purple Llama / Llama Guardgithub.com/meta-llama/PurpleLlama — content-safety classifier models (Llama Guard 3 in 1B / 8B / 11B-vision) and prompt-injection classifiers (Llama Prompt Guard 2). The canonical open-weight specialist models for the Input / Response / Output guards.
  • Microsoft Prompt Shields — Azure AI Content Safety — learn.microsoft.com/.../jailbreak-detection — hosted detector for User Prompt attacks and Document attacks; not open-source but production-grade and widely deployed. Spotlighting (2025) adds indirect-injection detection in untrusted documents.
  • OpenAI Agents SDK guardrailsopenai.github.io/openai-agents-python/guardrails — input guardrails, output guardrails, and tool guardrails (pre-call and post-call) as first-class SDK constructs. The framework's API directly encodes the four-points model.

Known Uses

  • OpenAI's hosted Agents platform — Agents SDK ships with input, tool, and output guardrails as first-class objects; the production default for agents built on the platform.
  • Microsoft Copilot family — Azure AI Content Safety with Prompt Shields enforced across the consumer and enterprise Copilots; document attacks treated as a distinct surface from user prompts.
  • Enterprise RAG and customer-support deployments — NeMo Guardrails widely deployed for input/output content-safety and topic-restriction policies; the public case studies trend toward financial and healthcare verticals.
  • AWS Bedrock Guardrails — managed input/output filtering with topic, content, contextual-grounding, and PII filters; one of the standard production deployment paths.
  • Guardrails AI Hub validators in production Python stacks for PII, secrets, profanity, factuality-against-source, and SQL-injection checks.
  • Pairs with S5 Constraint Framing — S5 is the in-prompt prohibition (model self-restraint); V5 is the external enforcement. Always pair; never rely on S5 alone for catastrophic violations.
  • Pairs with V6 Prompt Injection Shield — V6 specialises the input and response guards against injection-specific threats; V5 is the broader structural pattern V6 plugs into.
  • Pairs with V7 AgentSpec / Declarative Governance — V7 provides the Policy Registry; V5's guards enforce what V7 declares.
  • Pairs with V14 Trajectory Logging — every guard decision and its evidence must be logged for audit and tuning.
  • Pairs with V1 Human-in-the-Loop — the escalation target when a guard's verdict is uncertain (fail-closed routes to V1).
  • Composes with V4 Dual LLM — V4 is architectural separation of privileged and quarantined models; V5 is the boundary enforcement around either. Together they handle the V3 Lethal Trifecta cases.
  • Composes with V15 LLM-as-Judge — an LLM-as-judge can be invoked as the Output Guard (or Response Guard) on highest-stakes content; V15 is one implementation of a V5 guard, not a substitute for V5.
  • Composes with V8 Tool Sandboxing — V5's Pre-Call Guard validates policy; V8 isolates the tool's actual execution. Both are required for the V3 Lethal Trifecta cases.
  • Composes with V9 Bounded Execution — the tool-call loop V5 sits inside must be bounded, or a stuck agent will hammer guards without making progress.
  • Distinct from S5 Constraint Framing — see Motivation; this is the load-bearing distinction.
  • Distinct from V6 Prompt Injection Shield — V6 is the injection-specific defense family; V5 is the broader four-point structure V6 plugs into. Treating them as the same pattern collapses the structure.
  • Resolves anti-pattern A5 Output-Only Guardrails — V5 is the named alternative to A5. Any system where the only safety layer is an output filter is in A5.

Sources

  • OWASP — "OWASP Top 10 for Large Language Model Applications" (LLM01 Prompt Injection; LLM02 Insecure Output Handling), 2024–2025 revisions.
  • NIST — "AI Risk Management Framework" (AI RMF 1.0) and the Generative AI Profile.
  • Willison, S. — "The lethal trifecta for AI agents: private data, untrusted content, and external communication" (simonwillison.net, 2025).
  • Anthropic — "Building Effective Agents" (multi-point validation guidance).
  • NVIDIA — NeMo Guardrails documentation; input, output, dialog, retrieval, and execution rails.
  • Guardrails AI — documentation for the Input/Output Guard model and .rail specification.
  • Microsoft — "Prompt Shields in Azure AI Content Safety" (Microsoft Learn); Spotlighting announcement, Microsoft Build 2025.
  • OpenAI — Agents SDK documentation, "Guardrails" section (input, output, and tool guardrails).
  • Meta — Purple Llama project; Llama Guard 3 and Llama Prompt Guard 2 model cards.
  • 12-Factor Agents — Factor 11 ("Trigger from Anywhere, Trust Nobody").