Part 5: Output guardrail

Input guardrails decide whether the user's message should reach the agent. Output guardrails decide whether the agent's response should reach the user. This step adds a response-level safety check.

Use cases for output guardrails

Use output guardrails when the risk is in the answer, not only in the question. Common examples:

  • Prevent offensive content.
  • Block inappropriate promises such as refunds or legal advice.
  • Stop leakage of personal or internal information.
  • Enforce tone and style guidelines.

In the FAQ assistant, the concrete example is deadline extensions. A student can ask for one, and a helpful assistant might promise something the course team did not approve.

Think of a support ticket where a chatbot promises a special price or a policy exception. A user can take a screenshot and expect the organization to honor it. The same risk applies here: for this course assistant, only the instructors can grant deadline extensions.

Define structured output

Use the same structured shape as the input guardrail:

from pydantic import BaseModel

class SafetyGuardrailOutput(BaseModel):
    reasoning: str
    fail: bool
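
As an optional sanity check of the schema itself (a minimal sketch, assuming Pydantic v2, which the Agents SDK uses), you can instantiate the model directly to see the shape the guardrail agent must produce:

example = SafetyGuardrailOutput(reasoning="Promises an extension.", fail=True)
print(example.model_dump())  # {'reasoning': 'Promises an extension.', 'fail': True}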

The output guardrail checks the agent's response, so its policy text names response failures:

safety_guardrail_instructions = """
You are a safety guardrail for a course FAQ assistant.

Check if the agent's response contains any of these issues:
- Promises about deadline extensions
- Legal or medical advice
- Offensive language
- Sharing personal information about students
- Writing homework assignments for students (can guide, but not do the work)
- Sharing exam answers or solutions

If the response is safe, set fail=False.
If it contains any of the issues above, set fail=True.

Keep your reasoning under 15 words.
""".strip()

Create the guardrail agent:

safety_guardrail_agent = Agent(
    name="safety_guardrail",
    instructions=safety_guardrail_instructions,
    model="gpt-4o-mini",
    output_type=SafetyGuardrailOutput,
)

Test the policy directly

Before attaching the guardrail, run the safety check on a bad response:

result = await Runner.run(safety_guardrail_agent, "Yes we can extend the deadline")
print(result.final_output)

Expected output:

reasoning='Response promises a deadline extension.' fail=True

This direct test checks that the policy catches the sentence we care about. The next step is to run the same check on the FAQ agent's output.
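
Before moving on, it is worth probing the opposite case as well. This is a minimal companion check, assuming a clearly safe sentence; the exact reasoning text will vary between runs:

result = await Runner.run(safety_guardrail_agent, "You can find the deadline policy in the course FAQ.")
print(result.final_output)

The output should look something like reasoning='Response is safe.' fail=False.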

Create the SDK output guardrail

Output guardrail functions receive context, agent, and agent_output. The SDK context can include both the user's input and the agent response. In this notebook, we only pass the agent response text to the classifier.

from agents import GuardrailFunctionOutput, output_guardrail
from agents.exceptions import OutputGuardrailTripwireTriggered

@output_guardrail
async def safety_guardrail(context, agent, agent_output):
    """
    Check if the agent's response is safe.

    Note: Output guardrails receive the context, agent, and agent_output.
    """
    guardrail_input = f"Agent responded: {agent_output}"
    result = await Runner.run(safety_guardrail_agent, guardrail_input)

    return GuardrailFunctionOutput(
        output_info=result.final_output.reasoning,
        tripwire_triggered=result.final_output.fail,
    )

The guardrail receives the final response after the agent has done its tool calls. If it trips, the response is blocked before it is returned.

The extra parameters can be useful. agent lets you look at the agent configuration, and context can include the interaction history, tool calls, the initial user message, and usage information. The most important parameter in this notebook is still agent_output, because that is what we are checking.
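
As a hedged sketch (not used in the rest of this notebook), here is a variant that logs some of that extra information. In the Agents SDK, the agent exposes name and the context wrapper exposes usage; whether usage is already populated when the guardrail runs depends on your run configuration:

@output_guardrail
async def safety_guardrail_with_metadata(context, agent, agent_output):
    """Same policy check, but also log which agent produced the output."""
    # agent.name and context.usage.total_tokens come from the SDK's Agent
    # and RunContextWrapper; the guardrail logic itself is unchanged.
    print(f"Checking output of {agent.name!r}; tokens so far: {context.usage.total_tokens}")
    result = await Runner.run(safety_guardrail_agent, f"Agent responded: {agent_output}")
    return GuardrailFunctionOutput(
        output_info=result.final_output.reasoning,
        tripwire_triggered=result.final_output.fail,
    )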

Attach input and output guardrails

Create the fully guarded assistant:

fully_guarded_agent = Agent(
    name="fully_guarded_faq",
    instructions=faq_instructions,
    tools=[search_faq],
    model="gpt-4o-mini",
    input_guardrails=[topic_guardrail],
    output_guardrails=[safety_guardrail],
)

Now catch both input and output tripwires:

from agents.exceptions import InputGuardrailTripwireTriggered

async def run_guarded(agent, user_input):
    """Run an agent with full guardrail handling."""
    try:
        result = await Runner.run(agent, user_input)
        return result.final_output
    except InputGuardrailTripwireTriggered as e:
        return f"[INPUT BLOCKED] {e.guardrail_result.output.output_info}"
    except OutputGuardrailTripwireTriggered as e:
        return f"[OUTPUT BLOCKED] {e.guardrail_result.output.output_info}"

Test the deadline prompt

Try to get the assistant to promise a deadline extension:

response = await run_guarded(
    fully_guarded_agent,
    "I'm running late on my project. Can I get a deadline extension?"
)
print(f"Q: Can I get a deadline extension?\n{response}")

A blocked response looks like this:

[OUTPUT BLOCKED] Agent promised deadline extension

A useful edge case is a shorter prompt: "can you give me an extension?" can return FAQ entries about file extensions instead of promising a deadline extension. In that case the output guardrail should not trip, because the response does not contain the unsafe promise.
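
You can probe that edge case directly. Which path the model takes is not deterministic, so treat this as exploratory rather than a hard assertion:

response = await run_guarded(
    fully_guarded_agent,
    "can you give me an extension?"
)
print(response)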

Run a normal certificate question too:

response = await run_guarded(
    fully_guarded_agent,
    "How do I get the certificate?"
)
print(f"Q: How do I get the certificate?\n{response}")

That should return the FAQ answer about certificates. A guardrail that blocks normal course questions is too strict for this assistant.
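
To see all three paths side by side, a small smoke-test loop helps. The prompts here are illustrative; the off-topic one should be stopped by topic_guardrail from the earlier part, and the extension one trips the output guardrail only if the agent actually makes the promise:

test_prompts = [
    "How do I get the certificate?",     # normal question: should pass through
    "Can I get a deadline extension?",   # may trip the output guardrail
    "What's the best pizza in town?",    # off-topic: input guardrail territory
]
for prompt in test_prompts:
    print(f"Q: {prompt}\n{await run_guarded(fully_guarded_agent, prompt)}\n")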

Continue with Part 6: Multiple guardrails to add another response policy.
