This example demonstrates how to use Agent as Judge evaluation to assess the main agent’s output as a background task. Unlike blocking validation, background evaluation:
  • Does NOT block the response to the user
  • Logs evaluation results for monitoring and analytics
  • Can trigger alerts or store metrics without affecting latency
Use cases:
  • Quality monitoring in production
  • Compliance auditing
  • Detecting hallucinations or other inappropriate content
1. Create a Python file

background_output_evaluation.py
from agno.agent import Agent
from agno.db.sqlite import AsyncSqliteDb
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses
from agno.os import AgentOS

# Set up the database for agent and evaluation storage
db = AsyncSqliteDb(db_file="tmp/evaluation.db")

# Create the evaluator using Agent as Judge
evaluator = AgentAsJudgeEval(
    db=db,
    name="Response Quality Check",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    additional_guidelines=[
        "Evaluate if the response addresses the user's question directly",
        "Check if the information provided is correct and reliable",
        "Assess if the response is well-organized and easy to understand",
    ],
    threshold=7,
    run_in_background=True,  # Runs evaluation without blocking the response
)

# Create the main agent with Agent as Judge evaluation
main_agent = Agent(
    id="support-agent",
    name="CustomerSupportAgent",
    model=OpenAIResponses(id="gpt-5.2"),
    instructions=[
        "You are a helpful customer support agent.",
        "Provide clear, accurate, and friendly responses.",
        "If you don't know something, say so honestly.",
    ],
    db=db,
    post_hooks=[evaluator],  # Automatically evaluates each response
    markdown=True,
)

# Create AgentOS
agent_os = AgentOS(agents=[main_agent])
app = agent_os.get_app()


if __name__ == "__main__":
    agent_os.serve(app="background_output_evaluation:app", port=7777, reload=True)
2. Set up your virtual environment

uv venv --python 3.12
source .venv/bin/activate
3. Install dependencies

uv pip install -U agno openai uvicorn
4. Export your OpenAI API key

export OPENAI_API_KEY="your_openai_api_key_here"
5. Run the server

python background_output_evaluation.py
6. Test the endpoint

curl -X POST http://localhost:7777/agents/support-agent/runs \
  -F "message=How do I reset my password?" \
  -F "stream=false"
The response will be returned immediately. The evaluation runs in the background and results are stored in the database.
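Once a few evaluations have run, you can inspect the stored results directly. The sketch below opens tmp/evaluation.db with Python's built-in sqlite3 module; it does not assume any particular table or column names, so it simply lists whatever tables AsyncSqliteDb created and prints a few rows from each.

inspect_evaluations.py
# Peek at the evaluation database created by the example above.
# Table and column names are whatever AsyncSqliteDb creates; none are assumed here.
import sqlite3

conn = sqlite3.connect("tmp/evaluation.db")
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
]
print("Tables:", tables)

for table in tables:
    print(f"\n{table} (first 5 rows):")
    for row in conn.execute(f'SELECT * FROM "{table}" LIMIT 5'):
        print(row)

conn.close()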

What Happens

  1. User sends a request to the agent
  2. The agent processes and generates a response
  3. The response is sent to the user immediately
  4. Background evaluation runs:
    • AgentAsJudgeEval automatically evaluates the response against the criteria
    • Scores the response on a scale of 1-10
    • Stores results in the database

Production Extensions

In production, you could extend this pattern to:
  • Database Storage: Store evaluations for analytics dashboards
  • Alerting: Use the on_fail callback to send alerts when evaluations fail (see the sketch below)
  • Observability: Log to platforms such as Datadog or OpenTelemetry
  • A/B Testing: Compare response quality across model versions
  • Training Data: Build datasets for fine-tuning
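For the alerting extension, a minimal sketch is shown below. It assumes AgentAsJudgeEval accepts an on_fail callback (as the list above suggests) and that the callback receives the evaluation result; the exact parameter name and signature may differ, so treat this as illustrative rather than definitive.

alerting_evaluator.py
import logging

from agno.db.sqlite import AsyncSqliteDb
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses

logger = logging.getLogger("response-quality")
db = AsyncSqliteDb(db_file="tmp/evaluation.db")


def alert_on_failed_evaluation(result):
    # 'result' is assumed to carry the judge's score and reasoning;
    # replace this log line with a pager, Slack, or webhook call as needed.
    logger.warning("Response quality evaluation failed: %s", result)


evaluator = AgentAsJudgeEval(
    db=db,
    name="Response Quality Check",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    threshold=7,
    run_in_background=True,
    on_fail=alert_on_failed_evaluation,  # assumed parameter name, per the list above
)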
Background evaluation is ideal for quality monitoring without impacting user experience. For scenarios where you need to block bad responses, use synchronous hooks instead.
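As a reference point, one possible blocking variant is sketched below. It assumes that omitting run_in_background=True makes the post hook finish evaluating before the response is returned; check the synchronous-hook documentation for the supported way to reject failing responses.

blocking_output_evaluation.py
from agno.agent import Agent
from agno.db.sqlite import AsyncSqliteDb
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses

db = AsyncSqliteDb(db_file="tmp/evaluation.db")

# Hypothetical blocking variant: without run_in_background=True, the evaluation
# is assumed to complete before the response is returned (verify against the docs).
blocking_evaluator = AgentAsJudgeEval(
    db=db,
    name="Response Quality Gate",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    threshold=7,
)

gated_agent = Agent(
    id="gated-support-agent",
    name="GatedCustomerSupportAgent",
    model=OpenAIResponses(id="gpt-5.2"),
    db=db,
    post_hooks=[blocking_evaluator],  # evaluation latency is now added to every response
    markdown=True,
)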