Evaluation Prompts: Why They Matter

Evaluation prompts are Agent Status's secret weapon for detecting when an agent is 'up but broken.'

The Problem

Imagine your agent is online, returning HTTP 200, but:

  • It's returning error messages instead of answers
  • The model behind it changed and now gives wrong answers
  • It's rate-limited and returning "try again later"
  • It's returning cached/stale responses

A simple health check would say "UP" — but your users are having a terrible experience.

The Solution: Evaluation Prompts

Evaluation prompts are deterministic questions with known-correct answers. Agent Status sends these alongside your configured prompts and verifies the responses.

Examples:

Evaluation Prompt          | Expected Answer
"What is 2+2?"             | Must contain "4"
"Return JSON: {\"x\":1}"   | Must be valid JSON containing x:1
"Echo exactly: hello"      | Must contain "hello"

If your agent can't answer "What is 2+2?" correctly, it's not working — no matter what the HTTP status says.
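To make the difference concrete, here is a minimal sketch comparing a status-only health check with an evaluation-prompt check. The `AgentResponse` type and both function names are hypothetical illustrations, not part of Agent Status; the simulated replies stand in for what an agent might return for "What is 2+2?".

```python
from dataclasses import dataclass


@dataclass
class AgentResponse:
    """Hypothetical stand-in for whatever an agent endpoint returns."""
    status_code: int
    body: str


def naive_health_check(resp: AgentResponse) -> bool:
    # Status-only check: any HTTP 200 counts as healthy.
    return resp.status_code == 200


def evaluation_check(resp: AgentResponse, expected: str) -> bool:
    # Evaluation-prompt check: the body must also contain the known answer.
    return resp.status_code == 200 and expected in resp.body


# Simulated replies to the prompt "What is 2+2?"
working = AgentResponse(200, "The answer is 4.")
rate_limited = AgentResponse(200, "Too many requests, try again later.")

print(naive_health_check(rate_limited), evaluation_check(rate_limited, "4"))  # True False
print(naive_health_check(working), evaluation_check(working, "4"))            # True True
```

The rate-limited agent passes the status-only check but fails the evaluation check, which is the "up but broken" case described above.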

Two Tiers of Evaluation Prompts

Health Tier (Loose Matching)

For prompts where the answer can appear anywhere in the response.

Example:

  • Prompt: "What is 2+2?"
  • Valid responses:
      - "4"
      - "The answer is 4."
      - "2+2 equals 4, which is a basic arithmetic operation."

Why: LLMs are verbose. We don't penalize explanation.

Contract Tier (Strict Matching)

For prompts requiring exact format compliance.

Example:

  • Prompt: "Return JSON: {\"status\":\"ok\"}"
  • Valid responses:
      - {"status":"ok"}
      - {"status": "ok"} (whitespace is fine)
  • Invalid responses:
      - "Here's the JSON: {\"status\":\"ok\"}" (extra text)
      - "status: ok" (not JSON)

Why: API contracts are exact. If your agent claims to return JSON, it must return JSON.
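The two tiers can be contrasted with a short sketch. This assumes loose matching is a plain substring test and strict matching means the entire response parses as JSON and equals the expected value; the function names `health_match` and `contract_match` are hypothetical, and Agent Status's actual matching rules may differ.

```python
import json


def health_match(response: str, expected: str) -> bool:
    # Loose: the expected answer may appear anywhere in the response.
    return expected in response


def contract_match(response: str, expected: object) -> bool:
    # Strict: the whole response must parse as JSON and equal the expected value.
    try:
        return json.loads(response) == expected
    except json.JSONDecodeError:
        return False


# Health tier: verbose answers still pass.
print(health_match("2+2 equals 4, which is a basic arithmetic operation.", "4"))

# Contract tier: whitespace is tolerated, surrounding text is not.
expected = {"status": "ok"}
print(contract_match('{"status": "ok"}', expected))
print(contract_match('Here\'s the JSON: {"status":"ok"}', expected))
print(contract_match("status: ok", expected))
```

Comparing parsed values rather than raw strings is what lets `{"status":"ok"}` and `{"status": "ok"}` both pass. Note that a plain substring test would also accept a response such as "14"; whether the real health tier guards against that is not stated here.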

Metrics

Agent Status reports three eval pass rates:

Metric                   | Description
health_eval_pass_rate    | % of health-tier prompts passed
contract_eval_pass_rate  | % of contract-tier prompts passed
eval_pass_rate           | Overall evaluation prompt pass rate

A high health rate but low contract rate means: "The model works, but it's not following the expected format."
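As a rough illustration of how these three numbers relate, the sketch below computes them from a list of per-prompt results. The `EvalResult` type is hypothetical, and the overall rate is assumed to be the unweighted share of all prompts passed; the source does not specify how Agent Status aggregates it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EvalResult:
    tier: str      # "health" or "contract"
    passed: bool


def pass_rate(results: List[EvalResult]) -> Optional[float]:
    # Percentage of prompts passed; None when there is nothing to measure.
    if not results:
        return None
    return 100.0 * sum(r.passed for r in results) / len(results)


def eval_metrics(results: List[EvalResult]) -> Dict[str, Optional[float]]:
    health = [r for r in results if r.tier == "health"]
    contract = [r for r in results if r.tier == "contract"]
    return {
        "health_eval_pass_rate": pass_rate(health),
        "contract_eval_pass_rate": pass_rate(contract),
        # Assumption: overall rate is unweighted across every prompt.
        "eval_pass_rate": pass_rate(results),
    }


# Two health prompts pass, the one contract prompt fails.
results = [
    EvalResult("health", True),
    EvalResult("health", True),
    EvalResult("contract", False),
]
metrics = eval_metrics(results)
print(metrics)
```

In this example the health rate is 100% while the contract rate is 0%: the model answers correctly but does not follow the expected format.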

Impact on Verdict

Evaluation prompt (gold prompt) results affect your verdict:

  • ≥80% eval pass rate: No penalty; the verdict can be UP
  • Above 0% but below 80% eval pass rate: DEGRADED (even if the agent is reachable)
  • 0% eval pass rate: Usually DOWN

This catches the "up but broken" failure mode.
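The thresholds can be sketched as a small function. This is a simplification under stated assumptions: the function name `verdict` is hypothetical, an unreachable agent is assumed to be DOWN, and the hedges in the rules ("can be UP", "usually DOWN") are collapsed into fixed outcomes, so other signals that Agent Status may weigh are ignored.

```python
def verdict(reachable: bool, eval_pass_rate: float) -> str:
    """Map an eval pass rate (0-100) to a verdict, using the thresholds above."""
    if not reachable:
        return "DOWN"          # Assumption: unreachable agents are DOWN.
    if eval_pass_rate == 0:
        return "DOWN"          # The source says "usually DOWN"; simplified here.
    if eval_pass_rate < 80:
        return "DEGRADED"      # Reachable, but too many evaluation prompts fail.
    return "UP"                # No eval penalty; other checks are ignored here.


for rate in (100.0, 80.0, 66.7, 0.0):
    print(rate, verdict(True, rate))
```

A reachable agent that passes only two of three evaluation prompts (about 66.7%) lands in DEGRADED rather than UP.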

Can I Customize Evaluation Prompts?

Currently, Agent Status uses a standard set of evaluation prompts. Custom evaluation prompts are planned for Phase 4.

The current prompts are:

  • Basic math (health tier)
  • JSON echo (contract tier)
  • Simple echo (health tier)

These prompts work for any general-purpose agent.

Why "Gold"?

Evaluation prompts are sometimes referred to as "gold prompts." In machine learning, "gold data" or "gold standard" refers to known-correct labels used as ground truth. Evaluation prompts are our ground truth for agent behavior.


FAQ

Q: Will evaluation prompts affect my agent's behavior?

A: No. Evaluation prompts are just queries. Your agent responds and we validate the response; nothing is stored, and your users are not affected.

Q: What if my agent can't do math?

A: Evaluation prompts are chosen to work with any general LLM-based agent. If your agent is specialized (e.g., a code-only assistant), contact us for guidance.

Q: How many evaluation prompts per check?

A: Currently 2-3 per run, adding minimal overhead.
