Evaluation Prompts: Why They Matter
Evaluation prompts are Agent Status's secret weapon for detecting when an agent is 'up but broken.'
The Problem
Imagine your agent is online, returning HTTP 200, but:
- It's returning error messages instead of answers
- The model behind it changed and now gives wrong answers
- It's rate-limited and returning "try again later"
- It's returning cached/stale responses
A simple health check would say "UP" — but your users are having a terrible experience.
The Solution: Evaluation Prompts
Evaluation prompts are deterministic questions with known-correct answers. Agent Status sends these alongside your configured prompts and verifies the responses.
Examples:
| Evaluation Prompt | Expected Answer |
|---|---|
| "What is 2+2?" | Must contain "4" |
| "Return JSON: {\"x\":1}" | Must be valid JSON containing x:1 |
| "Echo exactly: hello" | Must contain "hello" |
If your agent can't answer "What is 2+2?" correctly, it's not working — no matter what the HTTP status says.
Two Tiers of Evaluation Prompts
Health Tier (Loose Matching)
For prompts where the answer can appear anywhere in the response.
Example:
- Prompt: "What is 2+2?"
- Valid responses:
  - "The answer is 4."
  - "2+2 equals 4, which is a basic arithmetic operation."
Why: LLMs are verbose. We don't penalize explanation.
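To make loose matching concrete, here is a minimal sketch in Python. It assumes the health-tier check is a plain substring test; the function name `health_check` and its parameters are hypothetical and are not part of Agent Status's actual implementation.

```python
def health_check(response: str, expected: str) -> bool:
    """Loose match: pass if the expected text appears anywhere in the response.

    Assumption: a plain substring test. The real matcher may normalize
    case, whitespace, or punctuation in ways not shown here.
    """
    return expected in response


# Both verbose answers from the example above pass.
print(health_check("The answer is 4.", "4"))
print(health_check("2+2 equals 4, which is a basic arithmetic operation.", "4"))

# A rate-limit message does not contain the expected answer, so it fails.
print(health_check("try again later", "4"))
```

A substring test this simple would also accept a response such as "14", so a production matcher would likely need something stricter, for example matching on word boundaries.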
Contract Tier (Strict Matching)
For prompts requiring exact format compliance.
Example:
- Prompt: "Return JSON: {\"status\":\"ok\"}"
- Valid responses:
  - {"status":"ok"}
  - {"status": "ok"} (whitespace is fine)
- Invalid responses:
  - "status: ok" (not JSON)
Why: API contracts are exact. If your agent claims to return JSON, it must return JSON.
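Strict matching can be sketched along the same lines. The example below assumes the contract-tier check parses the response as JSON and compares it with the expected value for equality; `contract_check` is a hypothetical name, and the real validator may apply different rules (for instance, allowing extra keys).

```python
import json
from typing import Any


def contract_check(response: str, expected: Any) -> bool:
    """Strict match: the response must be valid JSON equal to the expected value.

    Assumption: equality after parsing, so whitespace and key order
    do not matter, but any non-JSON text fails.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False  # not JSON at all, e.g. "status: ok"
    return parsed == expected


expected = {"status": "ok"}
print(contract_check('{"status":"ok"}', expected))   # compact form
print(contract_check('{"status": "ok"}', expected))  # extra whitespace
print(contract_check("status: ok", expected))        # not JSON
```

Comparing parsed values rather than raw strings is what makes whitespace differences harmless while still rejecting anything that is not JSON.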
Metrics
Agent Status reports three eval pass rates:
| Metric | Description |
|---|---|
| health_eval_pass_rate | % of health-tier prompts passed |
| contract_eval_pass_rate | % of contract-tier prompts passed |
| eval_pass_rate | Overall evaluation prompt pass rate |
A high health rate but low contract rate means: "The model works, but it's not following the expected format."
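The following sketch shows one way these rates could be computed from individual prompt results. The `EvalResult` type and the helper functions are hypothetical, and pooling both tiers for the overall rate is an assumption; the article does not say how `eval_pass_rate` is weighted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    tier: str     # "health" or "contract"
    passed: bool


def pass_rate(results: list[EvalResult]) -> Optional[float]:
    """Percentage of results that passed, or None if there are no results."""
    if not results:
        return None
    return 100.0 * sum(r.passed for r in results) / len(results)


def eval_metrics(results: list[EvalResult]) -> dict[str, Optional[float]]:
    health = [r for r in results if r.tier == "health"]
    contract = [r for r in results if r.tier == "contract"]
    return {
        "health_eval_pass_rate": pass_rate(health),
        "contract_eval_pass_rate": pass_rate(contract),
        # Assumption: the overall rate pools all prompts from both tiers.
        "eval_pass_rate": pass_rate(results),
    }


# The model answers correctly but ignores the required format.
run = [
    EvalResult("health", True),
    EvalResult("health", True),
    EvalResult("contract", False),
]
print(eval_metrics(run))
```

In this run the health rate is 100% and the contract rate is 0%, which is the "works but does not follow the format" pattern described above.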
Impact on Verdict
Evaluation prompt (also called gold prompt) results affect your verdict:
- ≥80% eval pass rate: no penalty; the agent can be UP
- Above 0% but below 80% eval pass rate: DEGRADED (even if reachable)
- 0% eval pass rate: usually DOWN
This catches the "up but broken" failure mode.
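As a rough illustration, the thresholds above can be written as a small function. This is a simplified sketch, not the actual verdict logic: `verdict` is a hypothetical name, treating an unreachable agent as DOWN is an assumption, and the real verdict presumably weighs other signals as well (the article only says the agent "can be" UP and is "usually" DOWN).

```python
def verdict(reachable: bool, eval_pass_rate: float) -> str:
    """Map reachability and the overall eval pass rate (0-100) to a verdict."""
    if not reachable:
        return "DOWN"       # assumption: unreachable agents are DOWN
    if eval_pass_rate == 0:
        return "DOWN"       # the article says "usually DOWN"
    if eval_pass_rate < 80:
        return "DEGRADED"   # reachable, but failing too many prompts
    return "UP"             # no eval penalty; other checks may still apply


print(verdict(True, 100.0))
print(verdict(True, 50.0))
print(verdict(True, 0.0))
```

An agent that returns HTTP 200 but fails every evaluation prompt ends up DOWN here, even though a plain health check would report it as UP.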
Can I Customize Evaluation Prompts?
Currently, Agent Status uses a standard set of evaluation prompts. Custom evaluation prompts are planned for Phase 4.
The current prompts are:
These work for any general-purpose agent.
Why "Gold"?
In machine learning, "gold data" or "gold standard" refers to known-correct labels used as ground truth. Evaluation prompts, sometimes called gold prompts, serve as our ground truth for agent behavior.
FAQ
Q: Will evaluation prompts affect my agent's behavior?
A: No. Evaluation prompts are ordinary queries. Your agent responds and we validate the response; nothing is stored, and your users are not affected.
Q: What if my agent can't do math?
A: Evaluation prompts are chosen to work with any general LLM-based agent. If your agent is specialized (e.g., a code-only assistant), contact us for guidance.
Q: How many evaluation prompts per check?
A: Currently 2-3 per run, adding minimal overhead.