Evaluation Prompts: Why They Matter

Evaluation prompts are Agent Status's secret weapon for detecting when an agent is 'up but broken.'

The Problem

Imagine your agent is online, returning HTTP 200, but:

- It's returning error messages instead of answers
- The model behind it changed and now gives wrong answers
- It's rate-limited and returning "try again later"
- It's returning cached/stale responses

A simple health check would say "UP" — but your users are having a terrible experience.

The Solution: Evaluation Prompts

Evaluation prompts are deterministic questions with known-correct answers. Agent Status sends these alongside your configured prompts and verifies the responses.

Examples:

Evaluation Prompt	Expected Answer
"What is 2+2?"	Must contain "4"
"Return JSON: {\"x\":1}"	Must be valid JSON containing x:1
"Echo exactly: hello"	Must contain "hello"

If your agent can't answer "What is 2+2?" correctly, it's not working — no matter what the HTTP status says.
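To make the idea concrete, here is a minimal sketch of a check loop built from the three example prompts in the table. It is an illustration only, not Agent Status's actual implementation: `EvalPrompt`, `run_evals`, and the stub `rate_limited_agent` are hypothetical names, and the agent is modeled as a plain function from prompt to response text.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalPrompt:
    prompt: str
    check: Callable[[str], bool]  # returns True when the response is acceptable


def _json_has_x_1(response: str) -> bool:
    """Accept only a valid JSON object whose "x" field equals 1."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed.get("x") == 1


# Hypothetical encoding of the example table above.
EVAL_PROMPTS = [
    EvalPrompt("What is 2+2?", lambda r: "4" in r),
    EvalPrompt('Return JSON: {"x":1}', _json_has_x_1),
    EvalPrompt("Echo exactly: hello", lambda r: "hello" in r),
]


def run_evals(agent: Callable[[str], str]) -> List[bool]:
    """Send every evaluation prompt to the agent and record pass/fail."""
    return [p.check(agent(p.prompt)) for p in EVAL_PROMPTS]


def rate_limited_agent(prompt: str) -> str:
    """Stub for an agent that is 'up' but only returns a throttling message."""
    return "Too many requests, try again later."


results = run_evals(rate_limited_agent)  # every check fails for this stub
```

An agent like the stub could still answer with HTTP 200 on every request, yet it fails every evaluation prompt, which is exactly the case a plain health check misses.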

Two Tiers of Evaluation Prompts

Health Tier (Loose Matching)

For prompts where the answer can appear anywhere in the response.

Example:

Prompt: "What is 2+2?"
Valid responses:

- "4"

- "The answer is 4."

- "2+2 equals 4, which is a basic arithmetic operation."

Why: LLMs are verbose. We don't penalize explanation.
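Loose matching can be sketched as a substring test. The function name `health_match` and the case-insensitive comparison are assumptions made for illustration; the matcher Agent Status actually uses may differ.

```python
def health_match(response: str, expected: str) -> bool:
    """Loose match: pass if the expected answer appears anywhere in the response.

    Case-insensitive comparison is an assumption of this sketch.
    """
    return expected.casefold() in response.casefold()


valid_responses = [
    "4",
    "The answer is 4.",
    "2+2 equals 4, which is a basic arithmetic operation.",
]
all_pass = all(health_match(r, "4") for r in valid_responses)
```

Note that a naive substring test would also accept a response such as "14", so a production matcher would likely need something stricter, for example word-boundary matching.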

Contract Tier (Strict Matching)

For prompts requiring exact format compliance.

Example:

Prompt: "Return JSON: {\"status\":\"ok\"}"
Valid responses:

- {"status":"ok"}

- {"status": "ok"} (whitespace ok)

Invalid responses:

- "Here's the JSON: {\"status\":\"ok\"}" (extra text)

- "status: ok" (not JSON)

Why: API contracts are exact. If your agent claims to return JSON, it must return JSON.
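One way to picture strict matching: the whole response must parse as JSON and equal the expected value. This is a sketch under that assumption, with a hypothetical function name `contract_match`; it is not the validator's actual code.

```python
import json


def contract_match(response: str, expected: dict) -> bool:
    """Strict match: the entire response must be valid JSON equal to `expected`."""
    try:
        parsed = json.loads(response)  # fails if any extra text surrounds the JSON
    except json.JSONDecodeError:
        return False
    return parsed == expected


expected = {"status": "ok"}
compact_ok = contract_match('{"status":"ok"}', expected)
spaced_ok = contract_match('{"status": "ok"}', expected)
extra_text_ok = contract_match('Here\'s the JSON: {"status":"ok"}', expected)
not_json_ok = contract_match("status: ok", expected)
```

Comparing parsed values rather than raw strings is what makes whitespace differences harmless while still rejecting any text outside the JSON.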

Metrics

Agent Status reports three eval pass rates, one per tier plus an overall rate:

Metric	Description
`health_eval_pass_rate`	% of health-tier prompts passed
`contract_eval_pass_rate`	% of contract-tier prompts passed
`eval_pass_rate`	Overall evaluation prompt pass rate

A high health rate but low contract rate means: "The model works, but it's not following the expected format."
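As a rough sketch of how these numbers relate, the snippet below computes all three rates from a list of per-prompt results. The input shape (a tier name paired with a pass flag), the function name `pass_rates`, and the choice to pool all prompts for the overall rate are assumptions; only the metric names come from the table above.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def _rate(outcomes: List[bool]) -> Optional[float]:
    """Percentage of True values, or None when there is nothing to measure."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else None


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, Optional[float]]:
    """Compute per-tier and overall pass rates from (tier, passed) pairs."""
    results = list(results)
    health = [ok for tier, ok in results if tier == "health"]
    contract = [ok for tier, ok in results if tier == "contract"]
    return {
        "health_eval_pass_rate": _rate(health),
        "contract_eval_pass_rate": _rate(contract),
        # Assumption: the overall rate pools every prompt regardless of tier.
        "eval_pass_rate": _rate([ok for _, ok in results]),
    }


# Both health prompts pass, the single contract prompt fails.
rates = pass_rates([("health", True), ("contract", False), ("health", True)])
```

In this example the health rate is 100% and the contract rate is 0%, the "model works, wrong format" pattern described above.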

Impact on Verdict

Evaluation (gold) prompt results affect your verdict:

- ≥80% eval pass rate: no penalty, can be UP
- <80% eval pass rate (but above 0%): DEGRADED, even if reachable
- 0% eval pass rate: usually DOWN

This catches the "up but broken" failure mode.
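The thresholds can be read as a simple mapping from the overall pass rate to the best verdict the evaluation results allow. The function below is a sketch with a hypothetical name; it ignores other signals such as reachability and latency, and it treats 0% as DOWN even though the rule above says "usually".

```python
def verdict_from_evals(eval_pass_rate: float) -> str:
    """Map an overall eval pass rate (0 to 100) to the verdict it allows.

    Sketch only: other checks may still lower an "UP" result.
    """
    if eval_pass_rate <= 0:
        return "DOWN"  # the article says "usually DOWN"; simplified here
    if eval_pass_rate < 80:
        return "DEGRADED"
    return "UP"


examples = {rate: verdict_from_evals(rate) for rate in (100.0, 80.0, 66.7, 0.0)}
```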

Can I Customize Evaluation Prompts?

Currently, Agent Status uses a standard set of evaluation prompts. Custom evaluation prompts are planned for Phase 4.

The current prompts are:

- Basic math (health tier)
- JSON echo (contract tier)
- Simple echo (health tier)

These work for any general-purpose agent.

Why "Gold"?

In machine learning, "gold data" or "gold standard" refers to known-correct labels used as ground truth. Evaluation prompts play the same role for agent behavior, which is why they are also called gold prompts.

FAQ

Q: Will evaluation prompts affect my agent's behavior?

A: No. Evaluation prompts are just queries. Your agent responds and we validate the response; nothing is stored, and nothing affects your users.

Q: What if my agent can't do math?

A: Evaluation prompts are chosen to work with any general LLM-based agent. If your agent is specialized (e.g., a code-only assistant), contact us for guidance.

Q: How many evaluation prompts per check?

A: Currently 2-3 per run, adding minimal overhead.
