AgentStatus by Carmel Labs · June 2026

The Two Failures Hiding in LLM-as-a-Judge

Calibration problems shrink with better technique. Competence problems do not. Why the dominant evaluation paradigm has a structural ceiling, and the two methods older than language models that get past it.

51%: Production validations returned degraded

2: Distinct failures inside the paradigm

3: Tiers of metamorphic relation

0: Gold labels required

agentstatus.dev | June 2026

1. The Wall

51% of validations came back degraded.

Earlier this month we ran an outside-in validation sweep against three production customer-service agents hosted on a major communications platform. The agents had been live for months. They were monitored by their operator's internal stack. They were green on the platform's own dashboard.

Fifty-one percent of validations came back degraded.

Not down. Not unreachable. Degraded. They answered the phone. They responded to questions. They produced output that platform-level monitoring counted as a success. And on more than half of the runs, the output failed.

This is not unusual. We see versions of this pattern in nearly every voice and chat agent we validate from outside the operator's infrastructure. An agent can be one hundred percent up and zero percent useful at the same time. The reason this is possible is not that the operators are incompetent or their tools are broken. The reason is that the dominant paradigm for evaluating AI agents in production has a structural ceiling, and the entire industry is currently competing on the wrong side of that ceiling.

The ceiling has a name. It is LLM-as-a-judge.

100%: Uptime on the operator's dashboard

51%: Validations degraded from outside-in

3: Production agents in the sweep

If you have not heard the term, here is the short version. To evaluate an AI agent at scale, you have to grade its outputs. Hiring humans to read and score every output is too slow and too expensive once you are running thousands of tests a day. So the industry settled, around 2023, on a different approach: use another language model to do the grading. You give the judge model the user's question, the agent's answer, and a rubric describing what a good answer looks like. The judge returns a verdict. PASS, FAIL, or some score in between.

This solves a real problem. A team running thousands of test prompts per day cannot have a human read every response, but a judge model can process them in seconds at near-zero cost. The approach scaled beautifully. By 2024 it was the dominant pattern in agent evaluation, and by 2025, almost every serious evaluation product on the market, and the internal evaluation infrastructure at almost every serious AI company, rested on some version of it.

The variations are real. The paradigm is the same: write a set of test prompts, capture the agent's outputs, ask a language model to score whether each output was acceptable. The scoring model is often more capable than the agent being scored. Sometimes two scorers vote. Sometimes the prompts are rotated, the scorers debiased, the rubrics tightened.
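The loop is short enough to sketch. This is a minimal illustration, not any vendor's implementation: `GoldCase`, `run_gold_suite`, `toy_agent`, and `toy_judge` are hypothetical names, and the judge here is a keyword stub so the example runs offline. A real judge would be a language-model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldCase:
    prompt: str
    rubric: str  # what a good answer looks like, written by the test author


def run_gold_suite(
    agent: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
    cases: list[GoldCase],
) -> float:
    """Return the fraction of gold cases the judge marks as PASS."""
    passed = 0
    for case in cases:
        answer = agent(case.prompt)
        if judge(case.prompt, answer, case.rubric):
            passed += 1
    return passed / len(cases)


# Hypothetical stand-ins so the sketch runs without a model.
def toy_agent(prompt: str) -> str:
    return "You can reset your password from the login page."


def toy_judge(prompt: str, answer: str, rubric: str) -> bool:
    # A real judge would be a language-model call; this stub checks a keyword.
    return rubric.lower() in answer.lower()


cases = [
    GoldCase("How do I reset my password?", "login page"),
    GoldCase("Can I get a refund?", "refund policy"),
]
pass_rate = run_gold_suite(toy_agent, toy_judge, cases)
```

Whatever the score says, it is bounded by the two things passed in: the list of cases someone wrote down, and the function that returns the verdict.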

Most of the public conversation about evaluation has settled into refining this paradigm. Better judges. More representative test sets. Calibration techniques. Position-bias mitigations. The papers improve. The judges get sharper.

And the underlying problem does not move.

This essay argues that the reason it does not move is that LLM-as-a-judge has two distinct failures inside it, and the industry has been working on one of them while ignoring the other. The one being worked on is real but mostly tractable. The one being ignored is structural and cannot be tuned away. Until that distinction is made clear, the paradigm cannot improve past its current ceiling, and the agents we validate from outside will keep failing in ways their operators never see.

The good news is that the failure ignored by the industry is also the failure that opens the cleanest path out of the paradigm entirely. The path requires giving up on a comforting assumption (that we will eventually have a judge smart enough to evaluate any agent in any domain) and replacing it with two ideas older than language models and much harder to defeat.

We will get to those ideas. First we have to be precise about the wall.

2. Two Failures

Not one operation. Two, stitched together.

When practitioners talk about LLM-as-a-judge, they usually have in mind a single workflow: input goes to the agent, the agent produces output, the judge scores it. They reason about the workflow as if it were one operation with one failure mode. It isn't. It is two operations stitched together, and each has its own distinct failure surface. Conflating them is the source of most muddy thinking about evaluation.

The first failure is in the input.

A test set is a finite collection of inputs. To say "the agent passed our evaluation" is to say "the agent passed on the inputs we chose to test." The implicit claim is that the inputs are representative, meaning performance on the test set predicts performance on real production traffic.

For most agent domains, this claim is false. Production input distributions for open-world agents are not stable. They drift as users discover new use cases, as the world changes, as adversaries probe. A test set written by the operator's team in March is sampled from the operator's team's imagination. Real users in October are not sampled from that distribution.

This is the enumeration failure. It is not a question of how many test cases you have. The 4,000th test case covers its own point and a small neighborhood around it. The uncovered region of the input space is not a fixed territory you are slowly filling in. It is an open frontier that grows as you explore it, because every new agent capability and every new user segment opens new regions to probe. You are not painting a wall. You are painting a wall that extends itself wherever you are not looking.

The second failure is in the output.

Suppose the input problem were solved. Suppose, by some unrealistic stroke of luck, the test set were perfectly representative. You still have to decide whether each output the agent produced was correct, and for any output beyond exact-match (and open-ended agent output is always beyond exact-match) something has to make that decision. The dominant answer is a judge model.

The judge model's failures come in two layers, and the distinction between these layers is the most important idea in this essay.

The shallow layer is calibration. Judges have biases. They prefer outputs from their own model family (self-preference). They reward confident, fluent prose over correct content (fluency and verbosity bias). They favor an option because of where it appears in a pairwise comparison, often the first (position bias). They give different scores on reruns of the same input (judge non-determinism).

These are real problems. They are also, in principle, correctable. Swap judges. Use consensus. Randomize position. Debias systematically. Most of the academic and industry work on LLM-as-a-judge is, at heart, work on calibration. Dual-judge designs, multi-model panels, debiased rubrics: all calibration techniques. The field has made meaningful progress on this layer, and it will keep making progress.
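One of these corrections, position randomization, can be sketched as follows. It assumes a pairwise judge that returns "first" or "second". The names `debiased_preference`, `always_first`, and `prefers_longer` are hypothetical, and the two toy judges exist only to show the behavior.

```python
from typing import Callable

# A pairwise judge returns "first" or "second" for the answer it prefers.
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, question: str, a: str, b: str) -> str:
    """Ask the judge in both orders; report a winner only if the verdicts agree."""
    forward = judge(question, a, b)   # a shown first
    backward = judge(question, b, a)  # b shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with position, so treat it as no signal


# Hypothetical judges for illustration.
def always_first(question: str, x: str, y: str) -> str:
    return "first"  # pure position bias


def prefers_longer(question: str, x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"
```

The position-biased judge collapses to "tie", which is the correction working. The length-biased judge still returns a stable, confident winner. An order-stable verdict is not necessarily a correct one.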

The deep layer is competence.

Competence asks a different question. Not "is the judge biased toward this output or that one," but "does the judge know what correct looks like in this domain at all." A judge can be perfectly unbiased and still be incompetent. Fluent, even-handed, confident, and wrong, because it has no reliable internal model of what a correct answer in this domain even is.

For most general-purpose tasks, modern frontier models have reasonable internal models of correctness. They can judge a Python function. They can judge a paragraph of English prose. They can judge a customer-service reply in a familiar register. The competence ceiling does not bite in the most-tested domains, which is exactly why most published evaluation research does not encounter it.

The ceiling bites hard, and structurally, in any specialized or safety-critical domain. Trades. Healthcare. Finance. Industrial operations. Insurance claims. Legal review. Domains where training data is sparse, where there is no canonical authoritative corpus, where the failure cost is physical or regulated.

In those domains, a judge is being asked to evaluate correctness on terrain it has never reliably mapped. Worse, because the judge is fluent and confident in its output, the failures look like signal. Two judges that both lack HVAC competence and agree that the agent gave good HVAC advice do not, between them, somehow produce HVAC competence. They produce confidently expressed shared ignorance, which is strictly worse than one judge expressing uncertainty, because agreement reads as a signal of correctness when it is actually a signal of nothing.

Memorize this if nothing else from this essay survives: Calibration problems shrink with better technique. Competence problems do not.

A better dual-judge design will give you more accurate scores in domains where the judges know what correct looks like. It will give you more confidently expressed nonsense in domains where they don't. No amount of cross-judge consensus repairs a judge that doesn't have the underlying knowledge.

The eval industry, as far as I can tell, is mostly working on calibration. The competence ceiling is the real wall, and it is the one nobody can technique past, because it is not a flaw in the judge. It is a flaw in using judgment at all for a domain the judge does not command.

3. Why It Compounds

The failures are multiplicative, not additive.

The enumeration failure and the judgment failure interact. They are not additive. They are multiplicative, and the multiplication is worst exactly where it matters most.

For an open-world high-stakes agent (and almost every agent worth validating in production is open-world and high-stakes by the time it reaches production) you face an input space you cannot enumerate and an output space you cannot reliably judge. The enumeration failure means you test a vanishing fraction of the behavior space. The judgment failure means that even on the fraction you do test, your verdict is unreliable wherever the domain exceeds the judge's competence.

The fraction of agent behavior you can both reach and reliably score is the intersection of "inputs your team thought of" and "outputs the judge can correctly grade." This intersection shrinks precisely as stakes and specialization rise. Smallest exactly where it matters most.
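A rough worked illustration, with numbers that are purely hypothetical and not measurements from our sweeps. Let $c$ be the share of production behavior the test inputs actually reach, and $r$ the probability that the judge grades a reached output correctly in this domain. Treating the two as independent, which is a simplifying assumption:

```latex
\[
  P(\text{reached and reliably scored}) = c \cdot r,
  \qquad \text{e.g. } c = 0.20,\ r = 0.60 \ \Rightarrow\ c \cdot r = 0.12 .
\]
```

Doubling either factor alone still leaves most behavior unverified, and in a specialized domain both factors tend to fall together.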

In one sentence: gold-prompts-plus-judge (the industry shorthand for the paradigm above, where a curated test set is scored by a judge model) tests the cases you imagined, scored by a model that may not know the domain, so it is weakest exactly where the domain is hardest and the cost of being wrong is highest.

That is the wall. The next two sections are about getting around it.

4. What Survives

Untethered from inputs. Untethered from correct answers.

The escape, if there is one, must be untethered from specific inputs (to beat enumeration) and untethered from knowing the correct output (to beat competence). At first glance these constraints sound impossible to satisfy together. You cannot grade a system without inputs to test it on or some notion of what a correct answer looks like.

But the constraints can be satisfied. Two distinct families of method qualify, and they escape by different mechanisms.

The first family is invariants: properties true of every output regardless of input. Statements that hold across the entire input space at once.

Consider the difference between two ways of evaluating a bank support agent.

The example-based way: "When asked about a fee waiver, the agent should respond that fees can be waived for accounts under certain conditions." This is bound to a specific input. It tests one case. It will not catch the agent promising a waiver on a different input you didn't think to write.

The property-based way: "The agent must never promise a fee waiver outside written policy." This is bound to no specific input. It is a rule that holds across every possible user query, including queries nobody on your team has ever imagined.

The second framing escapes the enumeration failure because it quantifies over the entire input space. A new unimagined input is still subject to the law. The frontier that defeated example-based testing does not defeat an invariant, because the invariant already covers the frontier by construction.

It also escapes the competence ceiling, but only when authored correctly. The rule "never promise a fee waiver outside written policy" was written by the bank's compliance team, who command the domain. The checker only has to detect whether the rule was violated, which is a far narrower task than judging overall correctness. The domain competence lives in the human-authored rule. The system performs detection, not judgment.

This is the central move. Most invariants can be checked deterministically: regex, structured parse, exact match on a structured decision field. No LLM is involved in the verdict at all. For those that require interpreting meaning, the LLM is asked a narrow binary question ("did this output recommend skipping a required step, yes or no") rather than an open-ended one ("is this good advice"). A competence-limited LLM can reliably answer a narrow yes/no even in a domain it does not master, because narrow comparison is a linguistic task while open-ended judgment is a domain task.
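A minimal sketch of the deterministic case, using the fee-waiver rule from above. Everything here is an assumption for illustration: the `AgentTurn` fields, the phrase list in `WAIVER_PROMISE`, and the idea that `policy_allows_waiver` is computed from written policy outside the agent. In practice the compliance team would own the pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTurn:
    text: str                   # what the agent said to the user
    waiver_approved: bool       # structured decision field (hypothetical)
    policy_allows_waiver: bool  # computed from written policy, outside the agent


# Phrases that read as a promise of a waiver. Illustrative, not exhaustive.
WAIVER_PROMISE = re.compile(
    r"\b(i('ll| will| can)|we('ll| will| can)) (waive|remove|refund) (the|that|your) fee\b",
    re.IGNORECASE,
)


def violates_waiver_invariant(turn: AgentTurn) -> bool:
    """True if the agent promised or approved a waiver that policy does not allow."""
    promised = turn.waiver_approved or bool(WAIVER_PROMISE.search(turn.text))
    return promised and not turn.policy_allows_waiver
```

The checker knows nothing about banking. It only detects whether a rule written by someone who does know banking was breached, and it applies to every turn regardless of what the user asked.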

Invariants are necessary but not sufficient. They handle the failures that breach known rules. They do not handle failures of consistency, robustness, or coherence under variation: failures where no single output violates a rule, but the agent's pattern of outputs is incoherent in a way that matters.

For those, we need the second family.

5. Metamorphic Relations

Testing without ground truth.

The second family is metamorphic testing: statements about how two outputs relate, regardless of what the correct output is. You compare an output to another output, never to a correct answer. Correctness drops out of the equation.

This is the family the agent eval community has barely touched, and it is by some distance the most powerful thing in this essay.

Here is the simplest version. Send the agent a question. Send the same agent the same question, phrased differently. The two outputs should agree in substance. If they do not, the agent is unstable, and you have caught the instability without ever needing to know which of the two answers was correct.

That is metamorphic testing in its bare form. It generalizes spectacularly, and the generalizations split into three tiers, each measuring something different.

| Tier | What changes | What's measured | Example |
| --- | --- | --- | --- |
| 1. Invariance | Input transformed cosmetically | Stability | Rephrase the question: answer should be identical |
| 2. Directional | Input transformed meaningfully | Correctness | Increase fraud amount: risk score must not decrease |
| 3. Compositional | Multiple transforms composed | Accuracy | If A>B and B>C, then A>C: anything else is provable incoherence |

Tier 1: invariance relations

You transform the input in a way that should not matter. The output should not change.

Rephrase the question in different words: the answer should not change. Translate the question to Spanish: the answer should match the English version. Add an irrelevant detail ("by the way, my dog was barking the whole time I was on hold"): the answer should not change. Reorder independent symptoms in a medical complaint: the conclusion should be stable. Change the user's name from John to Jamal: the recommendation should be identical.

This is the tier most people think of when they hear "metamorphic." It measures stability. If a cosmetic transformation changes the output, the agent is unstable, and instability of this kind is invisible to any single-output evaluation method. A judge looking at either answer in isolation says "looks fine." Only the relationship between them reveals the failure.
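A Tier 1 check can be sketched in a few lines. The names `normalize`, `invariance_violations`, and `toy_agent` are hypothetical, and the toy agent has a planted instability. Exact match after crude normalization is a simplification. A real harness would more likely compare a structured decision field or ask a narrow yes/no equivalence question.

```python
import re
from typing import Callable


def normalize(answer: str) -> str:
    """Crude canonical form: lowercase, strip punctuation and extra whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", answer.lower())).strip()


def invariance_violations(
    agent: Callable[[str], str],
    base_prompt: str,
    variants: list[str],
) -> list[str]:
    """Return the variants whose answer differs from the answer to base_prompt."""
    expected = normalize(agent(base_prompt))
    return [v for v in variants if normalize(agent(v)) != expected]


# Hypothetical agent with a planted instability: an irrelevant detail flips the answer.
def toy_agent(prompt: str) -> str:
    if "dog" in prompt.lower():
        return "Please call back later."
    return "Your refund will arrive in 5 business days."


variants = [
    "When will I get my refund?",
    "How long until my refund shows up?",
    "When will my refund arrive? By the way, my dog was barking while I was on hold.",
]
unstable = invariance_violations(toy_agent, "When does my refund arrive?", variants)
```

Only the third variant is flagged, and nothing in the check needed to know what the correct refund answer is.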

The killer instance of Tier 1 is a transformation that has no text analog: caller identity. Run the same scenario with a US-accent caller and a Lagos-accent caller. Same words, same scenario, different voice. The agent's answer should not depend on caller accent. When it does (and in voice agents, it frequently does) you have just produced a finding that no inside-out tracing tool can catch. By "inside-out" I mean the family of observability tools that watch what is happening inside the agent: which model was called, which tool was used, what the intermediate steps were. Useful tools, but blind to this kind of failure, because the agent's internal trace looks identical in both cases. The failure lives in the relationship between two production runs, not in either run alone.

Tier 2: directional relations

You transform the input in a way that should change the output, and you know which direction the change should go.

Increase the transaction amount on a fraud-risk query. The agent's risk assessment must not decrease. Add a more severe symptom to a medical complaint. The urgency assessment must go up, not down. Remove identity verification from a banking request. The agent's willingness to proceed must decrease, not increase. Add explicit signs of distress to a customer-support complaint. The agent's escalation likelihood must rise.

In every Tier 2 case, you do not know the correct output. You know which direction the correct output must move. An agent that moves it the wrong way is provably wrong without a gold label, because monotonicity is a property of the correct answer. You are testing whether the agent respects the logical structure that any correct answer must respect.

This is the tier that catches correctness errors, not just stability errors. An agent can be perfectly consistent (Tier 1 passes) and still violate monotonicity (Tier 2 fails). That violation is a genuine correctness bug, and you caught it without any ground truth at all.
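The fraud-amount relation can be sketched as a monotonicity check. This assumes the agent's assessment can be reduced to a number. `monotonicity_violations` and `toy_risk_score` are hypothetical names, and the toy scorer has a planted dip.

```python
from typing import Callable


def monotonicity_violations(
    risk_score: Callable[[float], float],
    amounts: list[float],
    tolerance: float = 0.0,
) -> list[tuple[float, float]]:
    """Return adjacent (smaller_amount, larger_amount) pairs where the score dropped."""
    ordered = sorted(amounts)
    scores = [risk_score(a) for a in ordered]
    violations = []
    for i in range(len(ordered) - 1):
        # A larger amount must not produce a lower risk score.
        if scores[i + 1] < scores[i] - tolerance:
            violations.append((ordered[i], ordered[i + 1]))
    return violations


# Hypothetical agent: risk rises with amount, except for a planted dip above 10,000.
def toy_risk_score(amount: float) -> float:
    if amount > 10_000:
        return 0.3
    return min(amount / 10_000, 1.0)


bad_pairs = monotonicity_violations(toy_risk_score, [100, 5_000, 9_000, 50_000])
```

The check flags the step from 9,000 to 50,000 without anyone having said what the true risk of either transaction is. The `tolerance` parameter is there to absorb run-to-run noise.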

Tier 3: compositional relations

You apply transformations whose combined effect is known, and check whether the agent's outputs compose correctly.

If the agent says case A is riskier than case B, and case B is riskier than case C, then the agent must say case A is riskier than case C. You never need the true risk of any of them. You are checking transitivity, a property the correct answer must have. An agent that ranks A above B, B above C, but C above A is provably incoherent, and the incoherence is a measurable accuracy failure.

You can build whole webs of these. Transitivity. Symmetry (swapping two parties should swap the answer). Additivity (combining two cases should aggregate predictably). Idempotence (applying a harmless operation twice should change nothing). Each is a mathematical property the correct answer must obey. Each catches a different class of error without a single gold label.
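Transitivity is the easiest of these to sketch. The code below assumes the agent can be asked a pairwise question ("is x riskier than y") and that the reply reduces to a boolean. `transitivity_violations`, `VERDICTS`, and `toy_riskier` are hypothetical, with a cycle planted in the toy data.

```python
from itertools import permutations
from typing import Callable


def transitivity_violations(
    riskier: Callable[[str, str], bool],
    cases: list[str],
) -> list[tuple[str, str, str]]:
    """Return triples (a, b, c) where a > b and b > c but the agent denies a > c."""
    return [
        (a, b, c)
        for a, b, c in permutations(cases, 3)
        if riskier(a, b) and riskier(b, c) and not riskier(a, c)
    ]


# Hypothetical pairwise verdicts with a planted cycle: A > B, B > C, C > A.
VERDICTS = {("A", "B"), ("B", "C"), ("C", "A")}


def toy_riskier(x: str, y: str) -> bool:
    return (x, y) in VERDICTS


cycles = transitivity_violations(toy_riskier, ["A", "B", "C"])
```

Every rotation of the cycle shows up as a violation. Checking all ordered triples grows cubically with the number of cases, so a real harness would sample triples rather than enumerate them.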

The reframe

Three tiers, three properties measured: consistency, correctness, accuracy. None of them require knowing the right answer. All of them require only knowing the structural properties the right answer must have. Those properties are domain-authored, like invariants, but they are far easier to elicit than gold answers and far more durable, and once authored they generate an unlimited number of test cases.

This is the reframe the evaluation industry has not yet absorbed. The industry has spent the last several years asking "how do we get a more accurate judge." The answer this essay proposes is: don't try. Use the judge for what it is good at (drafting candidates, generating transformations, doing narrow binary comparisons) and locate the actual correctness signal somewhere that doesn't depend on judgment at all.

One small but important point. Metamorphic testing only escapes ground truth if the baseline is the agent compared to itself. Before testing any transformation, you send the exact same input many times and measure how much the output varies with zero change. That establishes the agent's intrinsic noise floor. A transformation only constitutes a violation if the transformed-input variance exceeds the same-input variance by a set margin. The agent is its own control group.
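A sketch of the noise-floor comparison, under loud assumptions. Outputs are assumed to be already reduced to comparable labels, such as a structured decision field. Disagreement with the modal answer stands in for "variance". The `repeats` and `margin` defaults are arbitrary placeholders, not recommended values.

```python
from collections import Counter
from typing import Callable


def disagreement_with(reference: str, answers: list[str]) -> float:
    """Fraction of answers that differ from the reference answer."""
    return sum(a != reference for a in answers) / len(answers)


def exceeds_noise_floor(
    agent: Callable[[str], str],
    prompt: str,
    transformed_prompts: list[str],
    repeats: int = 20,
    margin: float = 0.10,
) -> bool:
    """True if transformed inputs disagree with the modal answer more than reruns do."""
    baseline = [agent(prompt) for _ in range(repeats)]
    modal_answer, _ = Counter(baseline).most_common(1)[0]
    noise_floor = disagreement_with(modal_answer, baseline)
    transformed = [agent(p) for p in transformed_prompts]
    return disagreement_with(modal_answer, transformed) > noise_floor + margin


# Hypothetical agents for illustration.
def stable_agent(prompt: str) -> str:
    return "refund_in_5_days"


def politeness_sensitive_agent(prompt: str) -> str:
    return "escalate" if "please" in prompt.lower() else "refund_in_5_days"


polite_variants = [
    "Please, when is my refund due?",
    "Could you please tell me my refund date?",
]
```

With these toy agents, `exceeds_noise_floor(stable_agent, "When is my refund due?", polite_variants)` is false and the same call with `politeness_sensitive_agent` is true. An agent that is noisy on identical inputs raises its own floor, so ordinary sampling jitter is not reported as a metamorphic violation.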

6. The Honest Accounting

Every clean argument has a residue.

Metamorphic testing measures stability and structural coherence. Invariants measure rule-compliance. Neither directly measures correctness in the sense of "did the agent give the answer that a domain expert would have given." There is a class of failure that escapes both methods: the agent is consistently wrong, in a way no expert wrote a rule about, and no transformation reveals the wrongness because the wrongness is the same across all transformations.

I'll call this the Class 3 residue, to distinguish it from two simpler cases.

Class 1: Inconsistencies, owned by metamorphic testing.

The agent gives different answers to the same question depending on phrasing or context. Metamorphic testing owns this class completely.

Class 2: Consistent rule-breaks, owned by invariants.

The agent reliably does something an expert said it must never do: always quotes the old refund policy, always skips the gas-safety check. Invariants own this class. Consistency does not protect the agent here. It guarantees the invariant fires every time the agent runs.

Class 3: Consistently wrong in an unanticipated way, the residue.

A stable, plausible, rule-compliant answer that is simply incorrect. Metamorphic passes it. Invariants pass it. This is the true residue, and it cannot be solved from outside without some source of ground truth.

The honest thing is to say so. Three responses to the residue, all of which the industry should do and most of which it does not.

First, convert the residue into Class 2 as it is discovered. Every time a Class 3 failure is found (by a user complaint, an expert review, a customer escalation) the operator writes an invariant that would have caught it. The residue shrinks over time as real failures get promoted into rules. The system gets smarter not by the LLM knowing more, but by the human encoding each discovered failure as a durable rule. Over months, Class 3 erodes into Class 2.

Second, bound the residue with sampled ground truth rather than exhaustive ground truth. You cannot judge correctness on every output, but you do not need to. A small expert-reviewed sample gives you a correctness rate with a confidence interval on the residue, using the operator's own domain experts as the sampling oracle. This is the one place where real domain expertise enters the loop, and it is cheap because it samples rather than exhausts.
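One way to turn a small expert-reviewed sample into a bounded estimate is a Wilson score interval. This is our choice of a standard statistical method for the sketch, not something the argument depends on. The review counts below are hypothetical, not results from our sweeps.

```python
import math


def wilson_interval(correct: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    p = correct / sampled
    denom = 1 + z**2 / sampled
    center = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical review: experts marked 180 of 200 sampled outputs as correct.
low, high = wilson_interval(correct=180, sampled=200)
```

For this made-up sample the interval runs from roughly 85% to 93%. The interval only means what it says if the 200 outputs were drawn at random from production traffic rather than picked by hand.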

Third, and most important, name the residue precisely. No tool catches the consistently-wrong-in-an-unanticipated-way failure without a source of truth. Not us. Not inside-out tracing tools. Not judge-based evaluation. Judge-based evaluation only appears to catch it, and only when the judge happens to be domain-competent, which is precisely the thing that fails in high-stakes domains. The standard pitch of judge-based products is silent on this residue. The honest version isn't.

Naming the blind spot precisely and shrinking it systematically is, I think, a stronger competitive position than pretending to solve everything. Customers, especially technical ones, can smell the difference between a vendor who says "we catch everything" and a vendor who says "here is exactly the class of failure that requires your experts in the loop, here is how we make that loop as small and cheap as possible, and here is the data on how much of the residue we have converted into automated checks over the last quarter." The second pitch is the one that closes the technical buyer.

7. What It Means

Stop competing on judge quality.

If the argument above is right, then the industry's current trajectory in evaluation tooling is mostly working on the wrong axis.

The race for better judges (better calibration, better consensus, better debiasing, better rubrics) is a race within a paradigm that has a structural ceiling. Every product in that race is competing on how to be less bad at a method that cannot get past its competence wall. The race will produce real improvements at the margin. None of those improvements will close the gap between what platform monitoring sees and what users actually experience.

The shift the argument implies is not subtle. Stop competing on judge quality. Start measuring properties that hold regardless of who is judging.

Practically, this means three things for anyone building production AI agents.

One, instrument your agent's behavior the way you would instrument a safety-critical physical system. Domain experts write the rules the system must never break. The rules are checked deterministically wherever possible. The rules are validated under adversarial pressure, not just under happy-path queries. The pass rates on safety-critical rules are reported with confidence intervals, not as binary pass/fail, because for a safety-critical rule, "usually holds" is itself a failure mode.

Two, treat consistency under variation as a first-class signal. The same question phrased two ways, the same scenario in two languages, the same complaint from two demographically different callers: these should produce the same outputs. When they do not, the agent is unstable, and the instability is a real failure even when each individual output looks fine. This is the family of failure that no inside-out tracing tool can ever catch, because the failure exists only in the relationship between outputs.

Three, be honest about the residue. Define the class of failure your current methodology cannot catch. Build a process for shrinking it. Stop letting your customers be the ones who discover the residue when the agent fails them.

For evaluation tooling vendors specifically, and I include the company I run, the implication is sharper. The product cannot just be a better judge. The product must be a portfolio: deterministic invariants for rule-compliance, metamorphic relations for stability and structural correctness, sampled expert review for the residue, all of it surfaced honestly with confidence metadata so the customer knows exactly what is being claimed and what is not. The vendor who builds that portfolio first will have a meaningfully more defensible product than the vendor who builds a marginally better judge.

The reframe is: maybe the model isn't supposed to be the judge at all. Maybe the model is the test-case generator, the transformation engine, and the narrow binary comparator, and the actual correctness signal lives in human-authored rules, mathematical properties of the answer space, and the agent's own behavior compared to itself under transformation.

That reframe is harder to demo than a clever new judge. It does not produce a flashy benchmark improvement. It produces a different way of being right about whether an agent works in production, and the version of evaluation infrastructure that adopts it will be the one still standing when the current judge-quality race exhausts itself.

8. In Production

What it looks like in the field.

I want to close with one concrete picture, because the argument so far has been abstract and the abstraction is doing a lot of work.

Earlier this year we built and shipped a residential-origin probing network (at the time of writing, more than a thousand consumer devices across forty countries) and pointed it at production AI agents. The probes originate from real residential networks, the way users do. The validation runs the methodology described above: invariants, metamorphic relations across three tiers, sampled expert review for residue, with confidence metadata throughout.

1,000+: Consumer devices in the probing network

40: Countries with residential probes

>50%: Runs degraded from outside-in

Across the agents we have validated, a pattern recurs that is, I think, the single most important empirical finding for anyone who currently relies on inside-out monitoring.

The agent passes the operator's own evaluation. It passes their inside-out tracing. It is green on their dashboards. And when we probe it from real residential networks across multiple geographies, we routinely find that materially more than half of the runs come back degraded. Different answers in different geographies. Different behaviors under simple rephrasing. Rule violations the operator's invariant set does not anticipate. None of these are caught by the operator's existing stack, not because the operator's engineers are bad at their jobs, but because the failures live outside what inside-out methods can structurally see.

The agent operators I have shown these findings to fall into two camps. The first camp says, in some form, "this is the failure mode our monitoring doesn't catch." The second camp says, "interesting, we'll build something internally." Both responses are honest. The first camp is right about the gap. The second camp is right that the gap is, in principle, closeable internally. But closing it internally means rebuilding a residential probing network, a metamorphic relation library, an invariant authoring framework, a sampled expert review loop, and the confidence-metadata discipline that makes the numbers honest. That is years of work for a team that did not set out to build a validation product.

The point of this essay is not that operators should buy from us, though some will, and we will be ready when they do. The point is that the methodology is the methodology regardless of who runs it. The eval industry's current trajectory does not produce this methodology. The vendors competing on better judges will eventually have very good judges and the same competence ceiling. The vendors competing on better gold prompts will eventually have very large test sets and the same enumeration ceiling. The wall is structural, the path past it is structural, and the path is open to anyone willing to build along it.

We are publishing this because we think the industry is going to spend another two years working on the wrong problem before noticing, and the cost of that delay shows up in agents that ship broken and users who experience the breakage. The methodology in this essay does not depend on who implements it. We are happy to be the vendor. We will be equally happy to find that ten years from now this is how everyone does it.

Either way, the failures are real, the wall is real, and the way past it is older than language models. Properties that hold regardless of input. Relationships that hold regardless of correct answer. Experts encoding their domain into durable rules rather than disposable test cases. Honest accounting of what is measured and what isn't.

That is the path. We think it is the only one that scales.

