Evals, explained — textsplain

Textsplain

explained in texts

Wednesday, May 20, 2026 · 9:41 AM

my AI app got worse and i can’t tell why

classic evals moment.

i changed the prompt, it felt better in the demo, then users got weird answers

the demo lied politely.

rude but accurate

evals are the guardrail between “i tried a thing” and “users now live with this thing.”

in the AI engineering loop, production traces feed datasets, datasets feed experiments, experiments get evaluated, good changes ship, then production gives you new traces.

so evals are not leaderboard cosplay

right. for apps, evals are behavior regression tests.

what does one actually do

take real-ish cases, run your app on them, judge the outputs.

judge how, without lying to myself

start manually. read outputs. mark what’s good, what’s broken, what keeps repeating.

annoying. i wanted to automate the annoying part

you can, after you know what the annoying part is.

manual review is how you learn the failure modes before you build machines to detect them.

ok then the machines are what, code and LLM judges?

three buckets: human review, code checks, LLM-as-judge.

LLM-as-judge sounds like vibes with an API key

it can be. that’s why you calibrate it against human labels.

LLM judges are useful for meaning: relevance, tone, summary quality, following nuanced instructions.
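
calibration can start very small. rough sketch, assuming you already have pass/fail labels from a human and from the judge on the same examples. the function names and the sample labels here are made up for illustration:

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of examples where the judge's pass/fail matches the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(human_labels, judge_labels):
    """Indexes worth re-reading by hand: where judge and human disagree."""
    return [i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j]


# hypothetical labels: True = pass, False = fail
human = [True, False, True, True, False]
judge = [True, False, False, True, False]

print(agreement_rate(human, judge))  # 0.8
print(disagreements(human, judge))   # [2]
```

the disagreements are the interesting part. read those before trusting the number.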

and code checks?

schemas, required keywords, whether the SQL actually executes, length limits, forbidden fields. cheap, deterministic, boring in the best way.

but code can’t tell if the answer is actually helpful

exactly. code can say “5 bullets.” it can’t reliably say “not misleading.”
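
a minimal sketch of the mechanical side, assuming your app returns JSON with an "answer" field. the field names, the forbidden list, and the length limit are all placeholders, not anything from a real app:

```python
import json

FORBIDDEN_FIELDS = {"ssn", "internal_notes"}  # hypothetical field names
MAX_CHARS = 600                               # hypothetical length limit


def check_output(raw: str) -> dict:
    """Run cheap deterministic checks on one raw model output."""
    results = {
        "valid_json_object": False,
        "has_answer": False,
        "no_forbidden_fields": False,
        "within_length": len(raw) <= MAX_CHARS,
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return results  # not parseable, so the structural checks stay failed
    if not isinstance(data, dict):
        return results  # valid JSON, but not the object shape we expect
    results["valid_json_object"] = True
    answer = data.get("answer")
    results["has_answer"] = isinstance(answer, str) and bool(answer.strip())
    results["no_forbidden_fields"] = FORBIDDEN_FIELDS.isdisjoint(data.keys())
    return results


print(check_output('{"answer": "Refunds take 5 business days."}'))
```

every check is its own pass/fail, so when something breaks you know which rule broke.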

what about perfect expected answers? do i need those?

sometimes. reference-based evals compare against an expected answer.

reference-free evals judge the output on its own, which helps with messy production traces.
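
the difference in code, very roughly. both functions are toy stand-ins i made up: a real reference-based check might be fuzzier, and a real reference-free check is often an LLM judge, not a word count:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't fail the check."""
    return " ".join(text.lower().split())


def reference_based(output: str, expected: str) -> bool:
    """Needs an expected answer: pass if the output matches it after normalizing."""
    return normalize(output) == normalize(expected)


def reference_free(output: str, max_words: int = 50) -> bool:
    """Needs no expected answer: pass if the output is non-empty and within a word limit."""
    words = output.split()
    return 0 < len(words) <= max_words


print(reference_based("  Paris ", "paris"))   # True
print(reference_free("a short, direct answer"))  # True
```

the second kind is the one you can point at a raw production trace, because nobody wrote an expected answer for it.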

so production traces become eval data?

that’s the loop. trace weird behavior, turn it into dataset rows, test experiments against it before shipping.
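
one way that step could look, assuming a trace is a dict with "id", "input", and "output" keys. that shape is an assumption for the example, so swap in whatever your tracing tool actually stores:

```python
import json


def trace_to_row(trace: dict, failure_label: str) -> dict:
    """Keep only what's needed to replay the case and remember why it was saved."""
    return {
        "input": trace["input"],
        "bad_output": trace["output"],
        "failure_label": failure_label,
        "source_trace_id": trace["id"],
    }


def to_jsonl(rows: list[dict]) -> str:
    """One JSON object per line, a common plain format for eval datasets."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# hypothetical trace
trace = {"id": "t1", "input": "can i get a refund?", "output": "no idea", "latency_ms": 420}
print(to_jsonl([trace_to_row(trace, "missed policy")]))
```

saving the bad output next to the input means you can later check that a new version stopped doing the bad thing.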

when do i create an automated evaluator?

when a failure mode repeats and you need to test it over and over.

if one prompt fix solves it, just fix the prompt. don’t build a shrine.

what makes an evaluator not useless

specific criteria. “helpfulness 1-5” is mush. “answer cites the refund window correctly: pass/fail” is useful.

binary pass/fail often beats vague graded scales.
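
here's what a specific pass/fail check could look like for the refund example. the 30-day value is invented for the sketch, and the regex is deliberately crude: it fails any answer that mentions a different number of days, even for something unrelated like shipping:

```python
import re

REFUND_WINDOW_DAYS = 30  # hypothetical policy value


def cites_refund_window(answer: str) -> bool:
    """Pass only if the answer mentions a day count and every mention equals the policy value."""
    mentions = [int(n) for n in re.findall(r"(\d+)[\s-]*days?\b", answer.lower())]
    return bool(mentions) and all(n == REFUND_WINDOW_DAYS for n in mentions)


print(cites_refund_window("You can return items within 30 days."))  # True
print(cites_refund_window("Refunds are available for 14 days."))    # False
print(cites_refund_window("We offer refunds."))                     # False
```

one question, one answer. nobody has to argue about what a 3 out of 5 means.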

what's the do-it-tomorrow version?

pull 30-50 real examples. run old and new versions side by side. read them.

write failure labels in plain English: missed policy, too verbose, invented SQL column, wrong tone.
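
once the labels exist, comparing versions is mostly counting. a sketch, assuming each reviewed example gets a list of labels and an empty list means the output was fine. the review data below is invented:

```python
from collections import Counter


def label_counts(reviews):
    """reviews: one list of failure labels per example; [] means nothing was wrong."""
    return Counter(label for labels in reviews for label in labels)


def compare(old_reviews, new_reviews):
    """Per-label change (new minus old). Positive numbers are regressions."""
    old, new = label_counts(old_reviews), label_counts(new_reviews)
    return {label: new[label] - old[label] for label in sorted(set(old) | set(new))}


# hypothetical reviews of the same three examples under each version
old = [["too verbose"], [], ["missed policy"]]
new = [[], ["invented SQL column"], ["missed policy"]]

print(compare(old, new))
# {'invented SQL column': 1, 'missed policy': 0, 'too verbose': -1}
```

a label that shows up only in the new column is the regression the demo didn't show you.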

then automate only those labels that keep showing up

yep. code for mechanical rules, LLM judge for meaning, calibrate both against human review.

and after shipping?

keep watching production. new failures become the next dataset.

less “make model smart,” more “catch dumb regressions before users do”

that’s evals.

Read Wed, May 20 · 10:04 AM