Wednesday, May 20, 2026 · 9:41 AM
ok dumb question: what are “evals” in AI apps besides everyone saying the word evals a lot
lol fair
an eval is how you decide whether an AI system’s output is good enough for the thing you’re trying to ship
not “did the code run,” but “did the assistant actually answer correctly, safely, in the right format, for this use case?”
so like QA, but for vibes?
QA, but the product is a slippery little language goblin
the article frames evals inside the AI Engineering Loop: production traces + monitoring feed datasets, datasets feed experiments, experiments get evaluated, then you ship, then shipping gives you more production data
it’s not a one-time test suite. it’s the loop that keeps the system improving
where exactly does evaluation sit in that loop?
offline eval is the checkpoint between “we ran an experiment” and “we ship this change”
you have a dataset, you run the app against it, then you judge whether the outputs are good
think airport security for model changes: bags go through before the plane takes off
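rough sketch of that checkpoint in python, if it helps. not from the article: `run_offline_eval`, `toy_app`, and `exact_match` are names i made up, and the two-row dataset is a toy

```python
def run_offline_eval(dataset, app, evaluator):
    """Run the app on every dataset item and record a pass/fail per item."""
    results = []
    for item in dataset:
        output = app(item["input"])
        results.append({
            "input": item["input"],
            "output": output,
            "passed": evaluator(item, output),
        })
    # Share of items that passed; 0.0 for an empty dataset.
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results


# Toy stand-ins (hypothetical): a real app would call your model here.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]

def toy_app(question):
    return "4"  # always answers "4", so it fails the second item

def exact_match(item, output):
    return output.strip() == item["expected"]

pass_rate, results = run_offline_eval(dataset, toy_app, exact_match)
print(pass_rate)  # 0.5
```

the per-item `results` are the part you actually read. the pass rate is just the summary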
ok so first step is build automated evaluators?
😅 that’s the trap
the article is very explicit: start by manually reviewing outputs
you need to build intuition for what good and bad look like in your actual app before you automate anything
wait isn’t manual review the thing you do before you have your act together?
counterintuitive part: manual review is not the kiddie pool
it’s how you discover the failure modes worth caring about
if you skip it, you often end up measuring things that don’t matter
like making a perfect thermometer for soup saltiness when the issue is actually “there’s a sock in the soup”
exactly. tragic soup, excellent eval analogy
manual review gives you the labels and examples that later become ground truth for checking your automated evals
and in production, humans should still review samples to catch new failure modes and keep evaluators calibrated
so what are the actual evaluation methods?
the article splits them into three buckets: manual, code-based, and LLM-as-a-judge
manual = humans read outputs and score or write notes
code = deterministic checks
LLM judge = another model scores language-y qualities
code-based sounds safest
it’s safest when the thing is objective
valid JSON? schema followed? length under limit? banned keyword absent? generated SQL executes?
code checks are fast, cheap, and repeatable
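here’s roughly what a few of those look like, stdlib only. the function name, the `answer` key, the 500-char limit, and the banned word are all assumptions i picked for the example

```python
import json

def check_output(raw, required_keys=("answer",), max_chars=500, banned=("guarantee",)):
    """Return a dict of check name -> bool for one raw model output."""
    checks = {"valid_json": False, "has_required_keys": False}
    try:
        parsed = json.loads(raw)
        checks["valid_json"] = True
        # A very light "schema" check: top level is an object with the keys we need.
        checks["has_required_keys"] = isinstance(parsed, dict) and all(
            key in parsed for key in required_keys
        )
    except json.JSONDecodeError:
        pass  # both JSON-related checks stay False
    checks["under_length_limit"] = len(raw) <= max_chars
    lowered = raw.lower()
    checks["no_banned_keywords"] = not any(word in lowered for word in banned)
    return checks


print(check_output('{"answer": "Refunds take 5 business days."}'))
print(check_output("sure! here is your refund info"))
```

each check gets its own result, so when something fails you can see which rule broke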
but they can’t tell if the refund explanation is actually right
yep. code can check that the word “refund” appears
it cannot understand whether the policy was explained correctly
that’s where LLM judges become useful
LLM judging LLM feels cursed though
🤯 healthy suspicion
LLM judges are for language qualities: relevance, tone, summary completeness, audience fit, stuff like that
but they are imperfect. they need calibration against human preferences, and they can share blind spots with the app model
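one simple way to do that calibration, as a sketch: take outputs humans already labeled pass/fail, run the judge on the same ones, and count where they disagree. `judge_agreement` and the labels below are made up for illustration

```python
def judge_agreement(human_labels, judge_labels):
    """Compare judge pass/fail verdicts with human labels on the same outputs."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    pairs = list(zip(human_labels, judge_labels))
    agree = sum(h == j for h, j in pairs)
    return {
        "agreement": agree / len(pairs),
        # Judge passed something a human failed: bad outputs would slip through.
        "false_pass": sum(j and not h for h, j in pairs),
        # Judge failed something a human passed: good outputs get blocked.
        "false_fail": sum(h and not j for h, j in pairs),
    }


human = [True, True, False, False, True]   # hypothetical human review labels
judge = [True, False, False, True, True]   # hypothetical judge verdicts
print(judge_agreement(human, judge))
# {'agreement': 0.6, 'false_pass': 1, 'false_fail': 1}
```

the disagreeing examples are worth reading one by one. they usually tell you whether to fix the judge prompt or the criteria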
so don’t just say “judge this response 1-5 for helpfulness” and call it science?
please do not summon the Helpful-O-Meter 3000
the article says vague criteria like “helpfulness” or “quality” usually give vague signal
you want precise definitions of good and bad for your app
what’s better than a 1–5 score?
often: binary pass/fail
it forces you to define the line between acceptable and unacceptable
a 3 vs 4 sounds precise, but usually hides disagreement and drift over time
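sketch of a pass/fail judge wrapper, with big caveats: `call_judge` is a placeholder for whatever model client you use (no real API here), and the refund criterion is a policy i invented for the example

```python
def build_judge_prompt(criterion, response):
    """Build a prompt that asks for a single PASS/FAIL verdict on one criterion."""
    return (
        "You are grading one response from a support assistant.\n"
        f"Criterion: {criterion}\n"
        f"Response:\n{response}\n"
        "Reply with exactly one word: PASS or FAIL."
    )

def parse_verdict(judge_reply):
    """Turn the judge's reply into a bool; refuse anything that is not a clear verdict."""
    verdict = judge_reply.strip().upper()
    if verdict not in ("PASS", "FAIL"):
        raise ValueError(f"unexpected judge reply: {judge_reply!r}")
    return verdict == "PASS"

def judge_response(criterion, response, call_judge):
    """call_judge is any function that takes a prompt string and returns text."""
    return parse_verdict(call_judge(build_judge_prompt(criterion, response)))


# Stub judge so the sketch runs without a model; swap in a real call yourself.
def fake_judge(prompt):
    return " pass\n"

criterion = "States that refunds are available within 30 days of purchase."
print(judge_response(criterion, "You can get a refund within 30 days.", fake_judge))  # True
```

raising on a fuzzy reply is a choice: better a loud error than silently counting “it depends” as a fail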
what about reference-based vs reference-free? i see those terms everywhere
reference-based means you compare against a known answer or golden response
reference-free means you judge the output on its own
reference-free is especially useful for unseen production data, because real users rarely arrive with golden answers attached
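tiny side-by-side, assuming the simplest possible version of each. both function names and the 50-word limit are mine, not the article’s

```python
def matches_reference(output, reference):
    """Reference-based: compare against a known golden answer (case/whitespace-insensitive)."""
    def normalize(text):
        return " ".join(text.lower().split())
    return normalize(output) == normalize(reference)

def within_word_limit(output, max_words=50):
    """Reference-free: judge the output on its own, no golden answer needed."""
    return len(output.split()) <= max_words


print(matches_reference("  Paris ", "paris"))              # True
print(within_word_limit("Paris is the capital of France."))  # True
```

the first one only works where you have a golden answer. the second can run on any production output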
when should i actually create an evaluator?
ask: is this a one-time fix, or a generalization problem?
if a prompt tweak fixes a weird isolated case, just fix it
if it’s a recurring failure mode you need to test across many inputs or over time, build an evaluator
so each quality gets its own evaluator?
each quality you care about gets its own evaluator
mature setups combine all three: humans, code, and LLM judges
the trick is matching the method to the quality, not making one giant mystical score
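in code that can be as plain as a dict of named checks that you report separately. sketch only: the two evaluators here are hypothetical code checks, and in a real setup some entries would be LLM judges or human labels

```python
def evaluate(output, evaluators):
    """Run every named evaluator and keep the results separate, one per quality."""
    return {name: check(output) for name, check in evaluators.items()}


# Hypothetical qualities for a support bot.
evaluators = {
    "is_concise": lambda text: len(text) <= 200,
    "mentions_return_window": lambda text: "30 days" in text,
}

report = evaluate("You can return items any time you like!", evaluators)
print(report)  # {'is_concise': True, 'mentions_return_window': False}
```

no averaging into one number on purpose. the report says which quality failed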
what’s the monday morning version?
1. review outputs manually
2. write down specific failure modes
3. define pass/fail criteria clearly
4. automate only the checks you’ll need repeatedly
and after shipping?
the loop starts again
new traces, monitoring signals, user feedback, production-safe checks
if prod behaves differently than offline evals predicted, capture those cases, add them to datasets, and run the next experiment
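the “capture those cases” step, sketched. the trace shape (`input`, `output`, `flagged`) is an assumption for the example, not a real tracing format

```python
def add_failures_to_dataset(dataset, traces):
    """Append flagged production traces to the eval dataset, skipping inputs already covered."""
    seen = {item["input"] for item in dataset}
    added = 0
    for trace in traces:
        if trace["flagged"] and trace["input"] not in seen:
            dataset.append({
                "input": trace["input"],
                "bad_output": trace["output"],  # keep the failure as an example of "bad"
                "source": "production",
            })
            seen.add(trace["input"])
            added += 1
    return added


dataset = [{"input": "how do refunds work?"}]
traces = [
    {"input": "how do refunds work?", "output": "no idea", "flagged": True},   # already covered
    {"input": "can i return a gift?", "output": "gifts are food", "flagged": True},
    {"input": "what are your hours?", "output": "9 to 5", "flagged": False},
]
print(add_failures_to_dataset(dataset, traces))  # 1
```

next experiment runs against the bigger dataset, so that failure gets tested every time from then on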
tl;dr?
🔥 don’t automate confusion
start with human eyeballs, name the recurring failures, then build targeted evaluators
evals are not the scoreboard at the end. they’re the steering wheel for the AI engineering loop
steering wheel > mystical quality blob, got it
precisely. go forth and make fewer haunted soup detectors
Read Wed, May 20 · 10:04 AM