Monday, May 18, 2026 · 9:41 AM
ok dumb question: what is ECHO actually doing?
it’s a training trick for terminal agents
normal agent RL mostly grades the commands the model writes
ECHO also makes the model learn from what the terminal says back
like stdout and stack traces?
yep. stdout, stderr, file listings, exit codes, pytest failures, logs, the whole terminal “vibe”
the paper’s slogan is basically: the terminal is already teaching you, stop muting it during training
wait but the agent already sees terminal output during rollouts right?
it sees it as context
but standard GRPO-style training masks those observation tokens out of the loss
so the model uses the traceback to decide the next command, but it’s never directly trained to predict the traceback itself
that feels… wasteful?
😮 that’s the whole punchline
you already paid for the rollout. every command produced free supervision from the environment
ECHO says: don’t throw that supervision away
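rough sketch of the masking difference, if it helps. the role tags and the toy rollout are made up for illustration, this is not the paper’s code:

```python
# Hypothetical rollout layout: each token is tagged with who produced it.
# "prompt" = task text, "action" = model-written command, "obs" = terminal output.
roles = ["prompt", "prompt", "action", "action", "obs", "obs", "obs", "action", "obs"]

def loss_mask(roles, train_on):
    """Return 1.0 for tokens whose role is in train_on, else 0.0."""
    return [1.0 if r in train_on else 0.0 for r in roles]

# Standard setup as described above: only action tokens contribute to the loss.
action_mask = loss_mask(roles, {"action"})

# ECHO-style addition: a second mask that selects the terminal's replies.
obs_mask = loss_mask(roles, {"obs"})

print(action_mask)  # [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(obs_mask)     # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
```

same rollout, same tokens. the only change is which positions are allowed to produce gradient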
give me the analogy
imagine teaching someone to use a woodshop
normal RL is like only scoring the final chair: stable or busted
ECHO also quizzes them on the saw noise, the burn smell, the wobble in the cut, and the “uh oh” sound when the drill slips
so the model learns the physics of the little terminal world
exactly. not real physics, but terminal dynamics
if i run pytest after changing this file, what kind of failure should appear?
if i list this directory, what files probably exist?
if the exit code is nonzero, what clues should stderr contain?
what’s the actual objective?
roughly:
L_ECHO = L_GRPO(actions) + λ · L_env(observations)
keep the normal RL loss on action tokens, then add cross-entropy on the terminal response tokens
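toy version in plain python so you can see the two terms. big caveats: the policy term here is a stripped-down policy-gradient surrogate (no importance ratio, no clipping, no KL), the function names are mine, and lam=0.1 is a placeholder, not a value from the paper:

```python
import math

def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-6) for r in rewards]

def masked_mean(values, mask):
    """Average of values over positions where mask is 1."""
    total = sum(v * m for v, m in zip(values, mask))
    return total / max(sum(mask), 1.0)

def echo_loss(logprobs, action_mask, obs_mask, advantage, lam):
    """Simplified per-rollout loss: policy term on actions + lam * CE on observations."""
    # Policy term: -advantage * log p(token), averaged over action tokens.
    policy = masked_mean([-advantage * lp for lp in logprobs], action_mask)
    # Environment term: cross-entropy, i.e. -log p(token) on terminal-output tokens.
    env = masked_mean([-lp for lp in logprobs], obs_mask)
    return policy + lam * env, policy, env

# Made-up per-token log-probs for one rollout: two action tokens, two obs tokens.
logprobs = [-0.5, -1.0, -2.0, -0.1]
action_mask = [1.0, 1.0, 0.0, 0.0]
obs_mask = [0.0, 0.0, 1.0, 1.0]

# Verifier rewards for a group of four rollouts; this rollout is the first one.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])[0]
total, policy, env = echo_loss(logprobs, action_mask, obs_mask, adv, lam=0.1)
print(total, policy, env)
```

the reward only ever touches the action tokens. the observation term has no reward in it at all, it’s just next-token prediction on what the terminal printed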
so it’s RL plus “predict what the terminal will say”
yep. hybrid objective
the verifier still rewards solving the task, but the observation loss teaches the model the texture of cause and effect along the way
does that actually move benchmark numbers or is this just elegant?
reported numbers look meaningful
it improved Qwen3-8B, OpenThinker-Agent-v1-SFT, and Qwen3-14B across the tested benchmarks
and it reached the same performance up to 2.3× faster in training
terminalbench?
TerminalBench-2.0 pass@1 nearly doubled in the reported setup
8B: 2.7 → 5.2
14B: 5.2 → 10.8
still low absolute numbers, but doubling is doubling
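sanity check on “doubling”, using only the four numbers above (the helper name is just mine):

```python
# pass@1 numbers quoted above (baseline -> with ECHO), in percent.
results = {"8B": (2.7, 5.2), "14B": (5.2, 10.8)}

def summarize(before, after):
    """Return (ratio, absolute gain in percentage points)."""
    return after / before, after - before

for name, (before, after) in results.items():
    ratio, gain = summarize(before, after)
    print(f"{name}: {ratio:.2f}x, +{gain:.1f} pp")
# 8B: 1.93x, +2.5 pp
# 14B: 2.08x, +5.6 pp
```

so a hair under 2× for 8B and a hair over for 14B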
right. terminal agents are hard mode
the interesting bit is the direction: learning observations seems to make action learning more sample-efficient
how do they know it’s learning terminal dynamics and not just getting lucky?
they track environment-token cross-entropy
with ECHO, that drops sharply
under plain GRPO, it barely moves
so the model literally gets better at predicting terminal output
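the metric itself is simple. minimal sketch, assuming you have per-token log-probs and an observation mask; the numbers are invented to show the shape of the curve, not real results:

```python
def env_token_cross_entropy(logprobs, obs_mask):
    """Mean negative log-likelihood (nats per token) over terminal-output tokens only."""
    picked = [-lp for lp, m in zip(logprobs, obs_mask) if m]
    return sum(picked) / len(picked) if picked else float("nan")

# Made-up log-probs for the same rollout at two checkpoints (illustration only).
obs_mask = [0, 0, 1, 1, 1]
early = [-0.3, -0.2, -2.4, -1.8, -3.0]
later = [-0.3, -0.2, -0.6, -0.9, -0.3]

print(env_token_cross_entropy(early, obs_mask))
print(env_token_cross_entropy(later, obs_mask))
```

lower means the model is less surprised by what the terminal prints. action tokens are excluded, so this tracks the environment side only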
yep, and that matters because terminal output is the agent’s feedback channel
better world-model-ish prediction → less flailing → better next actions
what’s the weirdest result?
🤯 the environment-only version
after RL, they tried a version with no verifier reward, just the observation-learning part
and it still improved in some settings
wait no reward for solving tasks?
that’s the counterintuitive part
it got +3.8 pp on val100, +5.2 pp on ITD, and +10.0 pp on PyTerm after filtering clean tool-call trajectories
not a universal replacement for RL, but a sign that modeling the environment itself carries useful signal
this sounds like SFT without experts?
kind of adjacent
they report ECHO recovering a lot of the benefit of expert SFT without imitation: 104% of the SFT gain on ITD, 89% on Terminal Bench Lite, 50% on TerminalBench-2.0
because the expert is… bash?
lol basically
the terminal is a brutally honest tutor. it does not explain politely, but it always gives consequences
what should agent builders take from this?
if your agent acts in a text world, don’t treat the world’s replies as disposable logs
train on them when they’re causal, grounded, and already in the rollout
shells are the obvious case, but the same idea could matter for browsers, APIs, games, debuggers, maybe even multi-agent chats
so “the environment speaks, make the model listen”
perfect
and maybe make it predict the reply before charging ahead like a caffeinated intern with sudo
noted. sudo intern goes to therapy
growth mindset for shell scripts. ttyl
Read Mon, May 18 · 9:59 AM