ECHO: the terminal is the teacher

Textsplain

explained in texts

Monday, May 18, 2026 · 9:41 AM

ok imagine an agent runs pytest, gets a wall of red, then edits the wrong file

this is not hypothetical. this is my week

yep. the agent acted, the terminal answered, and training mostly treated the answer like background noise

wait, doesn’t RL train from the whole interaction?

standard terminal-agent RL usually trains on the action tokens: commands, edits, tool calls.

terminal output tokens often get masked out of the loss.

masked out as in ignored?

ignored for learning, yeah. stdout, stderr, stack traces, exit codes, test output, logs.
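
here's a tiny sketch of what that masking looks like. the transcript, the role names, and the `loss_mask` helper are all made up for illustration, not taken from any real training stack:

```python
# Hypothetical transcript: (role, tokens) segments. All names are illustrative.
transcript = [
    ("action", ["pytest", "-q"]),
    ("observation", ["E", "ImportError", "exit", "1"]),
    ("action", ["edit", "utils.py"]),
]

def loss_mask(segments, train_on_observations=False):
    """Return one 0/1 flag per token; 1 means the token contributes to the loss."""
    mask = []
    for role, tokens in segments:
        keep = role == "action" or train_on_observations
        mask.extend([1 if keep else 0] * len(tokens))
    return mask

# Action-only training: the four terminal-output tokens get zeros.
standard_mask = loss_mask(transcript)

# Letting observations through flips those zeros to ones.
unmasked = loss_mask(transcript, train_on_observations=True)
```

the zeros in `standard_mask` are the tutor talking while nobody listens.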

that feels insane. that is where the useful information is

that’s the ECHO complaint.

the terminal is a brutally honest tutor. it says missing import, wrong path, expected 4 got 5, process exited 1.

and normal training covers the student’s ears whenever the tutor talks

basically.

so ECHO just unmasks terminal output?

close. it adds a language-modeling loss on the environment responses.

the short formula is L_ECHO = L_GRPO(actions) + λ · L_env(observations).
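
rough sketch of how those two terms could combine. assumptions on my side: L_env is a mean negative log-likelihood over observation tokens, the GRPO term is just a number handed in from elsewhere, and `lam=0.1` is a placeholder, not a reported value:

```python
import math

def env_lm_loss(token_probs, roles):
    """Mean negative log-likelihood over observation tokens only (assumed form of L_env)."""
    nll = [-math.log(p) for p, r in zip(token_probs, roles) if r == "observation"]
    return sum(nll) / len(nll) if nll else 0.0

def echo_loss(grpo_loss, token_probs, roles, lam=0.1):
    """L_ECHO = L_GRPO(actions) + lam * L_env(observations)."""
    return grpo_loss + lam * env_lm_loss(token_probs, roles)

# Toy numbers: the probability the model assigned to each token that actually appeared.
roles = ["action", "observation", "observation", "action"]
probs = [0.9, 0.5, 0.25, 0.8]

total = echo_loss(grpo_loss=1.0, token_probs=probs, roles=roles, lam=0.1)
```

only the two observation tokens feed the second term. the action tokens are already covered by the GRPO term.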

so it learns good moves and also what the terminal says back

right. if i run this command, what files appear? if this import is wrong, what traceback happens? if tests pass, what does success look like?

why does predicting an error message help fix bugs though

because it forces the model to learn cause and effect in the terminal world.

the output stops being disposable logs and becomes feedback.

ok, did it actually help or is this elegant paper energy

reported results improved across Qwen3-8B, OpenThinker-Agent-v1-SFT, and Qwen3-14B.

they also report up to about 2.3× faster training to the same performance.

benchmark numbers always make me suspicious

good instinct. the more interesting diagnostic is environment-token cross-entropy.

translation?

with ECHO, the model got better at predicting terminal responses. plain GRPO barely moved that needle.
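
if you want to track that diagnostic yourself, one way is to split cross-entropy by token role. this is a sketch with invented probabilities, and `cross_entropy_by_role` is a hypothetical helper, not something from the ECHO code:

```python
import math
from collections import defaultdict

def cross_entropy_by_role(token_probs, roles):
    """Average negative log-probability of the observed tokens, grouped by role."""
    sums, counts = defaultdict(float), defaultdict(int)
    for p, r in zip(token_probs, roles):
        sums[r] += -math.log(p)
        counts[r] += 1
    return {r: sums[r] / counts[r] for r in sums}

# Toy example: one action token, two terminal-output tokens.
roles = ["action", "observation", "observation"]
report = cross_entropy_by_role([0.5, 0.25, 0.25], roles)
```

the "observation" entry is the needle in question: lower means the model is less surprised by what the terminal says back.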

so GRPO learns buttons. ECHO learns what the machine says when you press them

exactly.

but isn’t verifier reward the real teacher? pass or fail, done

that’s the weird part. an environment-only version, without verifier reward, still improved in some settings.

wait what

not a replacement for task reward. but it suggests the terminal transcript already carries learning signal on its own.

because a stack trace is the world explaining what your action broke

yes. rude explanation, but useful.

so broader thesis: if an agent acts and the world replies in tokens, train on the replies too

that’s the clean version.

terminal today, maybe browsers, APIs, simulators, games, debuggers tomorrow.

builder takeaway?

don’t throw away observations just because they aren’t actions.

logs, errors, test output, exit codes, file diffs: that’s supervision hiding in plain sight.
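
minimal sketch of "don't throw it away" on the data-collection side, using Python's standard `subprocess` module. the record layout and the `run_and_record` name are just an illustration:

```python
import subprocess
import sys

def run_and_record(argv):
    """Run a command and keep everything the terminal says back."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return {
        "action": argv,                  # what the agent did
        "stdout": result.stdout,         # what the world printed
        "stderr": result.stderr,         # how it complained
        "exit_code": result.returncode,  # the one-number verdict
    }

# A deliberately broken command: importing a module that does not exist.
record = run_and_record([sys.executable, "-c", "import module_that_does_not_exist"])
```

the traceback lands in `record["stderr"]` next to the action that caused it, so it's there if you later decide to train on it.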

less heroic agent, more agent that reads the damn error

yep. make the terminal a tutor, not a mute scoreboard.

Read Mon, May 18 · 9:59 AM