On-policy distillation: teaching a model inside its own mess
Regular distillation shows the student the teacher's clean answer. On-policy distillation lets the student act first, then teaches it inside the state it actually created.
Large models have a strange training problem. During training, the world is clean. At deployment, the world is whatever the model just wrote.
The dataset does not usually contain the half-broken sentence the model produced three seconds ago, the wrong file path it invented, or the tool call it botched before trying to recover. But real generation works exactly like that. A language model moves forward one token at a time. If it drifts early, the next step happens inside the drift.
On-policy distillation is about training for that world. Not just the teacher's perfect answer to a clean prompt, but the teacher's behavior when the student has already created the state it now has to survive.
Start with the words
Distillation means using a stronger model as a teacher for a smaller or cheaper student model. The student is not only learning facts. It is learning style, judgment, corrections, how the teacher spreads probability across possible next tokens, and the habits that make the teacher useful.
Policy comes from reinforcement learning. For a language model, it roughly means the model's current way of choosing what to write next.
On-policy means the data comes from the current student model's own behavior. The student does not only learn from a fixed pile of teacher-written answers. It first tries the task itself. Then the teacher looks at the student's draft, mistake, partial solution, or trajectory and teaches from there.
Regular distillation is answer-copying
A basic distillation loop is simple:
- Give the teacher a prompt.
- The teacher writes a good answer.
- The student learns to imitate that answer.
This works. It is stable, cheap enough to scale, and easy to run offline. Many smaller models learn a surprising amount this way.
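The loop above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `Model`, `build_offline_dataset`, and `toy_teacher` are made-up names, and a "model" here is just a Python function from prompt to text.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" is just a function from prompt to text.
Model = Callable[[str], str]


def build_offline_dataset(teacher: Model, prompts: List[str]) -> List[Tuple[str, str]]:
    """Collect (prompt, teacher_answer) pairs once, before student training."""
    return [(prompt, teacher(prompt)) for prompt in prompts]


def toy_teacher(prompt: str) -> str:
    # Placeholder for a call to a strong model.
    return f"reference answer for: {prompt}"


dataset = build_offline_dataset(toy_teacher, ["sort a list", "parse a date"])
# A student would then be fine-tuned on `dataset` with ordinary supervised loss.
# Note that the student plays no part in data collection.
```

The dataset is fixed the moment the teacher finishes writing. Nothing the student does later can change what is in it.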
But it has a blind spot. The student sees how the teacher behaves on clean inputs. It does not learn how the teacher would recover from the student's own bad intermediate states.
It is like learning to drive by watching an instructor park perfectly in an empty lot. Useful, yes. But on the road, your car is already crooked, someone is honking, and a scooter has slipped into the gap. What you need then is not the ideal move. You need the recovery move.
On-policy distillation is draft correction
The on-policy version feels more like actual tutoring:
- Give the student a task.
- Let the student generate an answer, a partial answer, or a full trajectory.
- Show that state to the teacher.
- The teacher improves it, scores it, chooses between alternatives, or explains where it went wrong.
- The student trains on the teacher's behavior in that student-created state.
The key detail is that the state comes from the student.
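As a rough sketch, one round of data collection might look like this. The `Student` and `Teacher` interfaces, `OnPolicyExample`, and the toy functions are all assumptions made for illustration, not a real library API. The actual training step is left out.

```python
from typing import Callable, List, NamedTuple

# Hypothetical interfaces, not a real library API.
Student = Callable[[str], str]        # task -> draft
Teacher = Callable[[str, str], str]   # (task, student draft) -> correction


class OnPolicyExample(NamedTuple):
    task: str
    student_draft: str
    teacher_target: str


def collect_round(student: Student, teacher: Teacher, tasks: List[str]) -> List[OnPolicyExample]:
    """Gather one round of training data from the student's own outputs."""
    examples = []
    for task in tasks:
        draft = student(task)           # the student acts first
        target = teacher(task, draft)   # the teacher responds to that draft
        examples.append(OnPolicyExample(task, draft, target))
    return examples


def toy_student(task: str) -> str:
    return "draft of " + task           # placeholder for sampling from the student


def toy_teacher(task: str, draft: str) -> str:
    return f"fix for [{draft}]"         # placeholder for a strong model's correction


batch = collect_round(toy_student, toy_teacher, ["write a mean function"])
```

The teacher's signature is the point: it takes the student's draft as an input, so every target is conditioned on a state the student produced. After the student is updated, the round has to be run again with the new student.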
Imagine a coding model. It writes a function that mostly works but fails on an empty input. A normal distillation dataset may only show the final correct function. On-policy distillation can show the teacher reading the flawed function and saying: this edge case breaks, this return type is wrong, add this test, change this branch. That is closer to what the model needs in production.
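Here is one made-up instance of that situation. The function names and the fields of `record` are hypothetical; they only illustrate what a student draft, a teacher revision, and the resulting training record could look like.

```python
def student_mean(values):
    # Student draft: fine for normal lists, crashes on [].
    return sum(values) / len(values)


def teacher_mean(values):
    # Teacher revision: handle the empty case explicitly.
    if not values:
        return None
    return sum(values) / len(values)


# One hypothetical training record built from the student's real draft.
record = {
    "task": "Write mean(values).",
    "student_draft": "return sum(values) / len(values)",
    "teacher_feedback": "Empty input divides by zero; add a guard and a test for [].",
}

# Check that the draft really fails where the teacher says it does.
try:
    student_mean([])
    draft_crashed = False
except ZeroDivisionError:
    draft_crashed = True
```

An offline dataset would likely contain only something like `teacher_mean`. The record above also keeps the broken draft and the reason it broke.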
The enemy is distribution shift
Training and deployment do not expose the model to the same distribution. During supervised training, the model sees human answers, teacher answers, filtered data, clean transcripts. During inference, it sees its own previous tokens.
Those are different worlds.
For short answers, the gap may not matter much. For coding agents, browser agents, multi-step tool use, file edits, and long research tasks, it matters a lot. One wrong file path leads to the wrong patch. The wrong patch leads to the wrong test failure. The wrong test failure sends the model toward the wrong fix. Soon the model is not failing because it lacks knowledge. It is failing because it cannot climb out of the state it created.
On-policy distillation attacks that failure mode directly. It trains the student on the places the student actually goes.
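A back-of-the-envelope calculation shows why length matters. It assumes, unrealistically, that each step goes wrong independently with the same probability and that the model never recovers. The 1% error rate is an arbitrary number picked for illustration, not a measurement.

```python
def on_track_probability(per_step_error: float, steps: int) -> float:
    """Chance of finishing `steps` steps with no mistake, assuming
    independent errors and no recovery (both are simplifications)."""
    return (1.0 - per_step_error) ** steps


# Assumed 1% error per step, chosen only for illustration.
for steps in (10, 100, 1000):
    print(steps, on_track_probability(0.01, steps))
```

With these assumptions, a ten-step task almost always stays clean, while a thousand-step task almost never does. The "no recovery" assumption is the one on-policy distillation tries to break.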
How it relates to RLHF, DPO, and PPO
The word on-policy comes from reinforcement learning, so it naturally sits near RLHF and online RL. PPO-style training lets the current model generate responses, scores them with a reward model or preference signal, and updates the model from that fresh data.
On-policy distillation is not one single algorithm. It is a training pattern. The teacher can rewrite the student's answer. It can choose the best response from several student samples. It can create preference pairs for DPO-style training. It can take over a failed agent trajectory and leave behind a corrected trace.
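One of those variants, building a preference pair from several student samples, might be sketched like this. `build_preference_pair` and `toy_score` are hypothetical, and the `prompt` / `chosen` / `rejected` field names are an assumed record layout, not something the pattern requires.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical teacher interface: (prompt, sample) -> score, higher is better.
Scorer = Callable[[str, str], float]


def build_preference_pair(
    prompt: str, student_samples: List[str], teacher_score: Scorer
) -> Optional[Dict[str, str]]:
    """Pick the teacher's best and worst student sample as one preference pair."""
    if len(student_samples) < 2:
        return None
    scored = [(teacher_score(prompt, s), s) for s in student_samples]
    best = max(scored, key=lambda pair: pair[0])
    worst = min(scored, key=lambda pair: pair[0])
    if best[0] == worst[0]:
        return None  # the teacher sees no difference, so there is nothing to learn
    return {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}


def toy_score(prompt: str, sample: str) -> float:
    # Placeholder judgment: reward drafts that guard against empty input.
    return 1.0 if "if not" in sample else 0.0


samples = [
    "return sum(v) / len(v)",
    "if not v: return None\nreturn sum(v) / len(v)",
]
pair = build_preference_pair("Write mean(v).", samples, toy_score)
```

Both sides of the pair come from the student. The teacher only decides which one is better, so the preference signal stays inside the region the student actually visits.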
If SFT is "learn the reference answer," DPO is "learn which answer is preferred," and PPO is "move toward higher reward," on-policy distillation is the practical instruction underneath: learn from the situations your own policy creates.
Why teams pay for it
On-policy distillation is expensive. The student has to sample. The teacher has to read those samples. Sometimes the teacher has to repair them. As the student changes, the data distribution changes too, so old data becomes less representative. The pipeline needs sampling, filtering, scoring, distilling, and sampling again.
Still, teams use it because offline data cannot fully solve the problem of self-generated context. Once a model starts acting, its actions shape the next input.
This is especially true for agents. A model can be good at single-turn answers and still fall apart over a three-hour task. Long-horizon competence is not just knowing what to do. It is noticing when you have gone off track, stopping, repairing the state, and continuing.
The most valuable thing to distill from a strong teacher may not be the polished final answer. It may be the recovery behavior in the middle.
The short version
On-policy distillation is not asking a student to memorize the teacher's perfect essay.
It is letting the student write first. Then the teacher marks up the student's real draft: this is where you went wrong, and this is how you recover.
As models move toward agents, code work, tools, and long tasks, that distinction matters. A deployed model does not live inside the training set. It lives inside its own outputs.