
During training, the same model plays two roles. A teacher version is conditioned on both the query and expert examples. A student version sees only the query, reflecting real-world deployment. The student updates its parameters to align with the teacher’s predictions on outputs the student itself generates, which is what makes the distillation on-policy.
“In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations,” the researchers said.
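The teacher-student alignment described above can be sketched in miniature. The snippet below is a hypothetical toy illustration, not the researchers’ implementation: the teacher’s next-token distribution (which in SDFT would come from conditioning the same model on the query plus expert demonstrations) is simulated as a fixed target, and the student’s logits are updated by the gradient of the KL divergence between the two distributions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) between two probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
vocab = 8  # toy vocabulary size (illustrative only)

# Teacher role: same model, conditioned on query + expert examples.
# Here its next-token distribution is simulated as a fixed target.
teacher_probs = softmax(rng.normal(size=vocab))

# Student role: sees only the query; its logits are the trainable parameters.
student_logits = rng.normal(size=vocab)

lr = 0.5
losses = []
for _ in range(200):
    student_probs = softmax(student_logits)
    losses.append(kl(teacher_probs, student_probs))
    # Gradient of KL(teacher || student) w.r.t. the student's logits
    # is simply (student_probs - teacher_probs).
    student_logits -= lr * (student_probs - teacher_probs)

# The student's distribution converges toward the teacher's.
assert losses[-1] < losses[0]
```

In the full method this update would be applied over sequences the student samples itself, so the loss is measured on the student’s own outputs rather than on a fixed dataset.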
Challenges to overcome
SDFT appears quite practical, as the technique removes the need to maintain “model zoos” of separate adapters or fine-tuned variants, according to Lian Jye Su, chief analyst at Omdia.
