Self-Distillation: The Model as Its Own Teacher

The Reward Bottleneck

El Cuello de Botella de la Recompensa

Post-training language models with reinforcement learning requires reward signals. The standard approach — RL with Verifiable Rewards (RLVR) — reduces every attempt to a single scalar: pass or fail, correct or incorrect. This creates a severe credit assignment problem. A model that writes the right logic with a syntax error receives the same reward as one that generates random tokens. The outcome is binary, but the reasoning path is not.

El post-entrenamiento de modelos de lenguaje con reinforcement learning requiere señales de recompensa. El enfoque estándar — RL con Recompensas Verificables (RLVR) — reduce cada intento a un solo escalar: pasa o falla, correcto o incorrecto. Esto crea un problema severo de asignación de crédito. Un modelo que escribe la lógica correcta con un error de sintaxis recibe la misma recompensa que uno que genera tokens aleatorios. El resultado es binario, pero el camino de razonamiento no lo es.

Many verifiable environments actually provide rich textual feedback — runtime errors, compiler output, judge evaluations — that explain why an attempt failed. The question is how to convert this feedback into a learning signal without an external reward model or human annotation. The answer emerging across multiple research groups in 2025-2026 is self-distillation: the model becomes its own teacher by conditioning on feedback and distilling the hindsight distribution back into the policy.

Muchos entornos verificables realmente proporcionan retroalimentación textual rica — errores de runtime, salida del compilador, evaluaciones del juez — que explican por qué falló un intento. La pregunta es cómo convertir esta retroalimentación en una señal de aprendizaje sin un modelo de recompensa externo ni anotación humana. La respuesta que emerge de múltiples grupos de investigación en 2025-2026 es la auto-destilación: el modelo se convierte en su propio maestro condicionando en la retroalimentación y destilando la distribución retrospectiva de vuelta a la política.

SDPO: Learning from Feedback Without a Reward Model

SDPO: Aprendiendo de la Retroalimentación Sin un Modelo de Recompensa

SDPO (Self-Distillation Policy Optimization, Hübotter et al., Jan 2026) formalizes this setting as reinforcement learning with rich feedback. The core insight: when a model generates a code solution that fails a test, and then sees the error message, the same model can often identify and correct its mistake in context. The model after seeing feedback is a better version of itself — a self-teacher.

SDPO (Optimización de Política por Auto-Destilación, Hübotter et al., Ene 2026) formaliza este entorno como reinforcement learning con retroalimentación rica. La idea central: cuando un modelo genera una solución de código que falla una prueba, y luego ve el mensaje de error, el mismo modelo a menudo puede identificar y corregir su error en contexto. El modelo después de ver la retroalimentación es una versión mejor de sí mismo — un auto-maestro.

SDPO works in two steps. First, it conditions the current model on the feedback (error message, judge evaluation) and computes the feedback-informed token distribution — what the model would predict if it knew what went wrong. Then it distills this distribution back into the unconditional policy using KL divergence. No external teacher, no reward model, no human labels — just the model’s own hindsight.

SDPO funciona en dos pasos. Primero, condiciona el modelo actual en la retroalimentación (mensaje de error, evaluación del juez) y computa la distribución de tokens informada por retroalimentación — lo que el modelo predeciría si supiera qué salió mal. Luego destila esta distribución de vuelta a la política incondicional usando divergencia KL. Sin maestro externo, sin modelo de recompensa, sin etiquetas humanas — solo la retrospectiva del propio modelo.

The results across scientific reasoning, tool use, and competitive programming (LiveCodeBench v6) show consistent improvements in sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also improves performance even in environments that return only scalar feedback — by treating successful rollouts as implicit positive feedback for failed attempts on the same question. At test time, applying SDPO per-question achieves the same discovery probability as best-of-k sampling with 3x fewer attempts.

Los resultados en razonamiento científico, uso de herramientas y programación competitiva (LiveCodeBench v6) muestran mejoras consistentes en eficiencia de muestras y precisión final sobre fuertes líneas base de RLVR. Notablemente, SDPO también mejora el rendimiento incluso en entornos que solo devuelven retroalimentación escalar — tratando los rollouts exitosos como retroalimentación positiva implícita para intentos fallidos en la misma pregunta. En tiempo de prueba, aplicar SDPO por pregunta logra la misma probabilidad de descubrimiento que best-of-k sampling con 3x menos intentos.

SDFT: On-Policy Learning from Demonstrations

SDFT: Aprendizaje On-Policy a partir de Demostraciones

SDFT (Self-Distillation Fine-Tuning, Shenfeld et al., Jan 2026) addresses a related problem: how to learn from expert demonstrations without the forgetting that plagues supervised fine-tuning (SFT). SFT is inherently off-policy — it maximizes the likelihood of demonstration tokens regardless of whether the model would have generated them. This creates distribution mismatch and catastrophic forgetting of prior capabilities.

SDFT (Fine-Tuning por Auto-Destilación, Shenfeld et al., Ene 2026) aborda un problema relacionado: cómo aprender de demostraciones de expertos sin el olvido que afecta al supervised fine-tuning (SFT). SFT es inherentemente off-policy — maximiza la verosimilitud de los tokens de demostración independientemente de si el modelo los habría generado. Esto crea un desajuste de distribución y olvido catastrófico de capacidades previas.

SDFT converts demonstration learning into an on-policy process by using the demonstration-conditioned model as its own teacher. Given a demonstration of a new skill, the model conditions on it and generates on-policy rollouts. The distribution of these feedback-informed rollouts serves as the training target. Because the learning signal comes from the model’s own on-policy distribution — not from a static dataset — it preserves prior capabilities while acquiring the new one.

SDFT convierte el aprendizaje de demostraciones en un proceso on-policy usando el modelo condicionado por la demostración como su propio maestro. Dada una demostración de una nueva habilidad, el modelo se condiciona en ella y genera rollouts on-policy. La distribución de estos rollouts informados por retroalimentación sirve como objetivo de entrenamiento. Debido a que la señal de aprendizaje proviene de la propia distribución on-policy del modelo — no de un dataset estático — preserva capacidades previas mientras adquiere la nueva.

In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression. This is continual learning from demonstrations without replay buffers, without explicit regularization, without task boundaries — just the model’s own hindsight distribution, applied iteratively.

En experimentos de aprendizaje secuencial, SDFT permite que un solo modelo acumule múltiples habilidades a lo largo del tiempo sin regresión de rendimiento. Esto es aprendizaje continuo a partir de demostraciones sin buffers de repetición, sin regularización explícita, sin límites de tarea — solo la distribución retrospectiva del propio modelo, aplicada iterativamente.

Learning from User Interactions

Aprendiendo de Interacciones de Usuario

The same self-distillation principle extends to the most abundant data source available to deployed language models: multi-turn user interactions (Kleine Buening et al., Feb 2026). When a user sends a follow-up message after a model’s response, that follow-up often contains implicit feedback — a correction, a clarification, a signal that the original response was insufficient.

El mismo principio de auto-destilación se extiende a la fuente de datos más abundante disponible para modelos de lenguaje desplegados: interacciones de usuario multi-turno (Kleine Buening et al., Feb 2026). Cuando un usuario envía un mensaje de seguimiento después de la respuesta de un modelo, ese seguimiento a menudo contiene retroalimentación implícita — una corrección, una aclaración, una señal de que la respuesta original fue insuficiente.

The method is elegant: condition the model on the user’s follow-up, compute the hindsight token distribution (what the model would have said knowing what the user would respond), and distill this back into the unconditional policy. Training on real-world WildChat conversations improves standard alignment and instruction-following benchmarks without regressing other capabilities. The same mechanism enables personalization — models adapt to individual users through interaction history without explicit feedback or preference data.

El método es elegante: condiciona el modelo en el seguimiento del usuario, computa la distribución retrospectiva de tokens (lo que el modelo habría dicho sabiendo lo que el usuario respondería), y destila esto de vuelta a la política incondicional. El entrenamiento en conversaciones reales de WildChat mejora los benchmarks estándar de alineamiento y seguimiento de instrucciones sin regresar otras capacidades. El mismo mecanismo permite la personalización — los modelos se adaptan a usuarios individuales a través del historial de interacción sin retroalimentación explícita ni datos de preferencia.

Why Online Reinforcement Learning Forgets Less

Por Qué el Reinforcement Learning Online Olvida Menos

A parallel line of work (RL’s Razor, Kleine Buening et al., Sep 2025) provides the theoretical grounding for why these self-distillation methods work. The key finding: online RL forgets less than offline methods because the training distribution is tied to the current policy. When the policy shifts, online methods generate new data from the shifted policy, creating a natural curriculum. Offline methods (SFT, DPO) optimize against a fixed dataset, so any distribution shift between the dataset and the current policy produces conflicting gradients.

Una línea de trabajo paralela (RL’s Razor, Kleine Buening et al., Sep 2025) proporciona la base teórica de por qué estos métodos de auto-destilación funcionan. El hallazgo clave: el RL online olvida menos que los métodos offline porque la distribución de entrenamiento está ligada a la política actual. Cuando la política cambia, los métodos online generan nuevos datos desde la política cambiada, creando un currículo natural. Los métodos offline (SFT, DPO) optimizan contra un dataset fijo, por lo que cualquier desviación de distribución entre el dataset y la política actual produce gradientes conflictivos.

Self-distillation methods inherit this advantage. By generating training signals from the model’s current on-policy distribution (conditioned on feedback), they avoid the distribution mismatch that causes forgetting. The feedback conditions the distribution, but the policy generates the tokens — keeping the learning signal grounded in what the model can actually produce.

Los métodos de auto-destilación heredan esta ventaja. Al generar señales de entrenamiento desde la distribución on-policy actual del modelo (condicionada en retroalimentación), evitan el desajuste de distribución que causa el olvido. La retroalimentación condiciona la distribución, pero la política genera los tokens — manteniendo la señal de aprendizaje arraigada en lo que el modelo puede realmente producir.

Test-Time Adaptation

Adaptación en Tiempo de Prueba

Active Fine-Tuning (AFT, Hübotter et al., Oct 2024) extends this paradigm to test time: instead of distilling into a static policy, the model actively fine-tunes itself during evaluation by generating attempts, observing outcomes, and updating. On difficult binary-reward tasks, this achieves the same discovery probability as best-of-k with 3x fewer attempts — matching the SDPO finding independently.

El Active Fine-Tuning (AFT, Hübotter et al., Oct 2024) extiende este paradigma al tiempo de prueba: en lugar de destilar en una política estática, el modelo se afina activamente durante la evaluación generando intentos, observando resultados y actualizando. En tareas difíciles de recompensa binaria, esto logra la misma probabilidad de descubrimiento que best-of-k con 3x menos intentos — coincidiendo con el hallazgo de SDPO de forma independiente.

The convergence of these results — from SDPO, SDFT, user interaction alignment, RL’s Razor, and AFT — points to a coherent picture. Self-distillation with on-policy feedback signals is a general mechanism for improving language models that applies across training regimes (pre-training, fine-tuning, test-time) and data sources (verifiable rewards, demonstrations, user interactions). The key ingredients are: a model that can condition on feedback to produce a better distribution, and a distillation objective that transfers this improvement back into the unconditional policy.

La convergencia de estos resultados — de SDPO, SDFT, alineamiento de interacciones de usuario, RL’s Razor y AFT — apunta a una imagen coherente. La auto-destilación con señales de retroalimentación on-policy es un mecanismo general para mejorar modelos de lenguaje que se aplica a través de regímenes de entrenamiento (pre-entrenamiento, fine-tuning, tiempo de prueba) y fuentes de datos (recompensas verificables, demostraciones, interacciones de usuario). Los ingredientes clave son: un modelo que puede condicionarse en la retroalimentación para producir una distribución mejor, y un objetivo de destilación que transfiere esta mejora de vuelta a la política incondicional.

The Engine That Never Stops

El Motor Que Nunca Se Detiene

Every user interaction, every compiler error, every failed test, every follow-up question — each is a potential learning signal. The self-distillation framework converts the model’s own deployment into a continuous training loop. The model generates, receives feedback, conditions on it, and distills the improvement. No reward model. No human annotators. No static dataset. Just the model, its mistakes, and its ability to do better in hindsight.

Cada interacción de usuario, cada error de compilador, cada prueba fallida, cada pregunta de seguimiento — cada una es una señal de aprendizaje potencial. El framework de auto-destilación convierte el propio despliegue del modelo en un bucle de entrenamiento continuo. El modelo genera, recibe retroalimentación, se condiciona en ella y destila la mejora. Sin modelo de recompensa. Sin anotadores humanos. Sin dataset estático. Solo el modelo, sus errores y su capacidad de hacerlo mejor en retrospectiva.

References

Referencias

Hübotter, J. et al. (2026). *Reinforcement Learning via Self-Distillation*. arxiv.org/abs/2601.20802
Shenfeld, I. et al. (2026). *Self-Distillation Enables Continual Learning*. arxiv.org/abs/2601.19897
Kleine Buening, T. et al. (2026). *Aligning Language Models from User Interactions*. arxiv.org/abs/2603.12273
Kleine Buening, T. et al. (2025). *RL's Razor: Why Online Reinforcement Learning Forgets Less*. arxiv.org/abs/2509.04259
Hübotter, J. et al. (2024). *Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs*. arxiv.org/abs/2410.08020

Hübotter, J. et al. (2026). *Reinforcement Learning via Self-Distillation*. arxiv.org/abs/2601.20802
Shenfeld, I. et al. (2026). *Self-Distillation Enables Continual Learning*. arxiv.org/abs/2601.19897
Kleine Buening, T. et al. (2026). *Aligning Language Models from User Interactions*. arxiv.org/abs/2603.12273
Kleine Buening, T. et al. (2025). *RL's Razor: Why Online Reinforcement Learning Forgets Less*. arxiv.org/abs/2509.04259
Hübotter, J. et al. (2024). *Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs*. arxiv.org/abs/2410.08020