Harness Updating Is Not Harness Benefit: Disentangling Agent Self-Evolution

A self-evolving agent writes its own improvements. It attempts tasks, fails, and distills the lesson back into its own system prompt, skill library, or memory. The agent gets better without retraining. This is the promise of harness self-evolution—and it is one of the most active frontiers in agentic AI research.

But when a self-evolving system improves, where does the gain come from? Is it the evolver that produces better harness updates, or the task-solving agent that uses the updated harness more effectively? End-to-end scores cannot tell you. A new paper—“Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents” (arXiv:2605.30621) by Lin, Wu, Wang et al. (Penn State, UC Santa Cruz, Amazon, UIUC)—systematically decouples these two capabilities and arrives at a counterintuitive result: harness-updating is flat across model tiers, while harness-benefit is non-monotonic.

Concretely: a Qwen3.5-9B evolver produces harness updates whose downstream gains match those of Claude Opus 4.6—despite an enormous gap in base capability. But the same harness helps a mid-tier model (GPT-OSS-120B) far more than it helps either a weak model (Qwen3-32B) or a strong one (Opus 4.6). The paper traces weak-tier failures to two concrete modes: failing to load the harness at all (activation failure) and failing to follow it faithfully once loaded (adherence failure).

The Framework: Two Capabilities, One Loop

The paper formalizes harness self-evolution as a loop. At step t, an agent A_t = (f, H_t), where f is a frozen model backbone and H_t is the external harness state, solves a batch of tasks. The execution trajectories become evidence D_t. An evolver e reads this evidence and produces a harness update ΔH_t = e(H_{t-1}, D_t), yielding H_t. The loop repeats.
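
The loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's or the codebase's implementation: the `Harness` dataclass, the `Backbone` and `Evolver` callables, and the toy functions are all hypothetical names chosen here, and the harness is assumed to be nothing more than a list of skill strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Harness:
    """External harness state H; assumed here to be a plain list of skills."""
    skills: List[str] = field(default_factory=list)


# Hypothetical interfaces: a backbone maps (harness, task) to a trajectory
# string; an evolver maps (harness, evidence) to a list of new skills.
Backbone = Callable[[Harness, str], str]
Evolver = Callable[[Harness, List[str]], List[str]]


def evolve(f: Backbone, e: Evolver, h0: Harness,
           batches: List[List[str]]) -> List[Harness]:
    """Run the loop and return the harness state after every step."""
    history = [h0]
    h = h0
    for batch in batches:
        evidence = [f(h, task) for task in batch]  # trajectories become D_t
        delta = e(h, evidence)                     # evolver proposes an update
        h = Harness(skills=h.skills + delta)       # backbone f stays frozen
        history.append(h)
    return history


def toy_backbone(h: Harness, task: str) -> str:
    # Toy rule: a task succeeds only if a matching skill is in the harness.
    return f"ok:{task}" if f"skill:{task}" in h.skills else f"fail:{task}"


def toy_evolver(h: Harness, evidence: List[str]) -> List[str]:
    # Toy rule: write one skill per failed trajectory.
    return [f"skill:{t.split(':', 1)[1]}" for t in evidence
            if t.startswith("fail:")]


history = evolve(toy_backbone, toy_evolver, Harness(),
                 [["a", "b"], ["a", "c"]])
print([h.skills for h in history])
```

Only the harness changes between steps; the backbone is passed in once and never modified, which is what separates this setup from retraining.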

From this loop, the paper defines two distinct capabilities:

- Harness-updating (Δ_update): the evolver’s ability to produce harness updates that improve task-solving. Measured as the mean gain across a fixed set of anchor agents.
- Harness-benefit (Δ_benefit): the agent’s ability to benefit from updated harnesses during task solving. Measured as the max gain across a fixed set of anchor evolvers.
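
As a rough sketch of how these two aggregates could be computed from a table of scores, assuming "gain" means post-evolution score minus the agent's no-evolution score: the function names, the table layout, and every number below are made up for illustration and are not taken from the paper or its codebase.

```python
from statistics import mean
from typing import Dict, List, Tuple

# scores[(evolver, agent)] is the post-evolution score for that pairing;
# base[agent] is the same agent's score without evolution (M_base).
Scores = Dict[Tuple[str, str], float]


def delta_update(evolver: str, anchor_agents: List[str],
                 scores: Scores, base: Dict[str, float]) -> float:
    """Mean gain produced by one evolver across the fixed anchor agents."""
    return mean(scores[(evolver, a)] - base[a] for a in anchor_agents)


def delta_benefit(agent: str, anchor_evolvers: List[str],
                  scores: Scores, base: Dict[str, float]) -> float:
    """Max gain realized by one agent across the fixed anchor evolvers."""
    return max(scores[(e, agent)] - base[agent] for e in anchor_evolvers)


# Hypothetical numbers, for illustration only.
base = {"A1": 40.0, "A2": 60.0}
scores = {
    ("E1", "A1"): 46.0, ("E1", "A2"): 62.0,
    ("E2", "A1"): 44.0, ("E2", "A2"): 65.0,
}

print(delta_update("E1", ["A1", "A2"], scores, base))   # mean of 6.0 and 2.0
print(delta_benefit("A1", ["E1", "E2"], scores, base))  # max of 6.0 and 4.0
```

Holding the anchor set fixed is what makes the comparison fair: each evolver is scored against the same agents, and each agent against the same evolvers.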

The base capability M_base(f) is the agent’s performance without evolution. The key question: do Δ_update and Δ_benefit track M_base? The answer, across seven LLMs and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), is a clear no—in two surprising ways.

Finding 1: Harness-Updating Is Flat

When you fix the task-solving agent and vary the evolver, the spread of Δ_update across models is at most 3.1 percentage points on any benchmark. No evolver wins across all three benchmarks—Qwen3-235B leads on SWE (8.2pp) but ranks last on MCP (0.6pp). The smallest evolver, Qwen3.5-9B, posts the highest gain on SkillsBench (3.8pp), exceeding both Opus 4.6 (2.3pp) and Qwen3-235B (1.5pp).

A case study reveals why. On a SkillsBench task (flink-query), the skills evolved by Qwen3.5-9B and Opus 4.6 are procedurally isomorphic—they prescribe the same sequence of steps, differing only in surface verbosity. The 9B model reaches the same procedural content as the frontier model. The implication: writing a good skill does not require a strong model. It requires recognizing a failure pattern and distilling the corrective procedure—a capability that saturates early in model scale.

Post-evolution performance is dominated by the agent’s base capability, not the evolver’s identity. The within-agent spread across seven evolvers is at most 5.1pp, against a 36pp gap between Opus and Qwen3-235B base capabilities. Even pairing the weakest agent with its best evolver against the strongest agent with its worst evolver, the strong agent still leads by 18.6 to 35.2pp on every benchmark.

Finding 2: Harness-Benefit Is Non-Monotonic

When you fix the evolver and vary the task-solving agent, Δ_benefit does not increase with base capability. On SWE-bench, the gain peaks at Qwen3-235B (19.3pp), while the weaker Qwen3-32B gains only 4.4pp and the stronger Opus 4.6 gains only 2.6pp. The pattern repeats across benchmarks: mid-tier models (Qwen3-235B, GPT-OSS-120B) benefit most, strong models (Opus 4.6) hit a ceiling, and weak models (Qwen3-32B) benefit the least despite having the most headroom.

Why do weak models fail to benefit? The paper identifies two failure modes through a detailed diagnostic framework implemented in the harness-so codebase:

1. Harness Activation Failure. Weak models often fail to bring relevant harness artifacts (skills) into their working context. Qwen3-32B has a Skill-Load Rate (SLR) of only 25.1%: it loads a relevant skill in only about one of every four attempts. Strong models hover around 96%. The failure is subtle: Qwen3-32B identifies the right skill, but embeds the load request inside a broader action rather than issuing it as a standalone skill-loading command. The environment never sees a valid load request.

2. Harness Adherence Failure. Even when skills are loaded, weak models fail to follow their guidance faithfully. Qwen3-32B has a Harness-Following Rate (HFR) of only 14.2%, compared to Opus 4.6’s 75.7%. Qwen3-235B provides the cleanest separation between activation and adherence: its SLR is 96.1% (near-perfect activation), yet its HFR is only 35.0% (poor adherence). Loading the harness is not sufficient.
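
The two failure modes can be made concrete with a small sketch. Everything here is an assumption for illustration: the `load_skill("...")` command syntax, the `Trajectory` record, and the function names are hypothetical, SLR is taken to be the share of trajectories containing at least one valid standalone load, and HFR is taken to be the share of loaded trajectories that an external judge labels as following the skill. The codebase's own diagnostics may define these differently.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical syntax: a load counts only if the WHOLE action is the command.
LOAD_RE = re.compile(r'load_skill\("([\w-]+)"\)')


def parse_load(action: str) -> Optional[str]:
    """Return the skill name for a standalone load command, else None."""
    match = LOAD_RE.fullmatch(action.strip())
    return match.group(1) if match else None


@dataclass
class Trajectory:
    actions: List[str]
    followed_skill: bool  # adherence label, assumed to come from a judge


def loaded(traj: Trajectory) -> bool:
    return any(parse_load(a) is not None for a in traj.actions)


def skill_load_rate(trajs: List[Trajectory]) -> float:
    return sum(loaded(t) for t in trajs) / len(trajs)


def harness_following_rate(trajs: List[Trajectory]) -> float:
    # Adherence is only meaningful for trajectories that loaded a skill.
    with_skill = [t for t in trajs if loaded(t)]
    if not with_skill:
        return 0.0
    return sum(t.followed_skill for t in with_skill) / len(with_skill)


trajs = [
    Trajectory(['load_skill("flink-query")', "run_query()"], True),
    # Activation failure: the load is buried inside a broader action.
    Trajectory(['bash: load_skill("flink-query") && run_query()'], False),
    # Adherence failure: loaded correctly, then ignored.
    Trajectory(['load_skill("flink-query")', "guess()"], False),
    Trajectory(["run_query()"], False),
]

print(skill_load_rate(trajs), harness_following_rate(trajs))
```

The second trajectory names the right skill but never produces a parseable load, so it counts against SLR. The third loads cleanly and still counts against HFR, which is why the two rates have to be tracked separately.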

A phase-level adherence analysis reveals the root cause: adherence degrades as trajectories unfold. Qwen3-32B drops from 0.52 after harness loading to 0.13 at the final turn—a drift of -0.39. Opus 4.6 stays stable at 0.89 → 0.80 (drift -0.09). The bottleneck is long-horizon instruction following: weak models initially follow the harness, but progressively lose adherence as the task requires more steps.
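
The drift figures are simply the final-turn adherence minus the post-load adherence. A minimal sketch of that arithmetic follows, with hypothetical function names and a made-up three-phase record layout (`post_load`, `mid`, `final`); the per-trajectory values in the toy example are invented.

```python
from statistics import mean
from typing import Dict, List

PHASES = ("post_load", "mid", "final")


def phase_adherence(trajs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each phase's adherence score over all trajectories."""
    return {phase: mean(t[phase] for t in trajs) for phase in PHASES}


def adherence_drift(post_load: float, final: float) -> float:
    """Change in adherence from just after loading to the final turn."""
    return round(final - post_load, 2)


# Figures quoted in the article.
print(adherence_drift(0.52, 0.13))  # Qwen3-32B
print(adherence_drift(0.89, 0.80))  # Opus 4.6

# Hypothetical per-trajectory scores, for illustration only.
toy = [
    {"post_load": 0.5, "mid": 0.25, "final": 0.0},
    {"post_load": 1.0, "mid": 0.75, "final": 0.5},
]
print(phase_adherence(toy))
```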

Design Guidance for Self-Evolving Agents

The paper translates its findings into three concrete design principles, which map directly to the architecture of the harness-so codebase (implemented as a DSPy module with four lifecycle layers):

1. Invest capability budget in the agent, not the evolver. The Δ_update gap across evolvers is at most 3.1pp. Post-evolution performance varies far more with the agent than with the evolver. If you have compute to spend, spend it on a better task-solving backbone—not on a stronger evolver model. The codebase’s EvolutionExperiment makes this measurable: compute_delta_update(e) and compute_delta_benefit(f) quantify exactly where your budget should go.

2. Bake harness invocation into agent training. Weak models have a 25.1% skill-load rate vs. ~96% for strong models. The diagnostics.py module in the codebase measures this precisely via compute_skill_load_rate() and compute_harness_following_rate(). These metrics should be first-class training targets, not afterthoughts. An agent that cannot reliably load the harness cannot benefit from evolution, regardless of how good the evolver is.

3. Strengthen long-horizon instruction following. Phase-level adherence drifts by -0.39 for weak models vs. -0.09 for strong models across the trajectory. The compute_phase_adherence() function in the codebase measures adherence at different trajectory stages (post-load, mid-turn, final-turn). Training should target sustained instruction following over multi-step tasks—not just single-turn compliance.

The Codebase: Life-Harness on DSPy

The harness-so repository implements both the original Life-Harness framework (arXiv:2605.22166) and the experimental methodology from this paper. The architecture wraps agents in four lifecycle layers implemented as a DSPy module:

- H3 Contract Layer: Enhances tool descriptions with policy constraints before interaction
- H5 Skill Layer: BM25 retrieval injects relevant procedural skills into the system prompt
- H2 Action Layer: Validates or blocks actions before execution
- H4 Trajectory Layer: Detects repetition, stagnation, and budget exhaustion during execution
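
To illustrate the retrieval step in the skill layer, here is a self-contained sketch of Okapi BM25 ranking over a tiny skill library. It is not the repository's code: the function names, the whitespace tokenizer, the default `k1` and `b` values, and the example skills are all assumptions, and repeated query terms are counted once for simplicity.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Simplistic lowercase whitespace tokenizer (an assumption of this sketch).
    return text.lower().split()


def bm25_rank(query: str, skills: Dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank skills by Okapi BM25 score against the query, best first."""
    docs = {name: tokenize(body) for name, body in skills.items()}
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    # Document frequency: how many skills contain each term.
    df = Counter(term for d in docs.values() for term in set(d))
    query_terms = set(tokenize(query))

    scores: Dict[str, float] = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical skill library, for illustration only.
skills = {
    "flink-query": "write a flink sql query and validate the schema before running",
    "git-bisect": "use git bisect to find the commit that introduced a regression",
    "csv-clean": "clean a csv file by dropping empty rows and fixing headers",
}

ranked = bm25_rank("fix the failing flink sql query", skills)
print(ranked[0][0])
```

In a setup like this, the top-ranked skill bodies would be the ones injected into the system prompt; how many to inject is a separate budget decision the sketch leaves out.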

The LLMEvolver uses an LLM to read execution trajectories and propose structured harness updates (new skills, prompt hints, memories). The EvolutionExperiment runs controlled agent-evolver pairings and computes the paper’s metrics. The diagnostics module measures SLR, HFR, and phase-level adherence. The entire pipeline is composable via DSPy’s module system and optimizable with GEPA, MIPROv2, or any DSPy optimizer.

Key Numbers

- 3.1pp max Δ_update spread across evolvers on any benchmark: harness-updating is flat
- 9B model matches Opus 4.6 as an evolver (procedurally isomorphic skills)
- 19.3pp Δ_benefit peak for Qwen3-235B (mid-tier) on SWE-bench
- 4.4pp Δ_benefit for Qwen3-32B (weak-tier) on SWE-bench, despite most headroom
- 25.1% skill-load rate for Qwen3-32B vs. ~96% for strong models
- 14.2% harness-following rate for Qwen3-32B vs. 75.7% for Opus 4.6
- -0.39 adherence drift for weak models vs. -0.09 for strong models across trajectory
- 7 LLMs, 3 benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench)
- Python 3.12+, DSPy 3.2+, MIT-style research license

References

Lin, M., Wu, J., Wang, Z., et al. (2026). Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents. arXiv:2605.30621. arxiv.org/abs/2605.30621
Xu, T., Wen, H., & Li, M. (2026). Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents (Life-Harness). arXiv:2605.22166. arxiv.org/abs/2605.22166
Reference Code: github.com/A-EVO-Lab/a-evolve
Agrawal, L. A., et al. (2026). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. ICLR 2026.
Xia, C., et al. (2026). Trace2Skill: Towards Generalizable Compositional Skills in Agentic Systems. arXiv:2603.25158.
Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
