Don't Train the Model, Train the Harness: Runtime Interface Adaptation for LLM Agents

At Peking University, Xu, Wen, and Li just published a paper that reframes a question the agentic AI field has been asking wrong. The question is: how do we make LLM agents better at deterministic tasks? The standard answer has been: train a better model. Fine-tune. RLHF. Distillation. Preference optimization.

The paper — Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents — argues this is backwards. For deterministic, rule-governed domains, most agent failures aren’t reasoning failures. They’re interface mismatches between the model and the environment. The model could solve the task — it just can’t see the tools properly, or its actions are malformed, or it’s stuck in a repetition loop.

So they fix the interface, not the model.

1. The Failure Diagnosis

The authors ran Qwen3-4B-Instruct across seven deterministic environments from τ-bench, τ²-bench, and AgentBench — covering household interaction, web shopping, OS control, database tasks, and policy-guided workflows. They manually inspected every failed trajectory and classified the bottleneck.

The taxonomy is striking. Failures fall into four categories, and only the smallest one is a genuine reasoning failure:

Action Realization Failures (34%) — The model’s intent is plausible, but the action isn’t executable: free-form text instead of a tool call, missing arguments, wrong argument types.
Environment Contract Mismatches (27%) — The action is syntactically valid but violates the tool’s intended usage protocol: calling tools in the wrong order, incorrect API semantics, misunderstanding feedback formats.
Trajectory Degeneration (22%) — Individual actions are fine, but the episode falls into repetition, stagnation, or ineffective recovery loops.
Residual Reasoning Failures (17%) — Actual incorrect inference or decision-making despite correctly following the protocol.

The distribution varies dramatically by environment — in ALFWorld, action realization dominates; in τ-bench policy domains, contract mismatches are the primary bottleneck. But across the board, 83% of failures are interface problems, not reasoning deficits.
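
The 83% figure is simply the sum of the three interface-side categories. A minimal sketch of that tally follows; the enum and dictionary names are illustrative assumptions, and only the four percentages come from the text above.

```python
from enum import Enum


class FailureCategory(Enum):
    ACTION_REALIZATION = "action realization"
    CONTRACT_MISMATCH = "environment contract mismatch"
    TRAJECTORY_DEGENERATION = "trajectory degeneration"
    RESIDUAL_REASONING = "residual reasoning"


# Aggregate shares quoted above, in percent of failed trajectories.
FAILURE_SHARE = {
    FailureCategory.ACTION_REALIZATION: 34,
    FailureCategory.CONTRACT_MISMATCH: 27,
    FailureCategory.TRAJECTORY_DEGENERATION: 22,
    FailureCategory.RESIDUAL_REASONING: 17,
}

# Everything except residual reasoning counts as an interface problem.
INTERFACE_CATEGORIES = {
    FailureCategory.ACTION_REALIZATION,
    FailureCategory.CONTRACT_MISMATCH,
    FailureCategory.TRAJECTORY_DEGENERATION,
}


def interface_share(shares):
    """Sum the percentage of failures attributed to interface categories."""
    return sum(pct for cat, pct in shares.items() if cat in INTERFACE_CATEGORIES)


print(interface_share(FAILURE_SHARE))  # 83
```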

2. Life-Harness: Four Layers of Runtime Intervention

Life-Harness is the system they built around this diagnosis. It wraps the frozen LLM with four lifecycle-aware layers, each targeting one failure category. The model weights never change. The environment never changes. What changes is how the model interfaces with the environment.
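
To make the lifecycle concrete, here is a rough sketch of an agent loop with one hook per stage. It is not the authors' implementation: the function name run_episode, the callable signatures, and the flat list of string messages are all assumptions chosen to keep the example small.

```python
from typing import Callable, List, Optional, Tuple


def run_episode(
    task: str,
    model: Callable[[List[str]], str],               # frozen LLM: messages -> action
    env_step: Callable[[str], Tuple[str, bool]],     # action -> (observation, done)
    contract: str,                                   # stage 1: enhanced tool contract
    retrieve_skills: Callable[[str], List[str]],     # stage 2: task -> skills
    validate: Callable[[str], Optional[str]],        # stage 3: correction or None
    regulate: Callable[[List[str]], Optional[str]],  # stage 4: recovery or None
    max_steps: int = 10,
) -> List[str]:
    """Run one episode and return the actions that reached the environment."""
    # Stages 1 and 2: build the context before the first model call.
    messages = [contract, *retrieve_skills(task), task]
    executed: List[str] = []
    for _ in range(max_steps):
        action = model(messages)
        # Stage 3: block malformed actions before they reach the environment.
        correction = validate(action)
        if correction is not None:
            messages.append(correction)
            continue
        observation, done = env_step(action)
        executed.append(action)
        messages.append(observation)
        if done:
            break
        # Stage 4: look for degenerate patterns after environment feedback.
        recovery = regulate(executed)
        if recovery is not None:
            messages.append(recovery)
    return executed


# Toy stubs: the "model" first emits a malformed action, then a valid one.
script = iter(["look", "open(door)"])
executed = run_episode(
    task="Open the door.",
    model=lambda messages: next(script),
    env_step=lambda action: ("door is open", True),
    contract="Actions must look like name(arg).",
    retrieve_skills=lambda task: [],
    validate=lambda a: None if a.endswith(")") else "Use the form name(arg).",
    regulate=lambda history: None,
)
print(executed)  # ['open(door)']
```

Note that the model and the environment are passed in as opaque callables: nothing in the loop touches weights or environment internals, which is the point of the design.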

Layer 1: Environment Contract Layer

Target: Contract mismatches. Stage: Before interaction.

The model sees tool descriptions and API docs. These are often generic — written for humans, not for LLMs. The Environment Contract Layer rewrites them to be agent-executable: which tools to prefer, admissible argument ranges, ordering constraints, common pitfalls.

This isn’t prompt engineering. It’s making implicit environmental structure explicit at the point where the model reads it. The enhanced contract C' = C ⊕ Δ_C is derived from training trajectories where agents repeatedly misinterpreted the original contract.

Example: In τ-bench’s retail domain, the tool transfer_to has an argument type that agents kept setting to values the environment would reject. The contract layer adds: “type must be one of [‘wire’, ‘ach’, ‘internal’]. Wire transfers require manager_approval first.”
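
One way to read C' = C ⊕ Δ_C is as appending learned notes to each tool's original description. The sketch below assumes that reading; the ToolContract class, the enhance function, and the base description string are hypothetical, while the two notes are the ones quoted in the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolContract:
    name: str
    description: str
    notes: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the contract as the text the model will read."""
        lines = [f"{self.name}: {self.description}"]
        lines += [f"  - {note}" for note in self.notes]
        return "\n".join(lines)


def enhance(
    contracts: Dict[str, ToolContract], delta: Dict[str, List[str]]
) -> Dict[str, ToolContract]:
    """Return the enhanced contract without mutating the original one."""
    return {
        name: ToolContract(name, tool.description, tool.notes + delta.get(name, []))
        for name, tool in contracts.items()
    }


# Hypothetical base description; the notes come from the example above.
base = {"transfer_to": ToolContract("transfer_to", "Transfer funds to an account.")}
delta = {
    "transfer_to": [
        "type must be one of ['wire', 'ach', 'internal'].",
        "Wire transfers require manager_approval first.",
    ]
}
enhanced = enhance(base, delta)
print(enhanced["transfer_to"].render())
```

Keeping the delta separate from the base contract makes it easy to inspect, version, or remove what was learned from trajectories.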

Layer 2: Procedural Skill Layer

Target: General decision errors. Stage: Task conditioning.

From training trajectories, the system distills compact reusable strategies — “skills” — and stores them in a BM25-indexed library. When a new task arrives, the harness retrieves the top-K relevant skills and injects them into the system prompt.

This is analogous to Trace2Skill — it captures how successful trajectories solved specific subproblems, not just what tools they used. The skills are non-parametric guidance: compact enough to fit in context, specific enough to change behavior.

Example: For database tasks in AgentBench, the skill might read: “When asked for the top-N results, first query COUNT(*) to estimate result size, then use ORDER BY + LIMIT. If the query times out, break it into date-range partitions.”
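
The retrieval step can be sketched with a small, dependency-free Okapi BM25 scorer. This is an illustration under assumptions, not the paper's code: the SkillLibrary class, the tokenizer, the k1 and b defaults, and the second and third skills are all made up for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class SkillLibrary:
    """Tiny Okapi BM25 index over skill strings."""

    def __init__(self, skills: List[str], k1: float = 1.5, b: float = 0.75):
        self.skills = skills
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(s)) for s in skills]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / max(len(self.lengths), 1)
        # Document frequency: in how many skills each term appears.
        self.df = Counter(term for d in self.docs for term in d)

    def _idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log(1 + (len(self.docs) - n + 0.5) / (n + 0.5))

    def _score(self, terms: List[str], i: int) -> float:
        doc, score = self.docs[i], 0.0
        norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
        for term in terms:
            tf = doc.get(term, 0)
            if tf:
                score += self._idf(term) * tf * (self.k1 + 1) / (tf + norm)
        return score

    def retrieve(self, task: str, k: int = 2) -> List[str]:
        terms = tokenize(task)
        scored = [(self._score(terms, i), i) for i in range(len(self.skills))]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Drop skills that share no term with the task.
        return [self.skills[i] for score, i in scored[:k] if score > 0]


library = SkillLibrary([
    "When asked for the top-N results, first query COUNT(*) to estimate "
    "result size, then use ORDER BY + LIMIT.",
    # The two skills below are hypothetical filler for the example.
    "Before returning an order, look up the order status and confirm it with the user.",
    "When a file is missing, list the parent directory before retrying.",
])
top = library.retrieve("Show the top 5 customers by total spend.", k=1)
system_prompt = "Relevant skills:\n" + "\n".join(f"- {s}" for s in top)
print(system_prompt)
```

BM25 is purely lexical, so retrieval quality depends on the task text sharing vocabulary with the stored skills.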

Layer 3: Action Realization Layer

Target: Action realization failures. Stage: After model output, before environment execution.

This layer is a sandbox validator. It checks every model action against the environment’s tool schema, admissible action set, argument constraints, and task policies before allowing it to reach the environment.

If the action is valid, it passes through. If it has unambiguous interface errors — missing arguments, wrong types, ordering violations — the layer blocks it and returns a structured correction message. The model never sees the environment’s raw error; it sees a targeted fix instruction.

Example: The model outputs get_price(item_id="SKU-42") but the tool expects product_id. The layer blocks execution and returns: “Tool ‘get_price’ expects argument ‘product_id’, not ‘item_id’. Retry with the correct parameter name.”
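
A validator of this kind might look like the sketch below. The schema format (tool name mapped to argument names and Python types) and the function validate_action are assumptions made for illustration; a real harness would more likely check against the environment's own tool schema.

```python
from typing import Any, Dict, Optional

# Hypothetical schema format: tool name -> {argument name: expected Python type}.
TOOL_SCHEMA = {"get_price": {"product_id": str}}


def validate_action(
    tool: str, args: Dict[str, Any], schema: Dict[str, Dict[str, type]] = TOOL_SCHEMA
) -> Optional[str]:
    """Return None if the call is executable, else a correction message."""
    if tool not in schema:
        return f"Unknown tool '{tool}'. Available tools: {sorted(schema)}."
    expected = schema[tool]
    missing = [a for a in expected if a not in args]
    unexpected = [a for a in args if a not in expected]
    # One missing plus one unexpected argument usually means a misnamed parameter.
    if len(missing) == 1 and len(unexpected) == 1:
        return (
            f"Tool '{tool}' expects argument '{missing[0]}', not '{unexpected[0]}'. "
            "Retry with the correct parameter name."
        )
    if missing:
        return f"Tool '{tool}' is missing required argument(s): {missing}."
    if unexpected:
        return f"Tool '{tool}' does not accept argument(s): {unexpected}."
    for name, expected_type in expected.items():
        if not isinstance(args[name], expected_type):
            return (
                f"Argument '{name}' of '{tool}' must be {expected_type.__name__}, "
                f"got {type(args[name]).__name__}."
            )
    return None


print(validate_action("get_price", {"item_id": "SKU-42"}))
print(validate_action("get_price", {"product_id": "SKU-42"}))  # None
```

The first call reproduces the correction message from the example; the second passes through because the action matches the schema.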

Layer 4: Trajectory Regulation Layer

Target: Trajectory degeneration. Stage: After environment feedback.

This layer monitors the interaction pattern for self-reinforcing failures: repeating the same invalid command, cycling between equivalent states, exhausting the step budget without progress. These patterns are syntactically detectable — you don’t need deep semantics to spot them.

When degeneration is detected, the layer injects a recovery message. Depending on severity, this ranges from a soft suggestion (“You’ve tried this action twice with the same result — consider a different approach”) to a hard reset directive (“The current strategy is not making progress. Stop and re-read the task description.”).

Example: The agent calls list_products(page=1) five times in a row, getting the same results. The regulation layer injects: “Duplicate action detected (5 consecutive identical calls). The data hasn’t changed. Use search_product(name=…) with specific terms instead.”
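
Because these patterns are syntactic, the simplest detector only needs the action history. The sketch below covers just one pattern, consecutive identical actions; the function name regulate and the thresholds of 2 and 5 are assumptions, and it does not compare observations or detect longer cycles.

```python
from typing import List, Optional


def regulate(
    actions: List[str], soft_after: int = 2, hard_after: int = 5
) -> Optional[str]:
    """Return a recovery message if the history ends in a run of one repeated action."""
    if not actions:
        return None
    last, run = actions[-1], 0
    # Count how many times the most recent action repeats at the end of the history.
    for action in reversed(actions):
        if action != last:
            break
        run += 1
    if run >= hard_after:
        return (
            f"Duplicate action detected ({run} consecutive identical calls). "
            "The current strategy is not making progress. "
            "Stop and re-read the task description."
        )
    if run >= soft_after:
        return (
            f"You've repeated this action {run} times in a row. "
            "Consider a different approach."
        )
    return None


history = ["list_products(page=1)"] * 5
print(regulate(history))
```

Escalating from a soft hint to a hard directive mirrors the severity range described above; where exactly to put the thresholds is a tuning decision per environment.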

3. The Results: 88.5% Relative Improvement, 18 Models

The numbers are worth sitting with. Across 126 model–environment settings (18 models × 7 environments), Life-Harness improves performance in 116 settings. The average relative improvement is 88.5%. No model weights changed. No evaluation environments changed.
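
For readers less used to relative metrics, the snippet below shows what a relative improvement of 88.5% means for a single setting, and what share of settings 116 of 126 represents. The baseline and harness scores are hypothetical numbers picked for illustration, and how the paper aggregates across settings is not specified here.

```python
def relative_improvement(baseline: float, with_harness: float) -> float:
    """Relative gain over the baseline, in percent. Undefined for baseline == 0."""
    return (with_harness - baseline) / baseline * 100


# Hypothetical scores for illustration only; not taken from the paper.
print(round(relative_improvement(20.0, 37.7), 1))  # 88.5

# Share of model-environment settings that improved, from the counts above.
improved_share = round(116 / 126 * 100, 1)
print(improved_share)  # 92.1
```

A relative gain this large is easiest to reach from a low baseline, which is worth keeping in mind when comparing small and large models.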

But the most important result is the transfer experiment. The harness was evolved entirely from Qwen3-4B-Instruct’s training trajectories. Then applied — unchanged — to 17 other models, including:

Instruction-tuned: Llama 3.1, Qwen 2.5, DeepSeek-V2
Reasoning: DeepSeek-R1, QwQ-32B, Qwen3-Plus
Agent-specialized: xLAM-2-32b-fc-r, xLAM-2-8b-fc-r
Small models: Llama 3.2-3B, Qwen3-1.7B

The harness transfers and improves performance in the large majority of these settings (116 of 126). This is not model-specific behavior. The harness is capturing environment-side structure: stable properties of the task domain that are independent of which model is looking at it.

The paper also shows complementarity with model training: Life-Harness enables the base Qwen2.5-32B-Instruct to outperform its tool-specialized derivative xLAM-2-32b-fc-r, while also improving xLAM itself further. The gains are orthogonal.

4. Why This Matters for How We Build Agent Systems

The implicit assumption behind most agentic AI infrastructure is that the model is the bottleneck. Better models → better agents. This paper provides strong evidence that the bottleneck is often elsewhere.

Consider the asymmetry: fine-tuning a model requires GPUs, data pipelines, evaluation suites, checkpoint management. It’s model-specific, so you repeat it for every model family. A harness intervention requires diagnosing trajectory failures and writing interface rules. It’s environment-specific — so you do it once per domain and reuse across models.

This aligns with a pattern we’ve seen repeatedly in agent architecture: the value is in the substrate, not the model. Just as DSPy separates the program from the prompt (letting optimizers find the right instructions automatically), Life-Harness separates the interface from the model (letting trajectory analysis find the right harness interventions automatically).

The parallel with our formal evolution experiment is direct. In Lab 12, the meta-agent didn’t get better because we fine-tuned it — it got better because we added Z3, ArXiv, and OpenRouter tools through MCP configuration. The model stayed the same. The interface changed. And the agent became capable of formal verification, distributed consensus, and self-evaluating distillation — capabilities that had nothing to do with its weights.

Life-Harness makes this principle systematic. Instead of manually curating tool descriptions and recovery strategies, it evolves them from training trajectories. The harness is not hand-crafted — it’s compiled from failure data.

5. The Open Question: Where Does Harness Evolution End?

Each harness is evolved from training trajectories and then frozen during evaluation. The authors are careful about this: they don’t adapt during held-out tasks. But if interface adaptation works so well, why stop at freeze time?

The trajectory regulation layer already does online detection of degeneration. Extending this to online construction of new harness rules — learning from evaluation failures in real time — is the natural next step. The paper’s evaluation, which shows that harnesses transfer across models but are environment-specific, suggests that the right unit of adaptation isn’t the model checkpoint or the task instance — it’s the environment family.

This has implications for how we structure agent deployments. If the harness is the primary lever for deterministic task performance, then the engineering effort should shift: less time on model training pipelines, more time on:

Failure diagnosis infrastructure — automated trajectory analysis that classifies interaction failures
Harness design tooling — systems that compile training failures into interface rules
Cross-environment harness evaluation — understanding which harness interventions transfer to which domains

For the agentic systems we’re building — DSPy-compiled pipelines, Dapr-orchestrated workflows, meta-agent architectures with MCP tool bridges — this paper provides a concrete argument for where to invest. The model is a commodity. The interface is the system.

References

Xu, T., Wen, H., & Li, M. (2026). Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents. arXiv:2605.22166. arxiv.org/abs/2605.22166
Life-Harness Code. github.com/Tianshi-Xu/Life-Harness
Yao, S., et al. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arxiv.org/abs/2406.12045
Barres, V., et al. (2025). τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment.
Liu, X., et al. (2024). AgentBench: Evaluating LLMs as Agents. ICLR 2024. arxiv.org/abs/2308.03688
Chen, X., et al. (2026). Learning to Self-Evolve: LSE. arXiv:2603.18620. arxiv.org/abs/2603.18620
Ni, J., et al. (2026). Trace2Skill: Towards Generalizable Skills from Agent Trajectories. arXiv:2603.25158. arxiv.org/abs/2603.25158
Lin, R., et al. (2026). Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses. AHE.
Lee, S., et al. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses.
Sengupta, S., & Wang, R. (2026). HARBOR: Automated Harness Optimization.
