Enterprise AI teams have spent the past two years treating model selection as the primary performance lever. The assumption has been that upgrading to a more capable frontier model will resolve most quality and reliability problems. That assumption is increasingly wrong. The configuration layer surrounding the model (the system prompts, tool definitions, routing logic, retry policies, and output constraints collectively known as the agent harness) now determines a larger share of production performance than the model weights themselves. For organisations running custom LLM-based agents at scale, this means harness engineering has moved from a one-time setup task to an ongoing infrastructure discipline.
Why the Harness Matters More Than It Used to
When agents were simple prompt-response systems, the harness was thin: a system prompt, perhaps a few examples, and a temperature setting. Modern production agents are different in kind. They execute multi-step reasoning chains, call external tools, maintain state across turns, and branch conditionally based on intermediate outputs. Each of those steps introduces a surface where poorly specified operating rules degrade performance.
The compounding effect is significant. A harness that loses two percentage points of task success at each of five sequential steps completes only about 90% of tasks end to end (0.98^5 ≈ 0.904, assuming the steps fail independently and the model would otherwise succeed), roughly 10 percentage points below what the model could theoretically achieve. That gap does not appear in benchmark evaluations run against single-turn prompts. It only becomes visible in production traces.
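The arithmetic behind that estimate can be checked in a few lines. This is a minimal sketch that assumes each step fails independently and that the model alone would succeed at every step; the function name is illustrative, not taken from any library.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


# Assumption: the model on its own would complete every step (ceiling of 1.0).
ceiling = 1.0
observed = end_to_end_success(0.98, 5)
gap_points = (ceiling - observed) * 100

print(f"end-to-end success: {observed:.3f}, gap: {gap_points:.1f} points")
# prints "end-to-end success: 0.904, gap: 9.6 points"
```

The same function shows why longer chains are more exposed: the per-step loss stays at two points, but the end-to-end gap keeps growing as steps are added.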
What Execution Traces Reveal
Every agent run generates a trace: the sequence of model calls, tool invocations, intermediate outputs, and final results that constitute a single task execution. Most teams treat these traces as debugging artefacts. They are more usefully understood as a structured signal about where the harness is failing.
Traces expose specific failure modes that model evaluations miss. A harness may instruct the agent to verify a calculation before proceeding, but traces may show that the verification step is consistently skipped when the model is confident. That is not a model capability problem. It is a specification problem in the operating rules, and it is diagnosable from trace data without any model retraining.
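A check of this kind can be expressed as a simple query over trace data. The sketch below assumes a hypothetical trace record that lists step names in execution order; the step labels ("calculate", "verify") and the sample traces are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    trace_id: str
    steps: List[str]   # hypothetical step labels, in execution order
    succeeded: bool


def skip_rate(traces: List[Trace], required_step: str, after_step: str) -> float:
    """Fraction of traces containing `after_step` where `required_step` never follows it."""
    relevant = [t for t in traces if after_step in t.steps]
    if not relevant:
        return 0.0
    skipped = [
        t for t in relevant
        if required_step not in t.steps[t.steps.index(after_step) + 1:]
    ]
    return len(skipped) / len(relevant)


# Invented sample data: two of the three traces that calculate never verify.
traces = [
    Trace("t1", ["plan", "calculate", "verify", "answer"], True),
    Trace("t2", ["plan", "calculate", "answer"], False),
    Trace("t3", ["plan", "calculate", "answer"], True),
    Trace("t4", ["plan", "answer"], True),
]
print(f"verification skipped in {skip_rate(traces, 'verify', 'calculate'):.0%} of runs")
```

A persistently high skip rate for a step the harness marks as mandatory points at the wording of the rule, not at the model's ability to perform the step.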
The commercial implication is direct. Organisations that build trace analysis into their agent operations pipeline gain a systematic basis for harness improvement. Those that do not are left with anecdotal debugging, which does not scale across hundreds of agent variants or millions of monthly executions.
The Self-Improvement Mechanism
The emerging approach to harness optimisation treats the operating rules as a variable to be updated from execution data, not a fixed configuration to be manually revised. The process works in three phases.
Trace Collection and Failure Attribution
First, traces from production or evaluation runs are collected and labelled by outcome. Successful completions and failures are distinguished, and the failure cases are analysed to identify which step in the execution sequence caused the breakdown. This attribution step is non-trivial: a failure at step five may originate from an ambiguous instruction at step two.
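One way to picture the attribution step is as a tally of failure origins across labelled traces. The sketch below is an assumption-laden illustration: the record types and the `root_cause` field (an annotation naming the earlier step blamed for a failure) are hypothetical, and in practice producing that annotation is the hard part.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepRecord:
    name: str                         # hypothetical step label
    ok: bool                          # did this step produce a usable output?
    root_cause: Optional[str] = None  # earlier step blamed for this failure, if annotated


@dataclass
class LabelledTrace:
    trace_id: str
    steps: List[StepRecord]
    succeeded: bool                   # outcome label for the whole run


def attribute_failure(trace: LabelledTrace) -> Optional[str]:
    """Return the step blamed for a failed trace, or None if it succeeded."""
    if trace.succeeded:
        return None
    for step in trace.steps:
        if not step.ok:
            # Prefer the annotated origin over the step where the symptom appeared.
            return step.root_cause or step.name
    return None  # failed, but no step was flagged


def failure_histogram(traces: List[LabelledTrace]) -> Counter:
    """Count attributed failure origins across a batch of traces."""
    blamed = (attribute_failure(t) for t in traces)
    return Counter(b for b in blamed if b is not None)


# Invented sample data: t1 breaks at "report" but is traced back to "plan".
traces = [
    LabelledTrace("t1", [StepRecord("parse", True), StepRecord("plan", True),
                         StepRecord("report", False, root_cause="plan")], False),
    LabelledTrace("t2", [StepRecord("parse", True), StepRecord("plan", False)], False),
    LabelledTrace("t3", [StepRecord("parse", True), StepRecord("plan", True),
                         StepRecord("report", True)], True),
]
print(failure_histogram(traces).most_common())  # prints [('plan', 2)]
```

Counting origins instead of symptoms is what directs the rewrite effort at the instruction that caused the breakdown.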
Rule Rewriting
Second, a separate model or optimisation process proposes revisions to the operating rules that would have prevented the attributed failures. This is not fine-tuning the base model. It is rewriting the harness instructions based on evidence from prior runs. The distinction matters because harness updates are fast, cheap, and reversible. Model fine-tuning is none of those things.
Evaluation and Deployment
Third, the revised harness is evaluated against a held-out task set before deployment. This creates a feedback loop in which the harness improves incrementally with each cycle. Research in this area has reported task performance gains of up to 60% relative to the initial harness configuration, achieved without any change to the underlying model.
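The evaluation gate can be sketched as a comparison of success rates on the same held-out tasks. Everything here is illustrative: the function names, the two-point threshold, and the sample results are assumptions, not figures from the research mentioned above.

```python
from typing import List


def success_rate(results: List[bool]) -> float:
    """Share of held-out tasks completed successfully."""
    if not results:
        raise ValueError("no evaluation results")
    return sum(results) / len(results)


def relative_gain(baseline: float, candidate: float) -> float:
    """Improvement of the candidate expressed relative to the baseline rate."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (candidate - baseline) / baseline


def passes_gate(baseline_results: List[bool], candidate_results: List[bool],
                min_absolute_gain: float = 0.02) -> bool:
    """Deploy only if the candidate beats the baseline by a minimum margin."""
    gain = success_rate(candidate_results) - success_rate(baseline_results)
    return gain >= min_absolute_gain


# Invented results on a ten-task held-out set.
baseline = [True] * 6 + [False] * 4    # current harness: 6 of 10
candidate = [True] * 8 + [False] * 2   # revised harness: 8 of 10

print(passes_gate(baseline, candidate))
print(f"{relative_gain(success_rate(baseline), success_rate(candidate)):.1%}")
```

A real gate would also need enough tasks for the difference to be statistically meaningful; a ten-task set is used here only to keep the example readable.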
The Infrastructure Requirements
Running this loop in production requires infrastructure that most enterprise AI teams have not yet built. Trace storage needs to be structured, queryable, and linked to outcome labels. The rewriting process needs a controlled environment where candidate harnesses can be tested without affecting production traffic. Version control for harness configurations needs to be as rigorous as version control for application code.
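As an illustration of what rigorous harness versioning might involve, the sketch below fingerprints a configuration and stores it alongside the evaluation result that justified it. The record layout, field names, and sample values are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def fingerprint(config: dict) -> str:
    """Stable SHA-256 of a harness config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class HarnessVersion:
    version_id: str
    parent_id: Optional[str]    # version this one was derived from
    config_sha256: str          # fingerprint of the exact rules deployed
    eval_suite: str             # hypothetical name of the held-out task set
    eval_success_rate: float    # result that justified the change


# Invented example configuration and evaluation result.
config_v2 = {
    "system_prompt": "Verify every calculation before answering.",
    "max_retries": 2,
    "tools": ["calculator"],
}
v2 = HarnessVersion("v2", "v1", fingerprint(config_v2), "held-out-suite-a", 0.81)
print(v2.version_id, v2.config_sha256[:12])
```

Hashing a canonical serialisation means two configurations with the same content always get the same fingerprint, which makes it straightforward to confirm which rules a given trace was produced under.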
These are not novel engineering problems. They are applications of existing MLOps patterns to a new artefact type. The gap is organisational: teams that have invested in model deployment infrastructure have not yet extended that investment to harness lifecycle management.
Failure Modes at Scale
Without a systematic improvement loop, harness quality degrades relative to task complexity over time. New use cases are added to existing agents, edge cases accumulate, and the original operating rules become an increasingly poor fit for the actual workload. Teams respond by adding patches to the system prompt, which introduces contradictions and increases token cost without reliably improving performance.
A second failure mode is harness divergence across agent variants. Large deployments often run dozens of specialised agents, each with a separately maintained harness. Without a shared improvement process, the quality gap between the best-maintained and worst-maintained agents widens. This creates uneven user experience and makes aggregate performance reporting unreliable.
Regulatory and Audit Implications
In regulated industries, the harness is also a compliance artefact. The operating rules define what the agent is permitted to do, what it must refuse, and how it handles sensitive data categories. If those rules are updated manually and informally, the audit trail is incomplete. Automated harness versioning with linked evaluation results provides a defensible record of what rules were in effect at any point in time and what evidence supported each revision.
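The question of which rules were in effect at a given time reduces to a lookup over a sorted deployment history. This is a minimal sketch with an invented history; the tuple layout and version identifiers are assumptions.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def version_in_effect(history: List[Tuple[datetime, str]],
                      at: datetime) -> Optional[str]:
    """Return the harness version deployed at time `at`.

    `history` must be sorted by deployment time, oldest first.
    """
    deployed_times = [deployed_at for deployed_at, _ in history]
    index = bisect_right(deployed_times, at)
    return history[index - 1][1] if index else None


# Invented deployment history.
history = [
    (datetime(2024, 1, 10, 9, 0), "v1"),
    (datetime(2024, 2, 3, 14, 30), "v2"),
    (datetime(2024, 3, 21, 8, 15), "v3"),
]
print(version_in_effect(history, datetime(2024, 2, 20)))  # prints v2
```

Combined with a per-version link to evaluation results, a lookup like this answers both halves of the audit question: what the rules were, and what evidence supported them.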
Financial services and healthcare deployments face the most immediate pressure here. Regulators examining AI-assisted decisions will ask how operating rules are maintained and validated. An organisation that can produce a versioned harness history with associated evaluation metrics is in a materially better position than one that cannot.
Planning the Transition
The practical starting point is trace infrastructure, not the optimisation loop itself. Teams that do not yet have structured, outcome-labelled trace storage cannot run systematic harness analysis regardless of what optimisation tooling they adopt. Building that foundation first is the correct sequencing.
Once trace infrastructure is in place, the next investment is evaluation harnesses: standardised task sets that can be run against candidate harness configurations in a controlled environment. These serve double duty as regression tests when model versions change and as benchmarks for harness improvement cycles.
The optimisation loop itself can then be introduced incrementally, starting with the highest-volume agents where trace data is most abundant and performance gains have the largest revenue or cost impact. That prioritisation also limits the blast radius of any early-stage process failures.
FAQs
A system prompt is one component of an agent harness. The harness is the full configuration layer that governs agent behaviour: system prompts, tool definitions, output format constraints, retry logic, routing rules between sub-agents, and any structured instructions that shape how the model executes a task. In simple single-turn applications, the harness and the system prompt are nearly equivalent. In multi-step agentic workflows, the harness is a substantially more complex artefact with many more points of failure.
Fine-tuning modifies the model weights to change how the model responds to inputs. Harness improvement modifies the operating rules that govern what the model is asked to do and how its outputs are handled. Fine-tuning requires significant compute, a labelled training dataset, and a redeployment cycle that typically takes days to weeks. Harness updates are text changes that can be tested and deployed in hours. The two approaches address different failure modes: fine-tuning is appropriate when the model lacks a capability, whereas harness improvement is appropriate when the model has the capability but the operating rules are not eliciting it correctly.
The reported gain of up to 60% refers to task success rate improvement relative to an initial, manually configured harness baseline. It reflects the upper range reported in research on systematic harness optimisation and is not a guaranteed outcome for every deployment. The gain is largest when the initial harness is poorly specified, the task set is well-defined and measurable, and trace volume is sufficient to support meaningful analysis. Deployments with already-mature harnesses, ambiguous success criteria, or low trace volumes will see smaller gains. The more useful framing for planning purposes is that systematic harness optimisation consistently outperforms ad hoc manual tuning, and the performance gap widens as task complexity increases.
There is no universal threshold for how much trace data harness optimisation needs, but the practical floor is determined by the statistical requirement that failure cases be numerous enough to distinguish systematic harness problems from random model variance. For a binary success/failure task, several hundred labelled traces per agent variant is a reasonable starting point. Below that volume, the failure attribution step becomes unreliable and proposed harness revisions may overfit to noise. High-volume production agents typically accumulate sufficient trace data within days. Lower-volume agents may require pooling traces across similar task types or running targeted evaluation runs to supplement production data.
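The effect of trace volume can be illustrated with a confidence interval on an observed failure rate. The sketch below uses the Wilson score interval, which is one common choice and not something the guidance above prescribes; the trace counts are invented.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a failure rate (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Same observed 10% failure rate at two invented trace volumes.
for failures, n in [(5, 50), (50, 500)]:
    low, high = wilson_interval(failures, n)
    print(f"n={n}: failure rate between {low:.1%} and {high:.1%}")
```

With 50 traces the interval is wide enough that a step failing 10% of the time is hard to separate from ordinary variance; with 500 it narrows considerably, which is the intuition behind a floor of several hundred traces.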
Harness configurations should be stored in version control with the same discipline applied to application code: each version tagged, changes reviewed before merge, and rollback paths maintained. The additional requirement specific to harnesses is linking each version to the evaluation results that justified the change. This creates an auditable record of what operating rules were in effect at any point and what evidence supported each revision. For regulated industries, this record is not optional. For all deployments, it is the mechanism that prevents harness drift and makes regression testing tractable when the underlying model is updated.
Upgrading the model and optimising the harness address different constraints. Model upgrades improve the ceiling of what the agent can do. Harness optimisation closes the gap between that ceiling and what the agent actually does in production. If failure analysis shows that the model is producing correct intermediate outputs but the harness is not using them correctly, a model upgrade will not help. If the model is consistently failing on tasks that are within its documented capability range, harness improvement is the more efficient intervention. In practice, the most common situation is that both the harness and the model have room to improve, and harness work should be prioritised first because it is faster, cheaper, and provides a cleaner signal about whether a model upgrade would add further value.

