The central reliability problem in production agent deployment is not whether an agent can complete a task, but whether you can detect trajectory failure before it propagates into irreversible state. Outcome-level evaluation, which scores a trajectory only after it terminates, provides no signal during execution and offers no basis for intervention. Step-level evaluation has long been the theoretical answer, but the standard approach, training a dedicated process reward model, has not transferred to the agentic setting at production scale. Recent work on progress advantage offers a different path, one grounded in the RL post-training pipeline that most teams are already running, and its implications for agent monitoring and failure attribution are worth examining carefully.
Why Process Reward Models Have Not Scaled to Agentic Settings
Process reward models were developed primarily in the context of mathematical reasoning, where trajectories are short, actions are discrete, and human annotators can evaluate intermediate steps with reasonable consistency. Agentic tasks do not share these properties. A web navigation or code execution trajectory may span hundreds of tool calls, each with environment feedback that is stochastic and partially observable.
Annotation at this scale is not economically viable. A single annotated trajectory for a complex multi-step task requires an annotator to understand the full task context, evaluate whether each intermediate action was correct given the state at that point, and do so consistently across thousands of examples. Monte Carlo estimation, the alternative to human annotation, requires rolling out many completions from each intermediate state to estimate the expected outcome, which is computationally prohibitive for long-horizon tasks with real environment interactions.
The result is that most production agent systems operate without any step-level scoring at all. Monitoring is either outcome-based, detecting failures only after task termination, or heuristic, relying on timeout thresholds, error code detection, or token count anomalies that do not reflect actual trajectory quality.
What the Log-Probability Ratio Recovers
The core theoretical result in progress advantage is that, under a general stochastic Markov decision process formulation, the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function at each step (Oh et al., arXiv 2026). This is not an approximation or a heuristic; it is an identity that follows from the structure of RL post-training.
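To see why an identity of this form is plausible, consider the standard KL-regularized objective. The following is a sketch under that assumption only: the coefficient \beta and the soft value functions Q^* and V^* are notation introduced here for illustration, and the exact formulation in Oh et al. may differ. Under this objective the optimal policy is an exponential tilt of the reference policy, and rearranging gives the advantage directly:

```latex
\pi^*(a \mid s) = \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\left(\frac{Q^*(s,a) - V^*(s)}{\beta}\right)
\quad\Longrightarrow\quad
\beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
  = Q^*(s,a) - V^*(s) = A^*(s,a)
```

In this sketch the identity is exact for the optimal policy \pi^*. A trained policy that has not fully converged would recover the advantage only to the extent that it approximates \pi^*.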
The practical consequence is significant. If you have run RL post-training on a base model, you already have both components: the trained policy and the reference policy from which it diverged. Computing the log-probability ratio at each step of a trajectory requires no additional training, no annotation pipeline, and no separate reward model. The signal is a byproduct of the training infrastructure you have already built.
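As a rough illustration of how little machinery the computation needs, the sketch below scores each step from per-token log-probabilities that are assumed to have been collected already from both models. The function names, the `beta` scale, and the choice to sum token-level differences over a multi-token action are assumptions made for illustration, not details taken from Oh et al.

```python
from typing import List, Sequence, Tuple


def step_log_ratio(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Score one step (for example, one tool call) from token log-probs.

    Summing token-level differences gives the log-probability ratio of
    the whole action under the two models. `beta` is a hypothetical
    scale factor; leave it at 1.0 to get the raw ratio.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")
    return beta * sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def trajectory_scores(
    steps: Sequence[Tuple[Sequence[float], Sequence[float]]],
    beta: float = 1.0,
) -> List[float]:
    """Score every step of a trajectory, in order."""
    return [step_log_ratio(p, r, beta) for p, r in steps]


# Usage: two steps, each given as (policy log-probs, reference log-probs).
example = [([-0.5, -1.0], [-1.0, -1.5]), ([-2.0], [-1.0])]
print(trajectory_scores(example))
```

A positive score means the trained policy assigned the chosen action more probability than the reference policy did; a negative score means less.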
This distinguishes progress advantage from confidence-based baselines such as token-level entropy or sequence probability, which measure model uncertainty rather than task progress. A model can be highly confident in a step that is directionally wrong, and highly uncertain about a step that is correct but unusual. Progress advantage, by measuring the divergence between trained and reference policy, captures whether the trained policy has learned to prefer this step in the context of task completion.
Domain-Agnostic Signal Without Task-Specific Training
One practical constraint of trained process reward models is that they are task-specific. A PRM trained on mathematical reasoning does not transfer to tool-use trajectories, and a PRM trained on web navigation does not transfer to code execution. Each new task domain requires a new annotation effort and a new training run.
Progress advantage carries no such constraint. Because the signal is derived from the policy divergence rather than from domain-specific labels, it applies wherever the RL-trained policy has been deployed. Oh et al. (arXiv 2026) validate this across five benchmarks spanning different task types and four model families, finding consistent performance gains over confidence-based baselines and, notably, over dedicated trained reward models, despite requiring no task-specific training.
This matters operationally. Teams managing agents across multiple task domains, which is the common production configuration, do not need to maintain a separate PRM for each domain. The monitoring signal scales with the policy, not with the annotation budget.
Implications for Uncertainty Quantification
Uncertainty quantification in agent systems has typically relied on ensemble methods or token-level probability distributions, both of which are expensive or poorly calibrated for long-horizon tasks. Progress advantage provides a step-level signal that can be used to estimate trajectory-level uncertainty by aggregating intermediate scores across steps.
A trajectory where progress advantage scores are consistently positive indicates that the trained policy is systematically preferring the chosen actions over what the reference policy would have selected, which is evidence of task-directed progress. A trajectory where scores are mixed or declining indicates that the policy is reverting toward reference-policy behavior, which is a signal that the agent is losing its learned task orientation.
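One hedged way to make "consistently positive" versus "mixed or declining" concrete is to summarize a trajectory by its mean score, the fraction of positive steps, and a least-squares trend. These three statistics and their names are illustrative choices, not an aggregation prescribed by the source.

```python
from statistics import mean
from typing import Dict, Sequence


def trajectory_summary(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize per-step scores into trajectory-level statistics."""
    n = len(scores)
    if n == 0:
        raise ValueError("trajectory has no scored steps")
    avg = mean(scores)
    if n == 1:
        slope = 0.0
    else:
        # Ordinary least-squares slope of score against step index.
        x_mean = (n - 1) / 2
        num = sum((i - x_mean) * (s - avg) for i, s in enumerate(scores))
        den = sum((i - x_mean) ** 2 for i in range(n))
        slope = num / den
    fraction_positive = sum(1 for s in scores if s > 0) / n
    return {"mean": avg, "slope": slope, "fraction_positive": fraction_positive}


# Usage: a trajectory whose scores decline step by step.
print(trajectory_summary([1.0, 0.5, 0.0, -0.5]))
```

A negative slope with a falling fraction of positive steps would correspond to the "declining" pattern described above.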
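This structure makes progress advantage useful as an input to uncertainty estimation even before any calibration against held-out outcome labels. Because the signal is relative, comparing the trained policy to the reference policy, the sign and trend of scores within a trajectory are informative without an absolute threshold. Absolute thresholds for triggering intervention are a separate matter and do need calibration against outcome data.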
Failure Attribution Over Long-Horizon Trajectories
Credit assignment in long-horizon trajectories is one of the hardest problems in agent observability. When a task fails, identifying which step introduced the failure requires either replaying the trajectory with counterfactual actions or having a step-level score that correlates with downstream outcome. Without such a signal, post-failure analysis is essentially manual inspection of the full trajectory log.
Progress advantage provides a structured basis for failure attribution. Steps where the log-probability ratio drops sharply, indicating the trained policy assigned much lower probability to the chosen action than the reference policy, are candidates for the point at which the trajectory began to diverge from task-directed behavior. This does not identify the root cause automatically, but it narrows the search space from the full trajectory to a small number of candidate steps.
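A minimal sketch of this narrowing step follows. It reads "drops sharply" as a large decrease relative to the preceding step, which is one possible interpretation; the function name and the `top_k` and `min_drop` parameters are hypothetical.

```python
from typing import List, Sequence


def candidate_failure_steps(
    scores: Sequence[float], top_k: int = 3, min_drop: float = 0.0
) -> List[int]:
    """Return indices of the steps with the largest score drops.

    The drop at step i is scores[i - 1] - scores[i]. Only drops larger
    than `min_drop` are kept, ordered from largest to smallest.
    """
    drops = [(scores[i - 1] - scores[i], i) for i in range(1, len(scores))]
    drops = [(d, i) for d, i in drops if d > min_drop]
    drops.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in drops[:top_k]]


# Usage: inspect the two sharpest drops in a failed trajectory.
failed = [0.8, 0.7, -1.5, -1.4, 0.2, -0.9]
print(candidate_failure_steps(failed, top_k=2))
```

The returned indices are starting points for manual inspection, not a diagnosis.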
For teams building agent observability infrastructure, this has a direct engineering implication. Storing progress advantage scores alongside trajectory logs at inference time enables post-hoc analysis of failed trajectories without full trajectory replay. Computing the score itself adds negligible overhead once the reference policy logits are available; the real cost lies in obtaining those logits, which requires a forward pass through the reference policy.
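One way such storage could look is a JSON Lines log with one record per step. The record fields and helper names below are hypothetical; an in-memory stream stands in for a real log file.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import List, TextIO


@dataclass
class StepRecord:
    trajectory_id: str
    step_index: int
    action: str                # e.g. the serialized tool call
    progress_advantage: float  # per-step log-probability ratio


def append_step(stream: TextIO, record: StepRecord) -> None:
    """Write one step as a single JSON line."""
    stream.write(json.dumps(asdict(record)) + "\n")


def load_scores(stream: TextIO, trajectory_id: str) -> List[float]:
    """Return one trajectory's scores in step order, without replaying it."""
    rows = [json.loads(line) for line in stream if line.strip()]
    rows = [r for r in rows if r["trajectory_id"] == trajectory_id]
    rows.sort(key=lambda r: r["step_index"])
    return [r["progress_advantage"] for r in rows]


# Usage with an in-memory stream standing in for a log file.
log = io.StringIO()
append_step(log, StepRecord("t-1", 0, "open_file", 0.42))
append_step(log, StepRecord("t-1", 1, "write_file", -1.30))
log.seek(0)
print(load_scores(log, "t-1"))
```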
Intervention Checkpoint Design
The most direct production application of progress advantage is in designing intervention checkpoints: points in a trajectory where the system pauses, escalates, or replans based on a signal that the current trajectory is at risk. Without a step-level score, intervention checkpoints are either time-based, triggering after a fixed number of steps, or error-based, triggering on explicit failure signals from the environment.
Progress advantage enables score-based checkpoints. A threshold on the cumulative or rolling progress advantage score can trigger a replanning step before the environment returns an explicit failure signal. This is the difference between detecting that an agent is about to fail and detecting that it has failed, and the operational value of that distinction is substantial in settings where actions are partially irreversible, such as file system modifications, API calls with side effects, or database writes.
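A rolling-score checkpoint of this kind can be sketched in a few lines. The class name, window size, and threshold value are placeholders; real values would have to come from calibration on outcome data.

```python
from collections import deque


class RollingCheckpoint:
    """Flag a trajectory when the rolling mean score falls below a threshold."""

    def __init__(self, window: int = 5, threshold: float = -1.0) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        self._scores = deque(maxlen=window)

    def update(self, score: float) -> bool:
        """Record one step score; return True if intervention should trigger."""
        self._scores.append(score)
        if len(self._scores) < self._scores.maxlen:
            return False  # not enough steps yet for a full window
        return sum(self._scores) / len(self._scores) < self.threshold


# Usage: check after every step, before executing the next action.
checkpoint = RollingCheckpoint(window=3, threshold=-0.5)
for step, score in enumerate([0.4, 0.2, -0.3, -1.0, -1.2]):
    if checkpoint.update(score):
        print(f"replan before continuing past step {step}")
```

Waiting for a full window avoids triggering on a single noisy early step, at the cost of no coverage for the first few steps of a trajectory.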
The design of these thresholds requires calibration against task-specific outcome data, but the signal itself is available without additional training. Teams can begin collecting progress advantage scores during inference immediately after RL post-training and use the resulting dataset to set empirically grounded intervention thresholds.
Connecting to Production Observability Architecture
Instrumenting progress advantage into an existing agent system requires access to reference policy logits at inference time. For teams that have run RL post-training using standard frameworks, the reference policy is typically retained as a frozen checkpoint used for KL divergence regularization during training. Deploying it alongside the trained policy for inference-time scoring adds memory overhead and a second forward pass per step, but no additional training cost.
The monitoring architecture then becomes: compute progress advantage at each step, store scores in the trajectory log, define intervention thresholds based on calibrated outcome data, and surface anomalous score patterns to the observability layer. This integrates naturally with session-centered agent architectures where trajectory state is managed explicitly, as we have described in our prior work on agent runtime state management (see Agent Runtime State: The Hidden Liability in Multi-Agent Systems).
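The four stages can be wired together as a single loop. The sketch below is deliberately generic: `score_step`, `should_intervene`, and the list-based `log` are hypothetical stand-ins for the scoring call, the calibrated threshold rule, and the trajectory store.

```python
from typing import Callable, Dict, Iterable, List, Optional, TypeVar

Step = TypeVar("Step")


def monitor(
    steps: Iterable[Step],
    score_step: Callable[[Step], float],
    should_intervene: Callable[[List[float]], bool],
    log: List[Dict[str, float]],
) -> Optional[int]:
    """Score, log, and check each step; return the index that triggered.

    Returns None if the trajectory finished without an intervention.
    """
    history: List[float] = []
    for index, step in enumerate(steps):
        score = score_step(step)                      # 1. compute the score
        history.append(score)
        log.append({"step": index, "score": score})   # 2. store it
        if should_intervene(history):                 # 3. apply the threshold
            return index                              # 4. surface to caller
    return None


# Usage with toy inputs: steps are already scores, and the rule fires
# when the latest score drops below -1.0.
records: List[Dict[str, float]] = []
triggered = monitor(
    [0.5, -0.2, -2.0, 0.1],
    score_step=lambda s: s,
    should_intervene=lambda h: h[-1] < -1.0,
    log=records,
)
print(triggered, len(records))
```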
The deeper point is that progress advantage does not require a new monitoring paradigm. It is a signal that fits into existing observability infrastructure, provided that infrastructure is designed to capture and store per-step metadata rather than only terminal outcomes. Teams that have already invested in step-level logging face a lower implementation cost; those that log only outcomes will need to extend their trajectory capture before they can use the signal.
FAQs
Computing the log-probability ratio does add inference overhead: it requires forward passes through both the trained policy and the reference policy. In practice, this means deploying two copies of the model, that is, retaining the reference policy weights alongside the fine-tuned weights. For a 7B parameter model in half-precision, that is approximately 14 GB of additional GPU memory. For larger models, the cost scales proportionally. Whether this overhead is acceptable depends on the value of the step-level signal relative to the cost of false negatives in your specific deployment context.
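The memory figure is simple arithmetic: parameter count times bytes per parameter. The helper below, whose name is hypothetical, reproduces it for the weights only; activation memory and any KV cache for the reference model are not included.

```python
def reference_weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for one extra set of weights, in GB (10^9 bytes).

    Half-precision (fp16 or bf16) uses 2 bytes per parameter.
    """
    return n_params * bytes_per_param / 1e9


# Usage: a 7B-parameter reference policy held in half-precision.
print(reference_weight_memory_gb(7e9))
```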
The derivation in Oh et al. (arXiv 2026) is formulated under a general stochastic Markov decision process, which means it is not tied to a specific RL algorithm. It applies wherever RL post-training has produced a trained policy that diverges from a reference policy, which covers GRPO, PPO, and similar KL-regularized training methods. The key requirement is that a reference policy exists and is accessible, which is standard in most RL post-training pipelines that use KL divergence as a regularization term.
Threshold calibration requires a dataset of completed trajectories with known outcomes, labelled by success or failure. Progress advantage scores at each step can then be correlated with downstream outcomes to identify score ranges that reliably predict failure. This calibration is task-specific in practice, even though the signal itself is domain-agnostic, because the distribution of advantage scores varies with task complexity and trajectory length. Teams should treat threshold calibration as an ongoing operational process rather than a one-time setup, updating thresholds as the agent's task distribution shifts.
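As one hedged example of such a calibration, the sketch below sweeps candidate thresholds over a labelled dataset and keeps the one that maximizes Youden's J statistic (true positive rate minus false positive rate for failure detection). The selection criterion, the function name, and the use of a single summary score per trajectory are assumptions; other criteria may suit a given deployment better.

```python
from typing import Sequence, Tuple


def calibrate_threshold(records: Sequence[Tuple[float, bool]]) -> Tuple[float, float]:
    """Pick a failure threshold from (summary_score, succeeded) pairs.

    A trajectory is flagged as a likely failure when its score is
    strictly below the threshold. Returns (threshold, youden_j).
    """
    failures = [score for score, ok in records if not ok]
    successes = [score for score, ok in records if ok]
    if not failures or not successes:
        raise ValueError("need both successful and failed trajectories")

    best_threshold, best_j = 0.0, float("-inf")
    for t in sorted({score for score, _ in records}):
        tpr = sum(s < t for s in failures) / len(failures)    # failures caught
        fpr = sum(s < t for s in successes) / len(successes)  # false alarms
        if tpr - fpr > best_j:
            best_threshold, best_j = t, tpr - fpr
    return best_threshold, best_j


# Usage: each pair is (e.g. minimum rolling score, task succeeded?).
data = [(-2.0, False), (-1.5, False), (-1.0, False),
        (-0.5, True), (0.3, True), (0.8, True)]
print(calibrate_threshold(data))
```

Rerunning the same procedure on fresh trajectories is one way to keep thresholds aligned with a shifting task distribution.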
Progress advantage is not available for agents trained with supervised fine-tuning alone. The signal depends on the divergence between a trained policy and a reference policy that was produced by RL post-training. If an agent was trained using supervised fine-tuning only, there is no meaningful reference policy in the RL sense, and the log-probability ratio does not recover the advantage function. Teams using SFT-only agents would need to run RL post-training to access the signal, which is a non-trivial additional step. For agents already on an RL post-training pipeline, the signal is available at no additional training cost.
Token-level entropy measures model uncertainty about the next token, which is a property of the model's output distribution rather than a measure of task progress. A model can be highly certain about a token that moves the trajectory in the wrong direction, and uncertain about a token that is correct but unusual in context. Progress advantage measures whether the trained policy prefers the chosen action over what the reference policy would have selected, which is a more direct proxy for task-directed behavior. Oh et al. (arXiv 2026) show empirically that progress advantage consistently outperforms confidence-based baselines, including entropy-based measures, across the benchmarks they evaluate.
The primary failure mode is that the signal reflects policy divergence from the reference, not ground-truth task correctness. If the RL training process itself was poorly specified, for example due to reward hacking or distributional shift between training and deployment, the trained policy may have learned to prefer actions that score highly on the training reward but fail in deployment. In that case, progress advantage scores would be positive for steps that are actually harmful. The signal is only as reliable as the specification of the RL training that produced it. A secondary failure mode is that very long trajectories may accumulate score drift unrelated to the current step, which argues for using rolling-window aggregation rather than cumulative scores when comparing against intervention thresholds.

