AI Strategy, Data science & AI, Software development Jun 29, 2026

Training LLMs Without Ground-Truth Labels: Where Reward-Free Reinforcement Learning Is Now Viable

VECTOR Labs Team

Last updated on: Jun 29, 2026

The dominant assumption in reinforcement learning fine-tuning for LLMs is that a verifiable correct answer must exist before a reward signal can be assigned. This assumption is operationally reasonable for mathematics and competitive programming, where outputs can be checked against known solutions, but it excludes a substantial class of enterprise optimisation problems where correctness is not binary and no gold-standard output exists. Recent work on the RiVER framework (Lin et al., arXiv 2026) demonstrates that this constraint is not fundamental: with appropriate reward shaping, deterministic execution feedback from score-based environments can substitute for ground-truth labels and produce training improvements that generalise beyond the training distribution. The implications for teams designing fine-tuning pipelines around heuristic or multi-objective tasks are concrete.

Why Ground-Truth Dependency Has Constrained RL Fine-Tuning

Reinforcement Learning with Verifiable Rewards (RLVR) has produced strong post-training results in settings where the verification function is exact. A model generates an answer, the answer is compared to a known solution, and a binary or near-binary reward is returned. This works well for tasks with unambiguous outputs, but it creates a structural exclusion: any task where the "best" output is defined by a scoring function rather than a known answer cannot be handled within this paradigm without synthetic labelling or human annotation, both of which introduce cost and coverage limitations.
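
The contrast between the two kinds of reward can be sketched in a few lines. The function names, the toy distance matrix, and the routing task below are illustrative assumptions, not taken from Lin et al.; the point is only that the first reward needs a reference answer while the second needs nothing but an executable scorer.

```python
def exact_match_reward(answer: str, reference: str) -> float:
    """RLVR-style reward: 1.0 if the answer matches the known solution, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def score_based_reward(route: list[int], distances: list[list[float]]) -> float:
    """Score-only reward: negative length of a round trip (higher is better).

    No reference route is required; the environment simply measures the candidate.
    """
    legs = zip(route, route[1:] + route[:1])
    return -sum(distances[a][b] for a, b in legs)


# Hypothetical three-stop distance matrix.
distances = [
    [0.0, 2.0, 9.0],
    [2.0, 0.0, 4.0],
    [9.0, 4.0, 0.0],
]

print(exact_match_reward("42", "42"))            # 1.0
print(score_based_reward([0, 1, 2], distances))  # -15.0
```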

The practical consequence is that many enterprise optimisation tasks, including scheduling, resource allocation, configuration search, and combinatorial planning, have been treated as outside the scope of RL-based fine-tuning. These tasks often have execution environments that can score outputs deterministically. The scoring function exists; the missing component has been a training methodology that can use continuous scores as a reliable reward signal without the distortions that naive application of group-relative RL introduces.

The Two Failure Modes in Continuous Reward RL

Scale Dominance

When group-relative policy optimisation is applied directly to raw execution scores across a batch of training instances, instances with larger absolute score ranges exert disproportionate influence on gradient updates. A task where scores vary between 0 and 10,000 will dominate weight updates over a task where scores vary between 0 and 1, even if the model's relative performance across candidates is similar in both cases. Lin et al. (arXiv 2026) identify this as scale dominance, and it causes policy updates to reflect score magnitude rather than solution quality.

The mechanism is straightforward: group-relative advantage estimation normalises within a group of samples drawn from a single instance, but when instances are batched together without calibration, the raw score distributions are not comparable. The commercial implication is that training runs on mixed-difficulty or mixed-scale task sets will be unstable and may converge to policies that perform well on high-magnitude instances while degrading on lower-magnitude ones.
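
A toy calculation makes the effect visible. The sketch below assumes advantages are only mean-centred within each group and uses the sum of absolute advantages as a crude proxy for gradient weight; both are simplifying assumptions for illustration, and the exact estimator in Lin et al. may differ.

```python
from statistics import mean


def centred_advantages(scores):
    """Mean-centre raw scores within one instance's sample group."""
    baseline = mean(scores)
    return [s - baseline for s in scores]


def total_magnitude(advantages):
    """Sum of absolute advantages, used here as a crude proxy for gradient weight."""
    return sum(abs(a) for a in advantages)


# Two hypothetical instances with the same relative spread of candidate quality.
small_scale = [0.2, 0.4, 0.6, 0.8]              # scores on a 0-1 range
large_scale = [2000.0, 4000.0, 6000.0, 8000.0]  # scores on a 0-10,000 range

weight_small = total_magnitude(centred_advantages(small_scale))
weight_large = total_magnitude(centred_advantages(large_scale))
share_large = weight_large / (weight_small + weight_large)

print(f"large-scale instance share of update weight: {share_large:.4f}")
```

Although the candidates are ranked identically in both instances, the large-scale instance accounts for more than 99.9% of the combined weight under these assumptions.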

Frequency Dominance

The second failure mode arises from sampling dynamics rather than score calibration. When a model repeatedly generates suboptimal solutions for a given instance during training, those suboptimal samples accumulate in the training batch and collectively receive more gradient weight than the rare high-quality candidates. Lin et al. (arXiv 2026) term this frequency dominance: the policy is updated more strongly in the direction of common mediocre outputs than toward the infrequent strong solutions.

This is particularly relevant in tasks with sparse high-quality regions in the output space, which is characteristic of combinatorial optimisation. A model that occasionally generates a near-optimal schedule will have that signal diluted by the many average schedules it also generates. Without explicit mechanisms to upweight top-ranked outputs, the training signal systematically undervalues the most informative samples.
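
One simple way to see the dilution is to count how much of the total reward in a group belongs to the single best sample. The sketch assumes each sample's non-negative reward acts as its weight in the update, which is a simplification for illustration and not the formulation in Lin et al.; the scores are invented.

```python
def reward_mass_share(rewards, top_index):
    """Fraction of the group's total reward carried by one sample."""
    return rewards[top_index] / sum(rewards)


# Hypothetical group: seven mediocre schedules and one near-optimal one,
# with scores already rescaled to the range [0, 1].
group = [0.60] * 7 + [0.95]

top_share = reward_mass_share(group, top_index=7)
mediocre_share = 1.0 - top_share

print(f"top sample share: {top_share:.2f}, mediocre samples share: {mediocre_share:.2f}")
```

Under these assumptions the best candidate carries under a fifth of the reward mass, even though it is clearly the most informative sample in the group.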

How RiVER Addresses Both Failure Modes

RiVER applies instance-wise score calibration to address scale dominance: rather than using raw scores, rewards are derived from within-instance rankings across a group of sampled solutions. This makes the reward signal invariant to the absolute score range of any given task instance and ensures that gradient updates reflect relative solution quality rather than score magnitude (Lin et al., arXiv 2026).
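
A minimal sketch of rank-based calibration follows. The mapping of ranks onto the range [0, 1] and the tie handling are assumptions chosen for clarity; the source only states that rewards are derived from within-instance rankings, so this should not be read as the exact RiVER formula.

```python
def rank_rewards(scores):
    """Map raw scores to [0, 1] by within-group rank (ties share the average rank)."""
    n = len(scores)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: scores[i])  # worst first
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


# Same ordering of candidates, very different raw score ranges.
small_scale = [0.2, 0.4, 0.6, 0.8]
large_scale = [2000.0, 4000.0, 6000.0, 8000.0]

print(rank_rewards(small_scale) == rank_rewards(large_scale))  # True
```

Because only the ordering survives, an instance scored in the thousands and one scored between 0 and 1 contribute rewards on the same scale.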

To counter frequency dominance, the framework applies asymmetric reward shaping that assigns amplified positive rewards to top-ranked solutions within each group while retaining bounded, non-zero feedback for other valid solutions. This preserves a learning signal across the full sample distribution without allowing the majority of average outputs to dominate the update. The result is a training regime where the policy is consistently pulled toward the best observed outputs rather than toward the mean.
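
The shaping idea can be sketched as follows. The parameters `top_k`, `top_bonus`, and `floor` are hypothetical names and values introduced for this example; the source describes amplified rewards for top-ranked solutions and bounded, non-zero feedback for the rest, but does not give the functional form used here.

```python
def asymmetric_rewards(scores, top_k=1, top_bonus=3.0, floor=0.1):
    """Rank-based rewards with an amplified bonus for the top_k candidates.

    Non-top candidates receive a reward in [floor, 1.0]; top candidates have
    their base reward multiplied by top_bonus.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)  # best first
    rewards = [0.0] * n
    for position, idx in enumerate(order):
        rank_frac = 1.0 if n == 1 else 1.0 - position / (n - 1)  # 1.0 = best
        base = floor + (1.0 - floor) * rank_frac
        rewards[idx] = base * top_bonus if position < top_k else base
    return rewards


# Hypothetical group: seven average candidates and one strong one.
scores = [50, 52, 54, 56, 58, 60, 62, 95]
shaped = asymmetric_rewards(scores)
print(shaped)
```

With these defaults the best candidate receives a reward of 3.0, while every other candidate stays between 0.1 and 1.0, so weaker samples still carry a signal without outweighing the best one.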

The empirical results from training on 12 AtCoder Heuristic Contest tasks are informative on both counts. Qwen3-8B and GLM-Z1-9B-0414 improved by 8.9% and 9.4% respectively on the ALE rating benchmark. More consequentially, both models also improved on the exact-solution benchmarks LiveCodeBench and USACO, by an absolute average of 2.4% and 3.5%, despite receiving no ground-truth supervision during training (Lin et al., arXiv 2026). Baselines trained with uncalibrated raw scores improved on ALE but showed no transfer to exact-solution tasks, which indicates that the calibration methodology is doing substantive work beyond simply exposing the model to more training compute.

Enterprise Task Categories Where This Methodology Is Applicable

The practical scope of this label-free approach extends to any domain where a deterministic scoring function can evaluate outputs without reference to a known correct answer. The scoring function must be executable, return a continuous or ordinal value, and be consistent across samples drawn from the same instance.

Supply chain scheduling, workforce allocation, and infrastructure configuration each meet these criteria. A scheduling solution can be scored on constraint satisfaction rate and total cost without knowing the globally optimal schedule. A network configuration can be evaluated on latency and packet loss without a reference configuration to compare against. In each case, the execution environment provides the reward signal; the missing component has been a training framework that uses that signal without introducing the distortions that Lin et al. (arXiv 2026) identify.
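
As an illustration of what such an execution environment might look like, the sketch below scores a workforce schedule on constraint satisfaction rate and total cost with no reference schedule. The `Shift` type, the eight-hour limit, and the `cost_weight` value are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    worker: str
    start_hour: int
    end_hour: int
    hourly_cost: float


def score_schedule(shifts, max_hours_per_worker=8, cost_weight=0.001):
    """Constraint-satisfaction rate minus a scaled cost penalty (higher is better)."""
    if not shifts:
        return 0.0
    hours = {}
    for s in shifts:
        hours[s.worker] = hours.get(s.worker, 0) + (s.end_hour - s.start_hour)
    # A worker satisfies the constraint if their total hours stay within the limit.
    satisfied = sum(1 for h in hours.values() if h <= max_hours_per_worker)
    satisfaction_rate = satisfied / len(hours)
    total_cost = sum((s.end_hour - s.start_hour) * s.hourly_cost for s in shifts)
    return satisfaction_rate - cost_weight * total_cost


schedule = [
    Shift("ana", 9, 17, 20.0),  # 8 hours, within the limit
    Shift("ben", 8, 18, 15.0),  # 10 hours, violates the limit
]
print(round(score_schedule(schedule), 2))  # 0.19
```

The same schedule always receives the same score, which is the property the training loop relies on.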

Multi-objective document generation is a related category. Tasks such as regulatory filing drafts, procurement specifications, or clinical trial protocol sections are typically evaluated against rubrics with weighted criteria rather than against a gold-standard document. If those rubrics can be formalised into a scoring function, the same training methodology applies. The constraint is formalisation: the scoring function must be deterministic and executable, not a post-hoc human judgement applied after training.
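
A formalised rubric could look something like the sketch below. The criteria, weights, and checks are invented for illustration and are far simpler than a real regulatory or procurement rubric; the point is that each criterion is a deterministic check that can run at training time.

```python
import re

# Hypothetical rubric: (criterion name, weight, deterministic check).
RUBRIC = [
    ("has_scope_section", 0.4, lambda text: "scope" in text.lower()),
    ("cites_a_clause", 0.4, lambda text: re.search(r"\bclause \d+", text.lower()) is not None),
    ("within_length_limit", 0.2, lambda text: len(text.split()) <= 200),
]


def rubric_score(text):
    """Sum the weights of all rubric criteria that the draft satisfies."""
    return sum(weight for _, weight, check in RUBRIC if check(text))


draft = "Scope: supply of pumps under Clause 12."
print(rubric_score(draft))
print(rubric_score("Short note."))
```

The first draft satisfies all three criteria, while the second only meets the length limit.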

Practical Constraints and Deployment Considerations

The RiVER results were obtained on models in the 8-9 billion parameter range trained on a specific category of heuristic optimisation tasks. Generalisation to substantially different task domains or model scales has not been established in the published work, and teams should treat the reported transfer improvements as indicative rather than guaranteed. The quality of the scoring function is a direct ceiling on training quality: a scoring function that does not capture the true objective will shape the policy toward the proxy metric, not the intended outcome.

Reward function design therefore becomes a first-order engineering concern rather than a secondary consideration. Teams adopting this methodology will need to invest in formalising their scoring criteria with the same rigour applied to training data curation in supervised settings. An imprecisely specified multi-objective scorer will produce a model optimised for the wrong trade-off, and the absence of ground-truth labels means that this misalignment may not surface until the model is evaluated on held-out tasks or deployed against real workloads.

The computational overhead of group-relative sampling, where multiple candidate solutions are generated per instance per training step, is also a practical factor. Training throughput will be lower than equivalent supervised fine-tuning runs, and infrastructure teams should account for this in pipeline planning. The benefit is that annotation cost is eliminated entirely for tasks with executable scoring environments, which shifts the resource profile rather than simply adding to it.

FAQs

What types of scoring functions are compatible with the RiVER training approach?

The scoring function must be deterministic and executable at training time, meaning it must return a consistent numerical score for a given output without requiring human evaluation. Constraint satisfaction scores, cost functions, latency measurements, and weighted rubric evaluations that can be computed programmatically all qualify. Scoring functions that require subjective human judgement or that are non-deterministic across evaluations are not compatible with this methodology in its current form.
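
A quick repeatability check before training can catch scorers that fail this requirement. The helper and the two example scorers below are hypothetical; the second one is made non-repeatable on purpose to show what the check rejects.

```python
from itertools import count


def is_repeatable(scorer, output, repeats=5):
    """Return True if the scorer gives the same value on every repeated call."""
    first = scorer(output)
    return all(scorer(output) == first for _ in range(repeats - 1))


def constraint_score(assignment):
    """Deterministic: fraction of tasks that have a worker assigned."""
    return sum(1 for worker in assignment.values() if worker) / len(assignment)


_calls = count()


def drifting_score(assignment):
    """Non-repeatable on purpose: the result depends on how often it was called."""
    return constraint_score(assignment) + next(_calls)


assignment = {"task_a": "ana", "task_b": "", "task_c": "ben"}
print(is_repeatable(constraint_score, assignment))  # True
print(is_repeatable(drifting_score, assignment))    # False
```

Passing this check does not prove a scorer is deterministic in all conditions, but failing it is a clear sign the scorer is not ready to serve as a reward.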

Does training on score-based tasks without ground-truth labels degrade performance on tasks that do have verifiable answers?

Based on the RiVER results, properly calibrated reward shaping does not degrade exact-solution performance and can improve it. Qwen3-8B and GLM-Z1-9B-0414 both showed absolute improvements on LiveCodeBench and USACO after training exclusively on heuristic optimisation tasks with no ground-truth supervision (Lin et al., arXiv 2026). The key distinction is that uncalibrated raw-score training did not transfer, which suggests the calibration methodology is necessary for this generalisation property to hold.

How does scale dominance differ from standard reward normalisation problems in RL?

Standard reward normalisation in RL addresses variance across timesteps or episodes within a single task. Scale dominance in the RiVER context arises across instances within a training batch, where different task instances have fundamentally different score ranges. Group-relative advantage estimation compares samples within a group drawn from a single instance, but when advantages derived from raw scores are combined across instances with different score magnitudes, the higher-magnitude instances exert disproportionate gradient influence. Instance-wise calibration addresses this by making the reward signal range-invariant before batching.

What is the computational cost difference between this approach and supervised fine-tuning on annotated data?

Group-relative sampling requires generating multiple candidate solutions per training instance per step, which increases forward-pass compute relative to supervised fine-tuning. The number of samples per group is a hyperparameter that directly affects both training signal quality and throughput. Against this, annotation cost is eliminated entirely for tasks with executable scoring environments. For enterprise tasks where annotation would require domain experts, the compute overhead of sampling may well be lower than the time and cost of building and maintaining a labelled dataset at training scale, although this depends on group size, output length, and the cost of running the scorer.

Can this methodology be applied to multi-objective tasks where objectives conflict?

Yes, provided the multi-objective trade-off is formalised into a single executable scoring function before training. The scoring function can be a weighted sum, a Pareto-based ranking, or any other aggregation that returns a consistent scalar or ordinal value. The critical constraint is that the weighting of objectives must be fixed prior to training, because the policy will be shaped toward whatever trade-off the scorer encodes. If the intended trade-off changes post-deployment, retraining or at minimum re-evaluation against the new scoring function is required.
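
Both aggregation styles mentioned above can be sketched briefly. The objective names, the weights, and the particular Pareto scoring rule (counting how many candidates dominate each one) are assumptions made for this example, with every objective expressed so that higher is better.

```python
from types import MappingProxyType

# Hypothetical objective weights, fixed (read-only) before training starts.
WEIGHTS = MappingProxyType({"on_time_rate": 0.7, "cost_saving": 0.3})


def weighted_sum_score(objectives):
    """Scalarise a dict of objective values using the frozen weights."""
    return sum(WEIGHTS[name] * objectives[name] for name in WEIGHTS)


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_scores(candidates):
    """Ordinal score per candidate: minus the number of candidates dominating it."""
    return [-sum(dominates(other, c) for other in candidates) for c in candidates]


# Candidates as (on_time_rate, cost_saving) tuples.
candidates = [(0.9, 0.5), (0.8, 0.4), (0.95, 0.2), (0.7, 0.6)]

print(weighted_sum_score({"on_time_rate": 0.9, "cost_saving": 0.5}))
print(pareto_scores(candidates))  # [0, -1, 0, 0]
```

Only the second candidate is dominated, so it is the only one scored below zero; the other three sit on the Pareto front and are not separated until a weighting is applied.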

At what model scale have these results been validated, and does scale affect applicability?

The published RiVER results cover models in the 8-9 billion parameter range, specifically Qwen3-8B and GLM-Z1-9B-0414 (Lin et al., arXiv 2026). Whether the calibration methodology produces equivalent relative improvements at larger scales, such as 70B or above, has not been established in the current work. Teams considering application to larger models should treat the reported gains as a directional signal and plan for empirical validation at their target scale rather than assuming linear transfer of the improvement magnitudes.
