If small models are given longer thinking time, their performance can surpass that of larger models.
Recently, the industry has shown unprecedented enthusiasm for research on small models, using "practical tricks" to make them outperform much larger models.
It is fair to say that the focus on improving smaller models' performance was inevitable. The development of large language models has long been dominated by scaling train-time compute. Although this paradigm has proven very effective, the resources required to pre-train ever-larger models have become extraordinarily expensive, with billion-dollar clusters now appearing.
This trend has sparked great interest in a complementary approach: test-time compute scaling. Rather than relying on an ever-growing pre-training budget, test-time methods use dynamic inference strategies that let models "think longer" on harder problems. A prominent example is OpenAI's o1 model, whose performance on hard math problems keeps improving as the amount of test-time compute increases.
While it is unclear how o1 was trained, recent research from DeepMind suggests that test-time compute can be scaled optimally through strategies such as iterative self-improvement or searching the solution space with a reward model. By adaptively allocating test-time compute per prompt, smaller models can compete with, and sometimes even surpass, larger and more resource-intensive models. This approach is especially valuable when memory is constrained and the available hardware cannot run a larger model. However, these promising results were demonstrated with closed-source models, and no implementation details or code were released.
DeepMind paper: https://arxiv.org/pdf/2408.03314
In recent months, HuggingFace has been reverse-engineering and reproducing these results. In this blog post they introduce:
- Compute-optimal scaling: Boosting the math capabilities of open models at test time by implementing DeepMind's recipe.
- Diverse Verifier Tree Search (DVTS): An extension they developed for verifier-guided tree search. This simple yet effective method improves diversity and delivers better performance, especially with large test-time compute budgets.
- Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM.
So how does compute-optimal scaling perform in practice? As the figure below shows, given enough "thinking time", the small 1B and 3B Llama Instruct models outperform the much larger 8B and 70B models on the challenging MATH-500 benchmark.
HuggingFace co-founder and CEO Clem Delangue said that just 10 days after OpenAI unveiled o1, they were excited to release an open-source version of the breakthrough technology behind its success: scaling test-time compute. By giving a model longer "thinking time", a 1B model can beat an 8B model, and a 3B model can beat a 70B model. And, of course, the full technical recipe is open source.
Many commenters reacted to these results with disbelief, calling them incredible and hailing them as a victory for small models.
Next, HuggingFace delves into the reasons behind the above results and helps readers understand practical strategies for implementing test-time compute scaling.
Scaling Test-Time Compute Strategies
Scaling test-time compute mainly relies on the following two strategies:
- Self-Improvement: The model iteratively improves its own outputs or "thoughts" by identifying and correcting errors in subsequent iterations. While this strategy is effective for some tasks, it usually requires the model to have a built-in self-improvement mechanism, which can limit its applicability.
- Search against a Verifier: This approach generates multiple candidate answers and uses a verifier to select the best one. A verifier can be a hard-coded heuristic or a learned reward model. This article focuses on learned verifiers, which cover techniques such as Best-of-N sampling and tree search. Search strategies are more flexible and can adapt to the difficulty of the problem, but their performance is bounded by the quality of the verifier.
HuggingFace focuses on search-based methods, which are practical and scalable solutions for optimizing test-time compute. Here are three strategies:
- Best-of-N: Generate multiple responses to each problem and use a reward model to score each candidate, then select the answer with the highest reward (or a weighted variant discussed later). This method emphasizes answer quality over answer frequency.
- Beam search: A systematic search method that explores the solution space, often combined with a process reward model (PRM) to optimize the sampling and evaluation of intermediate steps in problem-solving. Unlike conventional reward models that produce a single score for the final answer, a PRM provides a sequence of scores, one for each step of the reasoning process. This fine-grained feedback makes PRMs a natural fit for LLM search methods.
- Diverse Verifier Tree Search (DVTS): HuggingFace's extension of beam search, which splits the initial beams into independent subtrees and then expands them greedily using the PRM. This method improves solution diversity and overall performance, especially with large test-time compute budgets.
Experimental Setup
The experimental setup includes the following steps:
- First, give the LLM a math problem and have it generate N partial solutions, e.g. intermediate steps of the derivation.
- Score each step with the PRM, which estimates the probability that each step will eventually reach the correct answer.
- Once the search strategy terminates, the final candidate solutions are ranked by the PRM to produce the final answer.
To compare various search strategies, this article used the following open-source models and datasets:
- Model: meta-llama/Llama-3.2-1B-Instruct is the primary model used for scaling test-time compute;
- Process reward model (PRM): To guide the search strategies, this article uses RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B-parameter reward model trained with process supervision. Process supervision is a training approach in which the model receives feedback on every step of its reasoning process, not just on the final outcome;
- Dataset: Evaluation is performed on the MATH-500 subset, a benchmark released by OpenAI as part of its research on process supervision. These math problems span seven subjects and are challenging for both humans and most large language models.
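As a rough sketch of how this setup might be loaded, the snippet below uses vLLM for the generator and transformers for the PRM; the MATH-500 dataset path is an assumption and may differ from the one actually used.

```python
# Loading sketch: policy model, process reward model, and evaluation set.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# Policy model served with vLLM for fast batched sampling.
generator = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
sampling_params = SamplingParams(temperature=0.8, max_tokens=2048, n=256)

# Process reward model; RLHFlow PRMs are causal LMs whose step scores are
# derived from token probabilities, so the scoring wrapper is omitted here.
prm_name = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"
prm_tokenizer = AutoTokenizer.from_pretrained(prm_name)
prm = AutoModelForCausalLM.from_pretrained(prm_name, device_map="auto")

# Evaluation problems (dataset ID assumed, not confirmed by the text).
math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")
```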
This article will start from a simple baseline and then gradually combine other techniques to improve performance.
Majority Voting
Majority voting is the most straightforward method for aggregating LLM outputs. For a given math problem, N candidate solutions are generated, and the most frequently occurring answer is selected. In all experiments, this article sampled up to N=256 candidate solutions with a temperature parameter T=0.8, generating up to 2048 tokens for each problem.
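As an illustration, a minimal majority-voting sketch might look like the following, where extract_answer is a hypothetical helper that parses the final answer (e.g. the contents of \boxed{...}) out of a generated solution string.

```python
from collections import Counter

def majority_vote(solutions: list[str], extract_answer) -> str:
    """Return the most frequent final answer among N candidate solutions.

    `extract_answer` is a hypothetical helper that pulls the final answer
    out of a full solution string.
    """
    answers = [extract_answer(sol) for sol in solutions]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```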
The following shows the performance of majority voting applied to Llama 3.2 1B Instruct:
The results indicate that majority voting shows significant improvement over the greedy decoding baseline, but its gains begin to plateau after about N=64 generations. This limitation arises because majority voting struggles to resolve problems that require intricate reasoning.
Given the limitations of majority voting, let’s see how to integrate reward models to enhance performance.
Surpassing Majority: Best-of-N
Best-of-N is a simple and effective extension of the majority voting algorithm that uses a reward model to determine the most reasonable answer. There are two main variants of this method:
Vanilla Best-of-N: Generate N independent responses and select the one with the highest RM score as the final answer. This ensures the most confident response is chosen, but it does not account for consistency across answers.
Weighted Best-of-N: Aggregate the scores of all identical responses and select the answer with the highest total reward. This method boosts recurring high-quality answers by accumulating their scores. Mathematically, the selected answer is

$$a_{\text{weighted}} = \arg\max_{a} \sum_{i=1}^{N} \mathbb{1}(a_i = a) \cdot \text{RM}(p, s_i),$$

where $\text{RM}(p, s_i)$ is the reward-model score of the $i$-th solution $s_i$ to problem $p$.
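Using the same hypothetical extract_answer helper as in the majority-voting sketch, both variants can be outlined as follows, where scores[i] plays the role of RM(p, s_i).

```python
from collections import defaultdict

def best_of_n(solutions: list[str], scores: list[float], extract_answer,
              weighted: bool = False) -> str:
    """Select a final answer from N candidates and their reward-model scores.

    weighted=False: return the answer of the single highest-scoring solution.
    weighted=True:  sum the scores of identical answers and return the argmax.
    """
    if not weighted:
        best = max(range(len(solutions)), key=lambda i: scores[i])
        return extract_answer(solutions[best])

    totals: dict[str, float] = defaultdict(float)
    for solution, score in zip(solutions, scores):
        totals[extract_answer(solution)] += score
    return max(totals, key=totals.get)
```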
Typically, people use outcome reward models (ORMs) to obtain a score at the level of an individual solution. However, to compare fairly with the other search strategies, the same PRM is used to score the Best-of-N solutions. As shown in the figure below, the PRM produces a cumulative sequence of step-level scores for each solution, so a reduction over the steps is needed to obtain a single solution-level score:
The most common reductions are as follows:
- Min: use the lowest score across all steps.
- Prod: use the product of the step-level scores.
- Last: use the score of the final step. This score contains cumulative information from all previous steps, effectively treating the PRM as an ORM that can score partial solutions.
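A small sketch of the three reductions, applied to the list of step-level scores the PRM returns for a single solution:

```python
import math

def reduce_step_scores(step_scores: list[float], how: str = "last") -> float:
    """Collapse PRM step-level scores into a single solution-level score."""
    if how == "min":    # lowest score across all steps
        return min(step_scores)
    if how == "prod":   # product of the step scores
        return math.prod(step_scores)
    if how == "last":   # score of the final step (PRM treated as an ORM)
        return step_scores[-1]
    raise ValueError(f"Unknown reduction: {how}")
```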
The following shows the results obtained by applying both variants of Best-of-N:
The results reveal a clear advantage: weighted Best-of-N consistently outperforms vanilla Best-of-N, especially with larger generation budgets. Its ability to aggregate the scores of identical answers ensures that even lower-frequency but higher-quality answers are effectively prioritized.
However, despite these improvements, it still falls short of the performance of the Llama 8B model, and the Best-of-N approach begins to plateau at N=256.
Can we further push the boundaries by gradually supervising the search process?
Beam Search with PRM
As a structured search method, beam search systematically explores the solution space, making it a powerful tool for improving model outputs at test time. When combined with a PRM, beam search can optimize both the generation and the evaluation of intermediate steps in problem-solving. Beam search works as follows:
- Maintain a fixed number of "beams" or active paths N, and iteratively generate multiple candidate solutions.
- In the first iteration, sample N independent steps from the LLM at temperature T to introduce diversity into the responses. These steps are usually delimited by a stopping criterion, such as terminating at a newline \n or a double newline \n\n.
- Score each step with the PRM and keep the top N/M steps as candidates for the next round of generation, where M is the "beam width" for a given active path. As in Best-of-N, the "last" reduction is used to score the partial solutions at each iteration.
- Expand the selected steps by sampling M continuations of the solution.
- Repeat steps (3) and (4) until the EOS token is reached or the maximum search depth is exceeded.
By allowing PRM to evaluate the correctness of intermediate steps, beam search can identify and prioritize promising paths early in the process. This stepwise evaluation strategy is particularly useful for complex reasoning tasks like math, as validating partial solutions can significantly improve the final outcome.
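The loop above can be sketched roughly as follows; generate_step and score_partial are hypothetical helpers that wrap the LLM (sampling one step at temperature T, stopping at a double newline) and the PRM with the "last" reduction, respectively.

```python
def beam_search(problem: str, n: int, beam_width: int, max_depth: int,
                generate_step, score_partial) -> list[str]:
    """Hedged sketch of PRM-guided beam search over partial solutions.

    generate_step(problem, prefix) -> text of the next step (hypothetical
        LLM wrapper, sampled at temperature T and truncated at "\\n\\n").
    score_partial(problem, prefix) -> PRM score of the partial solution,
        using the "last" reduction.
    """
    # Iteration 1: draw N independent first steps.
    beams = [generate_step(problem, "") for _ in range(n)]
    for _ in range(max_depth):
        # Score every partial solution and keep the top N/M candidates.
        ranked = sorted(beams, key=lambda b: score_partial(problem, b), reverse=True)
        survivors = ranked[: max(1, n // beam_width)]
        # Expand each survivor with M sampled continuations.
        beams = [prefix + generate_step(problem, prefix)
                 for prefix in survivors
                 for _ in range(beam_width)]
        # Stop once every beam has produced an end-of-text marker
        # ("<|eot_id|>" is Llama 3's end-of-turn token).
        if all("<|eot_id|>" in b for b in beams):
            break
    return beams
```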
Implementation Details
In the experiments, HuggingFace followed DeepMind’s hyperparameter selections and ran beam search as follows:
- Compute budgets N of 4, 16, 64, and 256
- A fixed beam width M=4
- Sampling at temperature T=0.8
- Up to 40 iterations, i.e. a tree with a maximum depth of 40 steps
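For reference, those choices can be captured in a small config object (a sketch; the field names here are mine and not taken from any released code).

```python
from dataclasses import dataclass

@dataclass
class BeamSearchConfig:
    """Hyperparameters for the beam-search runs described above."""
    n: int = 4                 # compute budget, swept over 4, 16, 64, 256
    beam_width: int = 4        # M, fixed at 4
    temperature: float = 0.8   # sampling temperature T
    max_iterations: int = 40   # maximum tree depth in steps
```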
The results are astonishing: with a test-time budget of N=4, beam search achieves the same accuracy as Best-of-N at N=16, a 4x improvement in compute efficiency! Moreover, beam search matches the performance of Llama 3.1 8B while needing only N=32 solutions per problem. Computer science PhD students score about 40% on MATH on average, so nearly 55% for a 1B model is already quite good!
Which Problems Does Beam Search Solve Best?
While it is clear overall that beam search is a better search strategy than Best-of-N or majority voting, DeepMind’s paper indicates that each strategy has trade-offs depending on the difficulty of the problem and the test-time compute budget.
To understand which problems are best suited to which strategy, DeepMind computed a distribution of estimated problem difficulty and divided it into quintiles. In other words, each problem is assigned one of five levels, where level 1 indicates easier problems and level 5 indicates the hardest problems. To estimate difficulty, DeepMind generated 2048 candidate solutions per problem with standard sampling and proposed the following heuristics:
- Oracle: Use ground-truth labels to estimate pass@1 for each problem, then bin the distribution of pass@1 scores to determine the quintiles.
- Model: Use the distribution of average PRM scores per problem to determine the quintiles. The intuition is that harder problems will have lower scores.
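A sketch of the oracle variant: estimate pass@1 from the sampled solutions for each problem and bin the distribution into quintiles (the exact binning used in the paper may differ).

```python
import numpy as np

def difficulty_quintiles(pass_at_1: np.ndarray) -> np.ndarray:
    """Assign each problem a difficulty level from 1 (easiest) to 5 (hardest).

    pass_at_1[i] is the fraction of sampled solutions to problem i that are
    correct; a lower pass@1 means a harder problem.
    """
    # Quintile edges of the pass@1 distribution across all problems.
    edges = np.quantile(pass_at_1, [0.2, 0.4, 0.6, 0.8])
    bins = np.digitize(pass_at_1, edges)  # 0 = lowest pass@1, 4 = highest
    return 5 - bins                       # invert so 5 = hardest, 1 = easiest
```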
The following figure shows the breakdown of various methods based on pass@1 scores and four test-time compute budgets N=[4,16,64,256]:
Each bar corresponds to a test-time compute budget, and within each bar the relative accuracy of each method is shown. For example, looking at the four bars for difficulty level 2:
- Majority voting is the worst-performing method across all compute budgets, except at N=256, where beam search performs worst.
- Beam search is best at N=[4,16,64], but Best-of-N is best at N=256.
It should be noted that beam search has made consistent progress on medium and hard difficulty problems (levels 3-5), but on simpler problems, especially with larger compute budgets, it often performs worse than Best-of-N (and even majority voting).
By inspecting the result trees produced by beam search, HuggingFace realized that if a single step is assigned a high reward, the whole tree collapses onto that trajectory, hurting diversity. This prompted them to explore an extension of beam search that maximizes diversity.
DVTS: Enhancing Performance Through Diversity
As seen above, beam search has better performance than Best-of-N but often performs poorly on simpler problems and with larger test-time compute budgets.
To address this issue, HuggingFace developed an extension called "Diverse Verifier Tree Search" (DVTS), aimed at maximizing diversity when N is large.
DVTS operates similarly to beam search but with the following modifications:
- For a given N and M, expand the initial set of beams into N/M independent subtrees.
- For each subtree, select the step with the highest PRM score.
- Generate M new steps from the nodes selected in step (2) and again keep the step with the highest PRM score.
- Repeat step (3) until the EOS token or the maximum tree depth is reached.
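Reusing the hypothetical generate_step and score_partial helpers from the beam-search sketch, DVTS can be outlined as a set of independently, greedily expanded subtrees.

```python
def dvts(problem: str, n: int, beam_width: int, max_depth: int,
         generate_step, score_partial) -> list[str]:
    """Hedged sketch of Diverse Verifier Tree Search (DVTS).

    The budget is split into N / M independent subtrees; each subtree is
    expanded greedily, keeping only the highest-PRM-scored continuation
    at every depth.
    """
    active = [generate_step(problem, "") for _ in range(max(1, n // beam_width))]
    finished = []
    for _ in range(max_depth):
        next_active = []
        for prefix in active:
            if "<|eot_id|>" in prefix:          # Llama 3 end-of-turn token
                finished.append(prefix)
                continue
            # Sample M candidate next steps and keep the best-scoring one.
            candidates = [prefix + generate_step(problem, prefix)
                          for _ in range(beam_width)]
            next_active.append(max(candidates, key=lambda c: score_partial(problem, c)))
        active = next_active
        if not active:
            break
    return finished + active
```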
The following figure shows the results of applying DVTS to Llama 1B:
It can be seen that DVTS provides a complementary strategy to beam search: when N is small, beam search is more effective at finding the correct solutions; but when N is large, the diversity of DVTS candidates begins to play a role, yielding better performance.
Additionally, in the breakdown of problem difficulty, DVTS improves performance on simple/medium problems when N is large, while beam search performs best when N is small.
Compute-Optimal Scaling
With a variety of search strategies available, a natural question arises: which one is the best? In DeepMind’s paper (refer to “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”), they propose a compute-optimal scaling strategy that selects search methods and hyperparameters θ to achieve optimal performance under a given compute budget N:
$$\theta^{*}_{q, y^{*}(q)}(N) = \arg\max_{\theta} \; \mathbb{E}_{y \sim \text{Target}(\theta, N, q)}\left[\mathbb{1}_{y = y^{*}(q)}\right],$$

where $y^{*}(q)$ is the ground-truth answer to problem $q$ and $\theta^{*}_{q, y^{*}(q)}(N)$ denotes the compute-optimal scaling strategy. Since computing $\theta^{*}_{q, y^{*}(q)}(N)$ directly is intractable, DeepMind proposed an approximation based on problem difficulty: allocate test-time compute according to whichever search strategy achieves the best performance at a given difficulty level.
For example, for simpler problems and lower compute budgets, strategies like Best-of-N are preferable, while for harder problems, beam search is a better choice. The figure below depicts the compute-optimal curves!
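In practice, the approximation reduces to a lookup: for every (difficulty level, compute budget) pair, pick whichever strategy scored best in a validation sweep like the one plotted above. A minimal sketch (the table itself would be filled in from your own sweep, not hard-coded here):

```python
def compute_optimal_strategy(difficulty: int, budget: int,
                             sweep_results: dict) -> str:
    """Return the best-performing search strategy for this difficulty/budget.

    sweep_results maps (difficulty_level, budget_N) to a dict of
    {strategy_name: accuracy}, populated from a validation sweep such as
    the one shown in the figures above.
    """
    accuracies = sweep_results[(difficulty, budget)]
    return max(accuracies, key=accuracies.get)

# Hypothetical usage: choose a strategy for a level-2 problem at budget N=16.
# strategy = compute_optimal_strategy(difficulty=2, budget=16, sweep_results=sweep_results)
```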
Scaling to Larger Models
This article also explores extending the compute-optimal recipe to the Llama 3.2 3B Instruct model, to see at what point the benefits of the PRM begin to fade relative to the capacity of the policy itself. The results show that compute-optimal scaling works very well: the 3B model outperforms Llama 3.1 70B Instruct, a model 22 times its size!
What’s Next?
The exploration of scaling test-time compute reveals the potential and challenges of utilizing search-based methods. Looking ahead, this article proposes several exciting directions:
- Strong verifiers: Strong verifiers play a key role in improving performance, and improving their robustness and generality is crucial for advancing these methods;
- Self-verification: The ultimate goal is self-verification, where the model can autonomously verify its own outputs. This appears to be what models like o1 are doing, but it remains hard to implement in practice. Unlike standard supervised fine-tuning (SFT), self-verification requires more nuanced strategies;
- Incorporating "thoughts" into the process: Integrating explicit intermediate steps or "thoughts" into the generation process can further improve reasoning and decision-making. Embedding structured reasoning into the search process enables better performance on complex tasks;
- Search as a data-generation tool: Search can also serve as a powerful data-generation process, producing high-quality training datasets. For example, fine-tuning models like Llama 1B on the correct trajectories produced by search can yield significant gains. This on-policy approach resembles techniques such as ReST or V-StaR, but with the added benefit of search, and offers a promising direction for iterative improvement;
- A call for more PRMs: The relatively small number of available PRMs limits their wider application. Developing and sharing more PRMs for different domains is a key area where the community can contribute significantly.
Original link: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute