Cohere’s Insights on Programmatic Knowledge Driving LLM Inference

The latest research from the Cohere team suggests that the reasoning capabilities of LLMs do not stem from simple retrieval of pre-training data, but from the learning and application of programmatic knowledge, much as humans learn general problem-solving methods from examples. In other words, LLMs extract reasoning patterns from data as if learning general programs, rather than memorizing answers. This finding challenges the earlier “approximate retrieval” hypothesis and points to a new direction for future LLM training: focus on high-quality programmatic knowledge rather than pursuing unlimited expansion of data scale. Surprisingly, models of different scales (e.g., 7B and 35B) appear to learn different information from the same data.

Exploring LLM Inference Mechanisms

The reasoning capabilities of large language models (LLMs) have long been a central and challenging topic in AI research. Understanding how LLMs reason not only helps us make better use of their capabilities, but is also crucial for the design and development of future AI systems. A widespread assumption in the community has been that LLMs reason via approximate retrieval: when answering, they “retrieve” the intermediate reasoning steps from parameterized knowledge formed during pre-training and stitch the retrieved fragments into an answer, rather than engaging in “true” reasoning.

However, the latest research from the Cohere team challenges this view. They found that programmatic knowledge, rather than simple retrieval, is the key to LLM reasoning capabilities. This research provides a new perspective for understanding the reasoning mechanisms of LLMs and points to new directions for the training and optimization of future LLMs.

Question: How Do LLMs Reason?

How exactly do LLMs reason? Do they retrieve information like search engines, or do they apply logic like humans? This question has long puzzled researchers. The Cohere team set out to probe the reasoning mechanisms of LLMs, focusing on mathematical reasoning tasks to investigate whether models truly reason or merely retrieve memorized answers. The core question of the research: how do LLMs learn to reason from their pre-training data?

Background: The Application of the EK-FAC Influence Function

To probe the reasoning mechanisms of LLMs, the Cohere team employed EK-FAC (eigenvalue-corrected Kronecker-factored approximate curvature) influence functions. This method estimates the effect of each document in the pre-training data on the model's output, revealing which knowledge matters most to the model's reasoning process. The influence function acts like a microscope: it magnifies the impact of individual pre-training documents on the model's answers and pinpoints the information that most influences its reasoning, much like finding the critical lines of code during debugging.
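
To make this concrete, below is a minimal, self-contained sketch of what an influence score is. It substitutes an identity matrix for the inverse Hessian, so the score collapses to a gradient dot product; the actual study uses the full EK-FAC curvature approximation on a real LLM, which this toy example does not reproduce. The model, documents, and query here are placeholders.

```python
# Minimal illustration of an influence score: how much would the loss on a
# query change if a given training example were up-weighted during training?
# Classic influence functions approximate this as
#     influence(z_train, z_query) ≈ -grad_query^T  H^{-1}  grad_train
# EK-FAC provides a tractable Kronecker-factored estimate of H^{-1}; here we
# use the identity matrix instead, reducing the score to a gradient dot product.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)        # toy stand-in for an LLM
loss_fn = nn.MSELoss()

def flat_grad(x, y):
    """Gradient of the loss on a single (x, y) pair, flattened to one vector."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# A handful of "pre-training documents" and one "query" (prompt + completion).
train_examples = [(torch.randn(4), torch.randn(1)) for _ in range(5)]
query = (torch.randn(4), torch.randn(1))

g_query = flat_grad(*query)
scores = []
for i, (x, y) in enumerate(train_examples):
    g_train = flat_grad(x, y)
    scores.append((i, -torch.dot(g_query, g_train).item()))  # identity-Hessian influence

# Rank "documents" by how strongly they influence the query.
for i, s in sorted(scores, key=lambda t: t[1], reverse=True):
    print(f"train doc {i}: influence {s:+.4f}")
```

In the real setting, the query gradient is taken on the model's completion to a prompt, and scores are computed over the millions of sampled pre-training documents.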

Experimental Setup: Data and Models

This study used two Cohere Command R models of different scales (7B and 35B parameters) and sampled **5 million documents (2.5 billion tokens)** from the pre-training data, ensuring the sample matches the pre-training distribution. The researchers designed three types of questions: factual questions; reasoning questions covering two-step arithmetic (7B model only), calculating the slope between two points (both 7B and 35B models), and solving linear equations (35B model only); and control questions, which rule out irrelevant factors such as formatting. Mathematical reasoning tasks were chosen because their reasoning steps are clear and controllable, making the models' reasoning process easier to analyze.
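
For orientation, here is a small sketch of how such a query set might be organized. The question strings are hypothetical stand-ins rather than the paper's actual prompts; only the task-to-model assignments follow the setup described above.

```python
# Illustrative organization of the evaluation queries described above.
# The question strings are made-up stand-ins, not the study's actual prompts.
QUERY_SET = {
    "factual": {                       # posed to both models
        "models": ["7B", "35B"],
        "example": "What is the tallest mountain in the world?",
    },
    "reasoning/two_step_arithmetic": {
        "models": ["7B"],              # 7B model only
        "example": "Calculate (5 + 3) * 2. Think step by step.",
    },
    "reasoning/slope": {
        "models": ["7B", "35B"],       # both models
        "example": "What is the slope of the line through (1, 2) and (3, 6)?",
    },
    "reasoning/linear_equation": {
        "models": ["35B"],             # 35B model only
        "example": "Solve for x: 4x + 7 = 19.",
    },
    "control": {                       # same format, no reasoning required;
        "models": ["7B", "35B"],       # rules out formatting effects
        "example": "Repeat the following numbers: 4, 7, 19.",
    },
}

for task, spec in QUERY_SET.items():
    print(f"{task:32s} -> models {spec['models']}")
```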

Examples of factual and reasoning questions (Figure 1):

Figure 1: Examples of factual questions (left) and reasoning questions (right). Factual questions require knowledge retrieval, while reasoning questions require multi-step reasoning.

Example of two-step arithmetic (7B model) (Figure 2):

Figure 2: Example of two-step arithmetic reasoning by the 7B model.

Example of calculating the slope between two points (7B and 35B models) (Figures 3 & 4):

Figures 3 & 4: Examples of the 7B and 35B models calculating the slope between two points. Although the sign of the final answer is incorrect, the reasoning steps are sound.

Example of solving linear equations (35B model) (Figure 5):

Figure 5: Example of the 35B model solving linear equations.

Example of factual questions (Figure 6):

Figure 6: Example of a factual question; both the 7B and 35B models answer it correctly.

Research Findings: Programmatic Knowledge Drives LLM Reasoning

The core finding of the Cohere team is that LLM reasoning is more akin to programmatic generalization, similar to how humans learn general problem-solving methods, rather than simply retrieving answers from pre-training data. LLMs do not simply memorize and match answers, but learn the reasoning process.

Quantitative Results: Evidence of Programmatic Knowledge

Generalization Ability Analysis: Same Tasks, Different Data

The researchers calculated the Pearson correlation coefficients of document influence scores between different questions and found that there is a significant positive correlation between different questions of the same reasoning task (Figure 7), indicating that the model has learned generalizable programmatic knowledge. Even if the specific values in the questions differ, the model can apply the same reasoning process to solve the problems.
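
In outline, this analysis treats each question as a vector of per-document influence scores and correlates those vectors across questions. The sketch below uses randomly generated placeholder scores purely to show the shape of the computation; it is not the study's data.

```python
# Sketch of the generalization analysis: each question gets one influence score
# per pre-training document, and we check how strongly those score vectors
# correlate across questions. Scores here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_docs = 10_000

def make_task(shared_weight, n_questions):
    """Questions of one task partially reuse a shared per-document component."""
    shared = rng.normal(size=n_docs)
    return [shared_weight * shared + rng.normal(size=n_docs)
            for _ in range(n_questions)]

questions = make_task(shared_weight=1.0, n_questions=3)   # same reasoning task
questions += make_task(shared_weight=0.0, n_questions=3)  # unrelated questions

# Pearson correlation matrix between all influence-score vectors:
# a block of high values appears for questions that share procedural structure.
corr = np.corrcoef(np.stack(questions))
np.set_printoptions(precision=2, suppress=True)
print(corr)
```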

Figure 7: Pearson correlation coefficients of document influence scores between different questions. Darker cells indicate stronger positive correlation; questions within the same reasoning task (arithmetic, slope, linear equations) are strongly positively correlated.

7B vs 35B: The Impact of Model Scale

The 35B model exhibits stronger generalization ability in slope calculation tasks, indicating that larger models can better learn and apply programmatic knowledge. However, surprisingly, there is almost no correlation in the information learned by the 7B and 35B models from the same data.

Dependency Analysis: Weak Dependency, Strong Generalization

Compared to factual questions, reasoning questions have lower dependency on individual documents, and the influence distribution is more scattered (Figure 8), which further proves the importance of programmatic knowledge. The model does not rely on retrieving information from a single document, but learns and synthesizes programmatic knowledge from multiple documents, achieving programmatic generalization.
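
This dependency analysis can be summarized as a top-k curve: sort documents by influence and ask what share of the total influence the top k% account for. The sketch below uses made-up score distributions purely to illustrate the computation, not the study's actual numbers.

```python
# Sketch of the dependency analysis: what share of total influence is carried
# by the top-k% most influential documents? A heavy-tailed distribution
# (factual-like) concentrates influence in a few documents, while a flatter
# one (reasoning-like) spreads it out. Both distributions are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_docs = 500_000

scores = {
    "factual-like": rng.pareto(a=1.5, size=n_docs),  # heavy tail: few docs dominate
    "reasoning-like": rng.exponential(size=n_docs),  # flatter: influence spread out
}

def top_k_share(influence, k_percent):
    """Fraction of total influence contributed by the top k% of documents."""
    k = max(1, int(len(influence) * k_percent / 100))
    return np.sort(influence)[-k:].sum() / influence.sum()

for name, s in scores.items():
    for k in (0.01, 0.1, 1, 10):
        print(f"{name:15s} top {k:>5}% of docs -> {top_k_share(s, k):6.1%} of total influence")
```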

Figure 8: Total influence of the top-k percent of documents. The total influence for reasoning questions is much lower than for factual questions and fluctuates less, indicating that reasoning questions depend less on individual documents.

7B vs 35B: The Impact of Model Scale

For reasoning questions, the 35B model shows an even lower dependency on individual documents, again pointing to the stronger generalization ability of larger models: increasing model scale appears to help LLMs learn and use programmatic knowledge while reducing reliance on any single document. For factual questions, by contrast, the 35B model's influence values are larger and fluctuate more, indicating that it learns from fewer but more relevant documents (Figure 9).

Figure 9: Comparison of influence between the 7B and 35B models on the same question.

Qualitative Results: In-Depth Interpretation of Programmatic Knowledge

Answer Dependency Analysis: Not Just Simple Answer Matching

The researchers found that answers to factual questions appear more frequently in high-influence documents, while answers to reasoning questions appear very rarely (Figure 10). This further proves that LLM reasoning is not simply answer matching, but a reasoning process based on programmatic knowledge.
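
At its core, this check is a string search over the highest-ranked documents: does the question's answer appear anywhere in the top 500 most influential documents? A simplified sketch with toy documents follows; a real implementation would normalize numbers and phrasing more carefully.

```python
# Sketch of the answer-dependency check: how often does the answer string
# appear among the top-k most influential documents for a query?
# The documents and answers below are tiny made-up stand-ins.
def answer_frequency(ranked_docs, answer, k=500):
    """Fraction of the top-k documents that contain the answer string."""
    top = ranked_docs[:k]
    hits = sum(answer.lower() in doc.lower() for doc in top)
    return hits / len(top)

# Toy "pre-training documents", already sorted by descending influence.
docs = [
    "Mount Everest is the tallest mountain in the world.",
    "The slope formula is (y2 - y1) / (x2 - x1).",
    "def slope(p, q): return (q[1] - p[1]) / (q[0] - p[0])",
    "Everest rises 8,849 metres above sea level.",
]

print("factual  :", answer_frequency(docs, "Everest", k=4))  # answer text present often
print("reasoning:", answer_frequency(docs, "2.0", k=4))      # numeric answer rarely present
```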

Figure 10: Frequency with which answers appear in the top 500 high-influence documents. Answers to factual questions appear frequently, while answers to reasoning questions appear only rarely.

7B vs 35B: The Impact of Model Scale

The 35B model has a lower dependency on answers, which corresponds to its stronger programmatic generalization ability.

Influencing Factors Analysis: Code and Formulas Are Key

Analysis of high-influence documents shows that these documents often contain programmatic knowledge related to reasoning processes, such as code, mathematical formulas, etc. This indicates that LLMs can learn reasoning steps and methods from code and formulas, confirming the importance of programmatic knowledge.
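
One way to operationalize this kind of qualitative finding is a heuristic pass over the top-ranked documents that tags code-like or formula-like content. The regexes below are rough illustrations, not the classification actually used in the study.

```python
# Rough heuristic for tagging high-influence documents by content type.
# These patterns are illustrative only; a real analysis would be more careful.
import re

CODE_PATTERN = re.compile(r"\bdef \w+\(|\bimport \w+|[{};]\s*$", re.MULTILINE)
MATH_PATTERN = re.compile(r"[=+\-*/^]\s*\d|\\frac|\\sum|\bslope\b|\bequation\b", re.IGNORECASE)

def tag_document(text):
    """Return rough content tags ('code', 'math', or 'other') for a document."""
    tags = []
    if CODE_PATTERN.search(text):
        tags.append("code")
    if MATH_PATTERN.search(text):
        tags.append("math")
    return tags or ["other"]

docs = [
    "def slope(p, q):\n    return (q[1] - p[1]) / (q[0] - p[0])",
    "The slope of a line is (y2 - y1) / (x2 - x1).",
    "Paris is the capital of France.",
]
for doc in docs:
    print(tag_document(doc), "<-", doc.splitlines()[0])
```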

7B vs 35B: The Impact of Model Scale

The 35B model is more inclined to utilize programmatic knowledge, further demonstrating the advantages of larger models.

Data Source Analysis: The Greater Impact of Programmatic Data Sources

The study found that programmatic data sources such as code, StackExchange, ArXiv, and Markdown have a significant positive impact on solving reasoning questions, while data sources like Wikipedia and Trivia have more influence on answering factual questions (Figures 11 & 12). This suggests that different types of data have different impacts on different types of tasks, providing important guidance for constructing more effective pre-training datasets.
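
The source breakdown reduces to counting sources among the top-k documents for each query. Below is a minimal sketch with made-up source labels and an invented ranking, just to show the computation.

```python
# Sketch of the data-source analysis: among the top-k most influential
# documents for a query, what fraction comes from each pre-training source?
# The ranked source list below is a made-up placeholder.
from collections import Counter

def source_proportions(ranked_sources, k):
    """Proportion of each source among the top-k influential documents."""
    counts = Counter(ranked_sources[:k])
    return {src: n / k for src, n in counts.most_common()}

# Toy ranking for a reasoning query: source of each document, most influential first.
ranked_sources = (
    ["code"] * 40 + ["stackexchange"] * 25 + ["arxiv"] * 15 +
    ["markdown"] * 10 + ["wikipedia"] * 10
)

for src, share in source_proportions(ranked_sources, k=100).items():
    print(f"{src:14s} {share:.0%}")
```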

Figure 11: Proportion of different data sources among the top-k documents. Programmatic data sources such as code and StackExchange have a greater impact on reasoning questions.

Figure 12: Proportion of different data sources among the bottom-k documents.

7B vs 35B: The Impact of Model Scale

The 35B model can better leverage knowledge from different data sources, reflecting the stronger learning ability of larger models.

Conclusion: The Future of Programmatic Knowledge

The Cohere team's research reveals the key role of programmatic knowledge in the reasoning process of LLMs, challenging the previous view that LLMs rely mainly on retrieval to reason, and offering new ideas for training and optimizing future LLMs: focus on high-quality programmatic knowledge, such as code and formulas. Future research can further explore how programmatic knowledge is learned and how to combine it with other types of knowledge to build more powerful LLM reasoning systems.

Related Links

  • Paper: https://arxiv.org/pdf/2411.12580
  • Demo: https://lauraruis.github.io/Demo/Scripts/linked.html
