With the release of OpenAI o1, inference has finally entered the spotlight we have been anticipating for the past year. Nvidia's CEO Jensen Huang once said, "The scale of inference will be a billion times larger than today." (As an aside: measured by query volume this might turn out to be true, but if inference already accounts for 40% of Nvidia's revenue, a billion-fold increase is impossible in revenue terms.)
With o1, inference has for the first time constituted a meaningful part of the total computation of the model.
Source: https://www.fabricatedknowledge.com/p/chatgpt-o1-strawberry-and-memory
It points to a potential new scaling law: the longer the model "thinks," the higher its accuracy. Stratechery (https://stratechery.com/2024/enterprise-philosophy-and-the-first-wave-of-ai/) describes this performance improvement well:
o1 has been explicitly trained on how to solve problems, and it is designed to generate multiple problem-solving streams during inference, select the best solution, and iterate on each step when it realizes it has made a mistake. That is why it can solve crossword puzzles, though it takes a long time.
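To make the "multiple problem-solving streams" idea concrete, here is a minimal best-of-n sampling sketch in Python. This is not OpenAI's actual o1 procedure (which has not been published); it only illustrates the general pattern of spending more inference-time compute to buy accuracy. The OpenAI Python SDK call is real, but the model name is a stand-in and `score_solution` is a hypothetical verifier you would supply (unit tests, a reward model, self-critique, etc.).

```python
# Best-of-n sampling: trade extra inference compute for accuracy.
# NOTE: illustrative sketch only, not OpenAI's o1 method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_solution(answer: str) -> float:
    """Hypothetical verifier; replace with unit tests, a reward model, etc."""
    return float(len(answer))  # placeholder heuristic only


def solve_with_extra_compute(problem: str, n_candidates: int = 8) -> str:
    candidates = []
    for _ in range(n_candidates):
        # Each call is one independent "problem-solving stream".
        resp = client.chat.completions.create(
            model="gpt-4o-mini",            # stand-in model name
            messages=[{"role": "user", "content": problem}],
            temperature=1.0,                # diversity across streams
        )
        candidates.append(resp.choices[0].message.content)
    # Keep whichever stream the verifier scores highest.
    return max(candidates, key=score_solution)
```

The more candidate streams you generate (and the better the verifier), the more accuracy each query can buy, which is exactly the "longer thinking, higher accuracy" trade-off described above.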
Last month, Anthropic announced the “Computer Use” feature, enabling models to interact with computers like humans. This indicates that AI applications will become increasingly complex, leading to an increase in inference volume.
Two factors make this market particularly interesting: the reduction of computational costs and the fierce competition in the field.
Source: https://cloudedjudgement.substack.com/p/clouded-judgement-92724-the-foundation
Falling inference costs, a rapidly expanding market, and intensifying competition make this market a very interesting case study in artificial intelligence.
This article will delve into the current situation, the variables determining market direction, and how value flows within the ecosystem based on these variables.
Clearly, inference is an emerging market, and this field is very crowded and rapidly changing. The best inference performance metrics we currently have are third-party benchmarks (if you have more accurate data, feel free to contact us).
1. Background of Inference
First, inference offers a more open, competitive market than training. During training, a model representing a complex problem space is created by iterating over large datasets; inference is the process of feeding new data into that model to generate predictions.
Source: https://www.linkedin.com/pulse/difference-between-deep-learning-training-inference-mark-robins-mdq8c/
Some key differences are particularly important in inference:
- Latency and Location are Crucial: Since inference serves workloads for end users, response speed is critical, which means running inference at the edge or in edge-cloud environments can make more sense than it does for training. Training, by contrast, can be done anywhere.
- Importance of Reliability (Slightly) Decreased: Training a cutting-edge model can take months on a large training cluster, and the interdependence within the cluster means an error in one part can slow down the entire run. Inference workloads are much smaller and far less interdependent; if an error occurs, only a single request is affected and it can be quickly re-run.
- Importance of Hardware Scalability Decreased: One of Nvidia's key advantages is its ability to scale up larger systems through its software and networking strengths. In inference, that scalability matters less.
These reasons collectively explain why many new semiconductor companies focus on inference, as the barriers to entry are relatively low.
It is worth noting that while “inference” is a broad term describing the actual use of models, it encompasses various types of machine learning models. My colleagues have written here about the changes in ML deployment methods in recent years. Here are the performance differences across different workloads:
Companies running inference have many options. From the easiest to manage and least customizable to the hardest to manage and most customizable, the options are:
- Base Model APIs: APIs from model providers like OpenAI. The simplest and least flexible option.
- Inference Service Providers: Specialized inference service providers, such as Fireworks AI and DeepInfra, aim to optimize costs across various cloud and hardware providers and are a good choice for running and customizing open-source models.
- AI Clouds: GPU or inference as a service from companies like Coreweave and Crusoe, where companies can rent computing power and customize as needed.
- Hyperscale Cloud Vendors: Hyperscalers provide computing power, inference services, and platforms on which companies can develop dedicated models.
- AI Hardware Providers: Companies run their own GPUs and optimize them for their specific needs.
Additional Information 1: From API to AI Hardware—companies like Groq, Cerebras, and SambaNova have begun offering inference cloud services that allow customers to leverage their hardware in the form of inference APIs. Nvidia acquired the inference service provider OctoAI, presumably to create its own inference service.
Additional Information 2: Edge Inference—Apple, Qualcomm, and Intel seek to provide hardware and software to enable inference directly on devices.
Since base model APIs are quite simple (companies call the base model provider's API and pay as they go), I will start with inference providers.
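For orientation, here is what the simplest option looks like in practice: a single pay-as-you-go call through the OpenAI Python SDK. Many inference providers advertise OpenAI-compatible endpoints, so pointing the same client at a different provider is often just a base_url and model-name change. The provider URL and model names below are illustrative placeholders, not real endpoints; check each provider's documentation.

```python
# The "Base Model API" path: one pay-as-you-go HTTP call, nothing to manage.
from openai import OpenAI

# Base model provider (api.openai.com); reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()

# Hypothetical inference provider exposing an OpenAI-compatible endpoint.
provider_client = OpenAI(
    base_url="https://api.example-inference-provider.com/v1",  # placeholder URL
    api_key="YOUR_PROVIDER_KEY",
)

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Why does inference latency matter?"}],
)
print(resp.choices[0].message.content)
```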
Several companies are emerging in providing inference services, abstracting the need to manage hardware. The most notable of these companies include inference startups like Fireworks AI, Together, Replicate, and DeepInfra. Kevin Zhang describes these companies well here:
API-only startups like Replicate, Fireworks AI, and DeepInfra have completely abstracted all complexity, allowing models to be accessed via API calls. This is similar to the user experience provided by base model providers like OpenAI for developers. Therefore, these platforms typically do not allow users to customize choices such as GPUs for specific models. However, Replicate has Cog for deploying custom models and other tasks.
Meanwhile, Modal and Baseten provide a middle-ground experience, where developers have more “tuning knobs” to control their infrastructure but still find it easier than building custom infrastructure. This finer-grained control allows Modal and Baseten to support use cases beyond simple text completion and image generation.
The clearest use case for these providers is serving inference for open-source models, enabling companies to build applications on top of them. Inference providers use a variety of techniques to drive costs down as far as possible.
When choosing an inference provider, the main considerations are cost and performance: inference cost, latency (time to first token and time between subsequent tokens), and throughput (the ability to handle demand). We have some visibility into pricing:
Interestingly, a recent change over the past few months is that hardware vendors have started to enter the inference space. Nvidia acquired the inference provider OctoAI, likely to offer similar services. We can see three hardware vendors providing the fastest inference services on the market:
As always, one should remain cautious regarding benchmark results. According to Irrational Analysis (https://irrationalanalysis.substack.com/p/cerebras-cbrso-equity-research-report), Cerebras does not provide Llama 405B, possibly due to its unreasonable cost. Specific setups may achieve these results, but they might be incompatible with other models or impractical in production use cases.
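Given those caveats, it is cheap to sanity-check a provider yourself. The sketch below measures the latency metrics mentioned above (time to first output and the gap between outputs) against any OpenAI-compatible streaming endpoint; stream chunks are only an approximation of token boundaries, and the model name is again a placeholder, so treat the numbers as indicative rather than a formal benchmark.

```python
# Quick-and-dirty latency probe for an OpenAI-compatible streaming endpoint.
import time

from openai import OpenAI

client = OpenAI()  # point base_url/api_key at the provider you want to test

start = time.perf_counter()
first_chunk_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # time to first output (approx.)
        chunks += 1
end = time.perf_counter()

print(f"time to first output: {first_chunk_at - start:.3f}s")
print(f"avg gap between outputs: {(end - first_chunk_at) / max(chunks - 1, 1):.4f}s")
print(f"overall rate: {chunks / (end - start):.1f} chunks/s")
```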
For most companies the ROI calculation will come down to the ratio of total cost of ownership to performance, data that is difficult to obtain at this stage of the industry's lifecycle.
I would point out that AI clouds like Coreweave, Crusoe, and Lambda all offer inference services. Hyperscale cloud vendors do as well! Kevin Zhang also speculates that data platform and application infrastructure providers may also expand into the inference space:
Source: https://eastwind.substack.com/p/a-deep-dive-on-ai-inference-startups
In this competitive environment, companies need to offer meaningful architectural differentiation, development tools for building inference-based solutions, or cost advantages achieved through vertical integration in order to stand out.
4. Hardware Providers
The inference providers above abstract away the complexity of managing the underlying hardware. For many large AI companies, however, managing their own hardware makes sense. This includes infrastructure setup (installation, data center construction, or colocation), model optimization, performance monitoring, and ongoing hardware maintenance.
We can see hardware suppliers in the chip portion of the value chain:
If 40% of Nvidia’s data center revenue indeed comes from inference, then Nvidia is currently dominant in this market. As Huang pointed out, companies that already have leading training hardware may convert it to inference hardware when upgrading their equipment.
AMD is exploring this market, expecting its AI accelerators to bring in $5 billion in annual revenue. Most of the qualitative comments from their recent earnings call pointed towards inference workloads.
RunPod made an interesting comparison of H100 and MI300X in terms of inference, noting that MI300X has better throughput at high batch sizes due to its larger VRAM.
https://blog.runpod.io/amd-mi300x-vs-nvidia-h100-sxm-performance-comparison-on-mixtral-8x7b-inference/
MI300X is more cost-effective in very small and very large batches. As the blog points out, pure performance is just part of the evaluation. Nvidia’s lead in networking and software gives it an additional advantage in practical scenarios where system-level design is needed.
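A rough back-of-the-envelope calculation shows why memory capacity matters at high batch sizes: each in-flight request needs its own KV cache on top of the model weights. The sketch below approximates a Mixtral-8x7B-like shape (32 layers, 8 KV heads, head dimension 128) in fp16; the exact figures are illustrative assumptions, not measured results from the RunPod benchmark.

```python
# Why more VRAM helps at high batch sizes: KV-cache memory grows linearly
# with batch size and sequence length, on top of the fixed model weights.

def kv_cache_gib(layers=32, kv_heads=8, head_dim=128, bytes_per_val=2,
                 seq_len=4096, batch_size=64):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
    return per_token * seq_len * batch_size / 2**30

print(f"batch 64:  {kv_cache_gib():.1f} GiB of KV cache")    # ~32 GiB
print(f"batch 192: {kv_cache_gib(batch_size=192):.1f} GiB")  # ~96 GiB
# With the model weights fixed, the part with more HBM headroom (192 GB on
# MI300X vs 80 GB on H100 SXM) can hold more concurrent sequences, which is
# one reason throughput at large batch sizes favors the larger-memory chip.
```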
Several hardware startups have also raised significant funding to capture this market:
https://www.chipstrat.com/p/etched-silicon-valleys-speedrun
Again, it is important to note that the buyer’s calculation formula will be TCO/performance. Value will flow to the hardware level, and the question is how much value is created by the layers above the hardware.
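As a toy illustration of that TCO/performance framing, the sketch below computes cost per million generated tokens from hardware price, power, utilization, and sustained throughput. Every input is a placeholder assumption chosen to show the arithmetic, not a measured figure for any specific accelerator or provider.

```python
# Toy TCO/performance model: dollars per million output tokens.

def cost_per_million_tokens(server_price_usd=250_000, amortization_years=4,
                            power_kw=10.0, usd_per_kwh=0.10,
                            utilization=0.6, tokens_per_second=5_000):
    hours_per_year = 24 * 365
    yearly_capex = server_price_usd / amortization_years
    yearly_power = power_kw * hours_per_year * usd_per_kwh
    yearly_tokens = tokens_per_second * utilization * hours_per_year * 3600
    return (yearly_capex + yearly_power) / (yearly_tokens / 1e6)

print(f"${cost_per_million_tokens():.2f} per 1M tokens")  # ~$0.75 with these inputs
# Halving the hardware price or doubling sustained throughput moves the ratio
# by the same amount, which is why both chip cost and software efficiency
# (batching, quantization, kernels) feed directly into where value accrues.
```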
There is another uncertain variable in the market, and it could determine a large share of where value accrues in inference: edge inference.
Austin from Chipstrat (https://www.chipstrat.com/) has done a great job in this regard. As Austin describes, edge inference is beneficial for all parties involved:
Companies will increasingly be motivated to shift these workloads onto consumer devices as much as possible—consumers provide the hardware and power resources that enable companies to generate intelligence.
This is a win-win situation: companies reduce capital and operational expenditures, while consumers enjoy the benefits of local inference. It is important to note that adopting local inference requires:
- Incentivizing consumers (business models that reward local inference, security advantages, etc.).
- Useful small models that can run on edge devices.
The former seems quite simple. Models like o1-mini make the latter increasingly realistic. I don't need Siri to be a compressed version of the entire internet; I just need a reasoning tool that can handle simple tasks. What is needed is more like a trained fifth grader than a PhD generalist.
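For a sense of how low that bar already is, the sketch below runs a small instruction-tuned model entirely on a local device with the Hugging Face transformers library. The model ID is just one example of a sub-billion-parameter instruct model; any similarly small model that fits on a laptop or phone-class accelerator could serve the "trained fifth grader" role.

```python
# Local, on-device inference with a small open model (no API calls).
from transformers import pipeline

assistant = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example small model; swap as desired
    device_map="auto",                   # CPU, GPU, or Apple silicon, whatever is available
)

out = assistant(
    "Q: A meeting starts at 2:45pm and lasts 50 minutes. When does it end?\nA:",
    max_new_tokens=32,
)
print(out[0]["generated_text"])
```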
The question returns to developing hardware and software to meet user needs. I believe we can solve these issues over time.
https://www.generativevalue.com/p/the-ai-semiconductor-landscape
Companies are already developing hardware, such as Apple’s neural engine, AMD’s NPU, Intel’s NPU, Qualcomm’s NPU, Google’s Tensor, and startups like Hailo. With improvements in small models, edge inference will increasingly become a reality.
My view on edge inference:
If we look back at historically disruptive technological changes, they occur when a new product offers less functionality at a far lower price than existing products, and the incumbents cannot compete down-market. Mainframes gave way to minicomputers, minicomputers gave way to personal computers, and personal computers gave way to smartphones.
The key variable that triggers these disruptions is performance surplus: high-end solutions solve problems that are non-essential for most people. Many disruptive shifts in computing have come from decentralizing compute, because consumers do not need the extra performance.
With AI, I have yet to see performance surplus. ChatGPT is good, but not great yet. Once it becomes great, then the door to AI in edge computing will open. Small language models and neural processing units will lead this era. The question is when AI will be realized in edge computing, not if it will be realized.
This market again returns to applications, and edge inference makes more sense for consumer applications.
6. The Future of the Inference Market
Inference workloads will ultimately follow the scale and form of AI applications.
The scale and intensity of artificial intelligence applications will be key factors determining the size of the inference market (i.e., how many applications are in use and their complexity). The form of these applications (i.e., who is building them) will help determine the shape of the inference market.
If the AI application market ultimately concentrates in the hands of a few companies like OpenAI, Microsoft, and Google, then the value of inference will flow to the underlying hardware of these vertically integrated companies.
If the AI application market ultimately becomes fragmented, with many companies holding smaller market shares, then the inference market will be more open. These smaller, non-vertically integrated companies will pay for management services from inference providers. Some companies may want more personalization or customization options than what simple APIs can offer.
If these applications can run on edge with sufficiently simple models, this will open the door for edge inference hardware.
Finally, all these variables are continuous rather than binary. Some inference will run on the edge, some applications will become highly complex logical reasoning machines, some applications will be owned by large model providers, while others will be won by startups.