This article is based on an interview with Joel Hestness by Dr. Waku on his YouTube channel, published on December 25, 2024. Original content reference: https://www.youtube.com/watch?v=qC_lCFTOJU0

Summary: Joel Hestness on How Cerebras’ Giant Chip Challenges NVIDIA’s GPU Dominance in AI

This article focuses on Cerebras Systems and its new machine learning processor, with the following key points:
- Cerebras Processor Architecture: Cerebras uses an entire wafer as a single chip rather than cutting it into multiple dies, delivering performance equivalent to roughly 16-20 GPUs while simplifying programming and scaling. Its 40GB of on-wafer SRAM sits close to the cores, giving extremely low latency; weights are streamed onto the wafer for computation instead of shuttling data out to remote compute units, which improves efficiency.
- Large Language Model Training Capability: The Cerebras system can run trillion-parameter models (comparable to GPT-4 scale) on a single device and scales out horizontally to train even larger models (currently up to about 24 trillion parameters), outpacing existing GPU-based approaches. The team has demonstrated trillion-parameter models on a single device and is exploring mixture-of-experts models.
- Efficient Inference Capability: Cerebras recently released an inference solution that is about four times faster than other accelerator hardware and 15-20 times faster than GPUs (roughly 2,100 tokens per second on Llama 3 70B), thanks to its pipelined execution and kernels generated with a domain-specific language (DSL).
- Chip Defect Mitigation Strategy: The Cerebras processor uses a redundant design and firmware mapping to mask manufacturing defects, achieving effectively 100% yield. Failing cores are masked automatically without affecting the logical core grid, unlike traditional chip manufacturing, which typically discards defective chips.
- Hardware Access Methods: Cerebras offers direct hardware sales, cloud services (reserved and on-demand), white-glove services (assisting clients with machine learning work), and an inference API (supporting open-source and custom models).
- Software and Ecosystem: The Cerebras stack supports PyTorch and is expanding support for the Hugging Face model library so users can keep their familiar tools. The team is building out eager execution and optimizing a broad set of operations for better performance.
- System Architecture: The Cerebras system is a distributed system with three node types: accelerator nodes (containing the wafers), MemoryX nodes (storing weights), and SwarmX nodes (handling network communication). This design suits very large models and high-performance computing workloads such as molecular dynamics simulations.
- Cooling Solutions: Cerebras uses an efficient water-based cooling system; power density is comparable to NVIDIA's H100 systems, at about 15 kilowatts per device.
- Future Outlook: Cerebras is optimistic about inference-time computing and plans to provide computational resources to academia.
About Joel Hestness
Joel Hestness is a research scientist with significant contributions in machine learning and computer architecture. He leads the core machine learning team at Cerebras Systems as a senior research scientist. His research focuses on natural language applications and their scalability, training dynamics, and efficiency characteristics. He has investigated scaling laws for compute-efficient training in projects such as Cerebras-GPT and contributed to models such as BTLM.
Full Interview
Host: Hello everyone. Today, I am with Joel Hestness, the head of the core machine learning team at Cerebras Systems. Joel, can you introduce yourself?
Joel Hestness: I’m doing well, glad to be here, and looking forward to our chat. I am the head of the core machine learning team and also a research scientist. My background is in hardware, and I have worked on heterogeneous system architecture. After graduate school, I joined Baidu Research, where I worked with Andrew Ng and many other talented individuals, focusing on some of the first deep learning models on large clusters composed of GPUs.
Some of the work we did there focused on scaling laws: how does the quality of these models change as we vary the dataset size and compute resources? We published the first paper demonstrating that deep learning scales predictably to very large sizes: as dataset size, model size, and compute increase, model quality improves. That work may have been a precursor to Kaplan's scaling-law paper. I also noticed a trend suggesting that language would be the next important domain for deep learning, since it is the most compute-intensive area. Many of the models we studied, such as recurrent models, rely heavily on matrix multiplication, and the transformer is likewise built around matrix multiplication, unlike earlier deep learning work centered on convolutions.
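For readers less familiar with that line of work, the empirical finding is usually summarized as a power law. The form below is a generic illustration (the exponent, constant, and irreducible-loss term are placeholders that vary by task, not figures from the interview):

```latex
% Generic power-law scaling of validation loss L with dataset size D
% (illustrative only; \alpha, D_c, and L_\infty are task-dependent)
L(D) \;\approx\; \left(\frac{D_c}{D}\right)^{\alpha} + L_{\infty}
```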
Another exciting aspect of Baidu Research was the systems side, where a group conducted due diligence for Baidu Ventures to decide which organizations to invest in. They examined various hardware companies, which exposed me to innovations from companies like Cerebras, Groq, and Graphcore. I recognized that Cerebras's approach was especially promising for large language models, which I believed would be the future direction of the field. That has largely come to fruition, and I joined Cerebras about five and a half years ago.
Over time, I have worked on various tasks at Cerebras. Currently, my team primarily focuses on pre-training work on the system. We pre-train very large models, such as the best Arabic-English multilingual model in the Middle East, which contains 30 billion parameters, and we are now collaborating with a team called Core42 to train the next generation of models. Much of my team's work involves making sure we have the right training recipes and datasets, determining how to scale, and keeping training stable. We also contribute our findings and methods to other teams at Cerebras. I work closely with our applied machine learning team, since they interact with clients daily; they handle some pre-training work as well as a lot of post-training and adaptation work.
Host: That’s very cool. We are actually at the NeurIPS conference, and Cerebras has a booth there. They are advertising to all the machine learning attendees, trying to get people to learn more about their systems.
Joel Hestness: Yes, Cerebras has a processor designed specifically for machine learning. It is a full wafer processor, so we don’t need to cut the wafer into chips and package them like we do with GPUs or other accelerators; we can keep it as a complete wafer. The performance of the full wafer is equivalent to about 16 to 20 GPUs, and because we keep it as a whole, programming and scaling to very large applications is very straightforward.
Host: I think we discussed Cerebras on my channel before. Yes, what is unique is that they treat the entire wafer as one giant chip rather than breaking it into smaller chips, which is more common. So, it’s exciting to see different accelerators used for AI training and inference besides GPUs.
So Joel, what do you think is particularly exciting about what’s happening at Cerebras right now?
Joel Hestness: Absolutely, yes. We have some important things to discuss at the NeurIPS conference. One of the main topics my team is working on is ultra-large models. In recent months, we have demonstrated a trillion-parameter model running on our hardware in collaboration with Sandia National Laboratories.
The interesting part is that this is a trillion-parameter model that can run on our single device. For those who are less familiar, if you wanted to train a trillion-parameter model on GPUs, you would have to break it down into multiple independent devices and then stitch them together to perform the full computation.
The structure of our hardware architecture allows us to relatively easily store weights on a set of separate servers and then stream them to our device. This allows us to run the model on a single device. However, I think you actually don’t want to train a trillion-parameter model to convergence on a single device because it requires a lot of computation and takes a very long time.
That’s why we have horizontal scaling capabilities to perform data parallelism across multiple devices. This allows us to scale and train very, very large models. We look forward to the next steps in this area.
Host: A trillion parameters is roughly the same scale as GPT-4. So, that’s a lot of parameters. If you were doing it on GPUs, I think you might need, uh, how many GPUs? The memory on a GPU is less than 100GB. So you’d probably need at least 10 GPUs.
Joel Hestness: We have actually estimated that the 175 billion parameter GPT-3 used about 50 GPUs when they initially parallelized it. The trillion-parameter model we’ve recently heard about would require about 280 GPUs just to load the model into memory.
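As a rough back-of-the-envelope check on those figures, assuming the counts refer to full training state (weights, gradients, and Adam optimizer moments at roughly 20 bytes per parameter) and about 80 GB of HBM per GPU; these assumptions are mine, not from the interview:

```latex
175\times10^{9}\ \text{params} \times \sim\!20\ \tfrac{\text{bytes}}{\text{param}} \approx 3.5\ \text{TB}
\;\Rightarrow\; \tfrac{3.5\ \text{TB}}{80\ \text{GB/GPU}} \approx 45\text{--}50\ \text{GPUs},
\qquad
10^{12}\ \text{params} \times \sim\!20\ \tfrac{\text{bytes}}{\text{param}} \approx 20\ \text{TB}
\;\Rightarrow\; \tfrac{20\ \text{TB}}{80\ \text{GB/GPU}} \approx 250\text{--}280\ \text{GPUs}
```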
Host: Essentially just to load it into memory. Right, yes. For you, you have a chip, the Cerebras chip, and you have enough memory on that chip to hold a trillion parameters. How much memory does one of those nodes actually have?
Joel Hestness: Yes. The wafer itself has 40GB of SRAM, so it's not even DRAM, let alone off-chip memory. It's somewhat like the cache in a GPU: very close to the cores, accessible with very low latency, just a few cycles. We can use this memory in various ways. Early on, when the company was just starting, we treated the wafer like an FPGA: you could map the entire algorithm, meaning the entire model, onto the wafer and then pipeline your activations through it. We still support that architecture, which is actually another exciting topic I should circle back to. But we realized that for large-scale training we needed to leverage the throughput capabilities of the system. So we changed strategy: move the weights off the wafer and stream them in, for a kernel-style execution similar to what GPUs do. Each kernel runs on the wafer one at a time, operating on a set of activations that reside in that on-wafer memory.
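As a conceptual illustration of that kernel-style, weight-streaming execution, here is a minimal PyTorch sketch under my own simplifying assumptions (ordinary host memory stands in for the external weight store, and a plain linear stack stands in for the model); it is not Cerebras' actual software:

```python
import torch

def stream_forward(x, layer_weights_on_host, device="cpu"):
    """Run a stack of linear layers, streaming one weight matrix in at a time.

    Activations stay resident on the device; each layer's weights are fetched
    from external (host) memory, used once, and discarded.
    """
    x = x.to(device)                              # activations live on the device
    for w_host in layer_weights_on_host:          # weights live off-device
        w = w_host.to(device, non_blocking=True)  # stream this layer's weights in
        x = torch.relu(x @ w)                     # kernel-style execution for one layer
        del w                                     # weights are not kept resident
    return x

# Example: four layers of 1024x1024 weights kept on the host.
weights = [torch.randn(1024, 1024) for _ in range(4)]
out = stream_forward(torch.randn(8, 1024), weights)
print(out.shape)  # torch.Size([8, 1024])
```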
Host: You are basically moving computation to the data rather than moving data to the computation, right? Or what do you mean by that?
Joel Hestness: Pretty much. The parameters change once per training step, while the activations change at every layer, so the activations are the frequently changing, frequently reused values we want to keep close to the cores. The weights we keep off-chip and stream in, because they are updated less often.
Host: And the activation values are constantly changing.
Joel Hestness: Yes. In our pipelined execution model, when we train that way, the weights stay on the wafer, so they occupy a lot of on-wafer storage and the model size is limited. We also have to store the activations needed for backpropagation. Based on that observation, we released something new: just three months ago we launched our inference solution, which essentially reuses our earlier pipelined execution approach, laying the model out on the wafer and pipelining tokens through it. With it we can run inference on the 70-billion-parameter Llama 3 models at about 2,100 tokens per second. That is currently about four times faster than any other hardware (such as Groq or SambaNova) and about 15 to 20 times faster than current GPUs.
Host: That's quite fast. Yes. So you only have a limited amount of SRAM on the die, I guess. If all the model parameters fit there, you can decode with everything on-chip, which I guess is somewhat close to what Groq is doing. But you also have a lot of off-chip memory that can hold a complete set of weights. Obviously you can cache some of that locally alongside the activations, but the complete set of weights lives in off-chip memory. In terms of how much off-chip memory you can use, are you essentially unlimited?
Joel Hestness: Essentially, yes. The trillion-parameter model we recently demonstrated used a set of servers we call MemoryX nodes, which are essentially large DRAM storage systems. Then we can stack them up; we can parallelize across them, feeding data into the box. This set of supporting nodes is essentially infinitely scalable.
While we recently demonstrated a trillion-parameter model, we had actually demonstrated a trillion-parameter model internally over a year and a half ago. If someone wants to train models of such a large scale, we now have the capability to reach about 24 trillion parameters.
So we are starting to explore mixture-of-experts models. We have supported them in recent releases, and we expect their scale to keep growing, since additional parameters can improve model quality. We expect people to train models larger than anything we have heard about, or can infer, from OpenAI.
Host: So you're like VLSI taken to the extreme for ML hardware, I guess. Yes, for you, no number is too large. That's interesting. At Baidu Research, you were essentially playing the role of a venture capitalist, examining and thinking: should we invest here? Should we invest there? Did you think Cerebras had the best solution at that time? Or did you think all approaches were reasonable, and you just ended up collaborating with Cerebras?
Joel Hestness: I thought Cerebras was the best choice, and I think I held some unpopular opinions at the time. Many hardware companies initially aimed at inference, and I didn't think that was necessarily the right move: focusing on inference first simplifies the problem but can limit future generations of the hardware.
Cerebras, in contrast, focused on training. If you are going to do training, you have to handle inference anyway, plus everything else: backpropagation, optimizer steps, and weight updates. I appreciated that ambition. It gave them many options, and the focus on training provides a solid foundation.
The inference solution we just released is a prime example of this approach. For three and a half years, we focused solely on training. About a year ago, we decided to revisit inference as a potential solution we could offer. Sure enough, in the past year, we have been able to find a robust and effective solution.
Host: Very good. So is it easy to build on the existing stack you have?
Joel Hestness: Yes. Actually, it’s interesting; Cerebras has two separate teams responsible for the stack. We have one team focused on the machine learning stack, which is what most people use, and we’ve put a lot of effort into that. We also have a high-performance computing stack. That stack focuses on being able to build kernels, so you can run specific types of low-level computations.
The team that built that stack actually created a domain-specific language, and we are working to make it more general. But this domain-specific language is very useful for generating optimized kernels, and if you want to do that, you can add it to our PyTorch stack.
This is a key driver for the inference solution: in fact, we have tools that can write simplified versions of kernels and compile them. Therefore, we can very quickly generate most kernels for these large language models during inference.
It is also highly adaptable. Therefore, we are extending it to different types of models to be able to run those models on the hardware.
Host: So these kernels sound a lot like a direct analogue of CUDA kernels, right? It's like you write some optimized code, and then you can slot it into your…
Joel Hestness: Yes, the level of abstraction is very similar.
Host: Yes, very interesting. Maybe it’s time to transition to the second part, which is the hardware. I think Cerebras made headlines when you produced the largest wafer or single chip. How big are these chips?
Joel Hestness: Yes, the chip comes from a full 300mm wafer; it's the largest square we can cut from it. In fact, the corners are not perfectly square; they follow the round edge of the wafer. Essentially, the design of the wafer is a large, simple grid of cores, so it's a very simple logical architecture.
You can think of each individual core as being similar to a CUDA core in a GPU. It has narrow vector width and performs scalar-vector multiplication. When we map operations like matrix multiplication onto it, it is essentially just a large systolic-style array, and we pipeline weights or activations through it to perform the computation.
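To make that dataflow concrete, here is a toy model of mapping a matrix multiplication onto such a grid: at each step one slice of activations and weights is pushed through, and every cell performs a multiply-accumulate. It is a simplification for illustration, not the actual wafer dataflow:

```python
import numpy as np

def systolic_style_matmul(A, B):
    """Output-stationary toy model: each (i, j) cell accumulates one output
    element while operand slices are streamed through, one per 'cycle'."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(K):                 # one streamed slice per step
        a_col = A[:, t]                # activation slice entering the grid
        b_row = B[t, :]                # weight slice entering the grid
        C += np.outer(a_col, b_row)    # every cell does one multiply-accumulate
    return C

A = np.random.randn(4, 8)
B = np.random.randn(8, 3)
assert np.allclose(systolic_style_matmul(A, B), A @ B)
```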
We have roughly one million cores in our CS-3 generation. Compare that to the tens of thousands of cores in current AMD or NVIDIA hardware; that is where our 16-to-20-times performance difference comes from.
A single wafer, in one box, delivers performance equivalent to two NVIDIA DGX nodes, each with eight GPUs.
Host: So you'd pack as many GPUs as you can into a node, and even then you'd need two of those nodes to get roughly comparable performance. I know you mentioned that wafers are round when they're produced, so you fit the largest square inside. Is that to keep things simple? Or why don't you use…
Joel Hestness: Yes, mainly to keep it simple. One interesting aspect of the design is that it is meant to look like a logical array; for programmers, we want it to be a very simple structure, so a square is natural. It's not exactly square, in the sense that the number of cores in one dimension differs from the number in the other.
But logically, we set it up to present to the user as a regular core grid. I think one of the questions you are interested in is reliability. When manufacturing such a wafer, you have to deal with manufacturing defects. During the manufacturing process, we do have some cores that underperform or do not work.
So we have to turn them off and work around them. We have a redundancy process that helps fix issues, resulting in the logical core set presented to programmers.
Host: A perfect regular grid. Oh, so even if there are some failing cores, what programmers see is exactly a fully working core grid? Yes, that's right. That's very useful. It's like an SSD that hides its bad blocks.
Joel Hestness: Very similar to SSDs and DRAM, which include redundant rows and map out the bad ones.
Host: Yes. This also reminds me that when producing wafers, there are usually quite a few defects. Typically, I don’t know, Intel might get 40 or more chips from a single wafer. If there are too many defects in a single chip, they will discard it. This is what we call yield; it refers to the number of chips that are actually usable. Many companies will discard a chip if there is a single defect.
Some companies try to do better. For example, when IBM manufactured the Cell processor for the PlayStation and Game Boy, they took a different approach. The Cell processor was supposed to have nine cores to perform auxiliary tasks. If all nine cores worked correctly, they would put it in the PlayStation. If eight or seven cores worked correctly, they would put it in the Game Boy because it didn’t need that many cores. If fewer working cores were available, they would discard it.
However, when the entire wafer is a chip, your job becomes much more difficult. Do you have a threshold like that: too many defects, and it’s unusable? Or when you have to use a whole wafer, what is your actual yield?
Joel Hestness: Yes, yield is very important to us. The main statistical challenge is that defects are roughly independent, so the probability that every core works is one minus the per-core defect probability, raised to the power of the number of cores. Having the core count in that exponent is what makes this hard for us.
So we ensure we have enough redundancy to statistically cover all the defects we see in the standard process. Essentially, we achieve 100% yield for the wafer. However, we do have some wafers that end up slightly above the threshold, making it challenging to present a complete logical grid.
Therefore, those wafers end up as engineering parts for internal use, and we constrain the grid in different ways. The approach is very flexible: we can change what we call the standard structure, the logical grid image that runs on the device.
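To make the statistical point concrete, here is an illustrative yield calculation with assumed numbers (not Cerebras figures): with N independent cores and per-core defect probability p,

```latex
P(\text{all cores good}) = (1-p)^{N},
\qquad \text{e.g. } N = 9\times10^{5},\; p = 10^{-6}
\;\Rightarrow\; (1-p)^{N} \approx e^{-0.9} \approx 0.41 .
```

So even a one-in-a-million per-core defect rate would leave fewer than half of wafers completely defect-free without redundancy, which is why the spare cores and firmware remapping matter.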
Host: It’s almost like having a million chips instead of one chip. You just have to look at how many of those chips actually work. Then, you build a virtual layer or network that indicates how to access those chips that are actually working. How is this mapping, this virtual mapping, achieved? Is it all done in hardware, or do you flash some firmware to it? Or how does that work?
Joel Hestness: Yes, it is essentially implemented in firmware. In fact, one very cool aspect is that when we perform burn-in testing on the system, when we turn them on, we put them into a cluster to run. My team runs a lot of workloads there. Sometimes, you will find that the reliability of different cores changes during the early use of the system, due to different thermal stresses, etc.
Host: You try one thing, and the core works; you try another thing, and the core doesn’t work again?
Joel Hestness: Yes. I mean, we are putting the biggest workloads into these things; we are essentially burn-in testing them, and we are stress-testing them early so we can see if we need to fix issues before delivering them to customers. Whenever we see some problems, such as a wafer stalling, that stall is likely caused by packet loss or bit flips causing control registers to be incorrect. We have tools to scan and fix these issues, and then we can put the system back in and map out certain other cores. It’s surprising how quickly we can reintegrate them and rerun. Once we complete this process, the burn-in testing process tends to be very reliable.
Host: Also, that reminds me: I bought one of the very early Ryzen processors, the 1800X, and they were prone to segmentation faults. If you ran certain workloads you would get segfaults in user space; I think it only affected user space, but it was still a pretty serious bug. My processor was affected, so I had to run some tests, send it back, and get a new chip, and that one was fine. But one thing that struck me was that my system had been unstable and buggy for a long time; I assumed it was memory instability and had no idea the CPU was the cause. So I feel that especially in CPUs, I mean, here you have simpler cores, but in a CPU a single defect can manifest in very strange ways, right? So have you mapped out all the different ways: if this part of the chip fails, we should expect this to happen; if that part fails, we should expect that to happen?
Joel Hestness: Yes, we have a huge taxonomy. We have a dedicated team focused on detecting and fixing these issues, because they are very difficult to debug. I'd say it's akin to debugging convergence problems: if you are training something very large and the model becomes unstable and crashes midway through training, how do you debug it? Do you have to spend all that compute again just to get back to the same point before you can inspect it? That's a tricky problem, and hardware faces the same challenge. If you see a NaN during training and the gradient norm looks fine and everything else looks good, it could be a system issue. So we maintain a taxonomy of different failure modes, along with a set of strategies and tools for debugging them in different ways.
Host: Do the chips you produce have the capability to guard against various failures in the future? For example, I imagine that if your hardware runs for five years and two cores stop working, how would you handle that? Would you analyze it again and take those cores offline to continue running? Or if a failure occurs, is it more likely to be an all-or-nothing situation, like some core circuits of the chip are damaged, preventing the chip from starting at all?
Joel Hestness: Yes, that's a bit of a tricky question, partly because Cerebras has not been around that long, so the installed base and accumulated runtime of our systems aren't yet large enough to draw firm conclusions.
We are fortunate; I think our CS2 wafers are performing well. We have two large CS2 clusters that are actually our most reliable systems. They have been running at full capacity for two and a half years, maintaining full utilization during that time. So they are performing very well.
Regarding future protective capabilities, yes, I would say the ability to repair wafers and still present a logical core grid is very valuable. Therefore, if certain cores appear to be on the edge, we won’t let them continue to run because they may not be reliable; instead, we will map them out directly.
So I think this should help extend the lifespan of individual systems. However, I think further exploration is needed to see if there are serious failures we believe would damage the entire system. So far, things have been good.
Host: Very good. That’s really cool because server hardware is typically designed to run for 10 years, right? They have high design standards, but if the data center has enough servers, failures will still occur frequently. Most failures require replacing servers, which is not good for the environment, and it’s also not good because you have to constantly rely on the supply chain to provide those old parts, or you do partial upgrades, but now nothing matches, and so on. So I think it’s really cool because you are potentially building a system that can continue to run and compute even if it’s no longer the fastest system, 10 years later. So we are very hopeful about that.
Joel Hestness: Yes, just a few years ago there were still NVIDIA K80 GPUs in AWS running very reliably. We hope our products will have that kind of longevity as well.
Host: Yes, that’s very cool. I mean, such a large chip generates a lot of heat. You mentioned you have data centers, which means more than one such chip. So how do you keep it cool?
Joel Hestness: That’s a great question. Our wafers are specially designed. If you look at the wafer—I hope there’s a demonstration here—there are some holes on the wafer specifically designed to help us connect the wafer to the power delivery from the back and the cooling system from the front.
On the front, cooling plates attached to the wafer connect to a component we call the “engine block,” which circulates water. Inside the module, we circulate water to carry heat to a heat exchanger, and those heat exchangers can ultimately reject the heat to air or pump the water out to separate heat-exchange facilities.
Interestingly, a few years ago researchers studying wafer-scale integration thought the power of a full wafer would be too high, that the rack power density would be too high. One cool thing about our design is that we can handle this power density; it's actually very similar to the power density of NVIDIA's H100 systems.
Their DGX boxes are very impressive. The cool thing is that our complete device consumes about 15 kilowatts and occupies roughly the same rack space as a couple of NVIDIA DGX boxes. When you add up eight 700-watt H100 GPUs per box, and then a couple of boxes together, the power density works out to be quite similar to GPUs.
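A rough back-of-the-envelope for that comparison, using the GPU power figures from the conversation and my own assumption for host and networking overhead:

```latex
8 \times 700\ \text{W} \approx 5.6\ \text{kW of GPU power per DGX box}
\;\Rightarrow\; \sim\!11\ \text{kW per box including hosts and networking};
\qquad
2\ \text{boxes} \approx 22\ \text{kW} \ \text{vs.}\ \sim\!15\ \text{kW per wafer-scale system.}
```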
Host: I've actually talked to others at this conference who said a box with eight H100s is 11 kilowatts, and with NVIDIA's upcoming B200 (Blackwell) that goes from 11 kilowatts to about 14 kilowatts, which is a bit much. In the past you could fit four H100 nodes (each with 8 GPUs) into one rack, which fits many standard data-center power budgets, but now you can't really do that because they're right at the limit. So that's interesting. But I guess at 15 kilowatts you can still fit two of your systems into those power-constrained racks, and each of them is much more powerful than a node full of Blackwell chips.
Joel Hestness: Yes, so the performance density ends up being very similar to H100 and B100. Within a rack, we have exactly the same limitations. Different data centers have different power limits per rack, so we have to adapt to that. So in some cases, we place one box in a rack. Recently, we have been able to start stacking and placing two boxes in each rack. So it’s very similar to the constraints NVIDIA faces.
Host: Yes. I think the main difference is that you are shifting much of the complexity of dealing with wafer defects from hardware to software; in the traditional approach, a lot of chips get discarded. I don't know what NVIDIA's yield is, but they certainly have to discard some chips. You are shifting that complexity into firmware and software, which is hard work because you have to detect all those faults, but it's very flexible, right? You can say “let's turn that core off right now,” and the hardware keeps working, so your overall yield is essentially 100%, which is really cool. If you think about the resources used to produce two Cerebras nodes versus two nodes full of Blackwell chips, I'd guess you are able to use the supply chain and the silicon more efficiently. You take on more complexity, but it brings a lot of advantages.
Joel Hestness: It does give us an advantage by simplifying parts of the supply chain. There are fewer components, and we can deal with reliability issues one component at a time, so the reliability of the assembled system is comparable or better.
Host: Also. Okay, let’s continue discussing the third part, which is hardware access. Specifically, how do Cerebras customers access the hardware? You produce chips; do you sell them directly? Do you sell cloud access? How does that work?
Joel Hestness: Yes, we have four different interaction methods. We do sell hardware. We have been deploying it for some time now. Only a few organizations have the capability to do this deployment.
Host: If you can say, what is the approximate price of a Cerebras node? Or maybe I'd need to talk to your people at the company. I'd guess it's a few million dollars.
Joel Hestness: Our goal is to have price performance comparable to GPU products. Yes, we sell hardware. We also have our own cloud internally. So you can contract with us to use our cloud. It’s a bit like a reserved instance setup.
We also have several different vendors. If you need HIPAA compliance or anything else, we have vendors that can help you securely store data, etc. We provide white-glove services, where we actually work with clients on some machine learning tasks. Currently, that’s where most of our clients are.
They come to us for machine learning expertise, either because they don't want to deal with the complexities of the hardware or because they simply don't have the personnel to do it themselves. So they come to us, we help them run directly on the hardware, and we can assist with machine learning services.
Recently, we launched the inference solution I mentioned. We just released an API where you can send requests to our system. Our inference infrastructure is also set up in our cloud. It operates similarly to the APIs from OpenAI or Anthropic.
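Since the interview only says the API "operates similarly to the APIs from OpenAI or Anthropic," here is a hedged sketch assuming an OpenAI-compatible chat-completions endpoint; the base URL and model name are placeholders to be replaced with the values in the provider's documentation:

```python
from openai import OpenAI

# Placeholder endpoint and model name; substitute the real values from the
# provider's docs. The only assumption is OpenAI-compatible chat completions.
client = OpenAI(
    base_url="https://api.example-inference-provider.com/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b",  # placeholder model identifier
    messages=[{"role": "user",
               "content": "Summarize wafer-scale inference in one sentence."}],
)
print(resp.choices[0].message.content)
```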
Host: Groq has something similar, right? You just send data, send a query in the cloud, and then get the inference results back. Right, exactly. For your inference API, do you only use open-source models? Or can companies provide their own models? I guess in that case, they would probably need to purchase hardware.
Joel Hestness: We do it both ways. This is still quite new, and we are in discussions with a few different clients. Our existing open API uses open-source models, so we support all the Llama 3 models, including version 3.3, which was just released last week. We are also in talks with some large organizations that need more throughput and lower latency; in those cases I expect we will also deploy custom models. We have the capability to build support for custom models, so if you have specific layers or other components that differ from Llama, we can do that.
Host: Very interesting. Are all your cloud systems in the United States, or do you have some elsewhere? I know, for example, in Europe, you need to comply with GDPR and data governance requirements, as well as in Canada. So what’s the situation for you?
Joel Hestness: Yes, we have done everything needed to offer cloud services in a given region. Most of our cloud services are located in the United States, but we also have some available in Europe. We have a green cloud and a HIPAA-compliant cloud. We also work directly with some national laboratories and university organizations that have their own data centers, so academics can apply for compute-time grants to access our hardware. These are distributed around the world.
Host: Sounds like a pretty wide range, actually. That's cool.
Joel Hestness: We try to make sure we can serve as many regions around the world as possible, while also complying with any security requirements.
Host: Yes. So when customers access your hardware, do they have to use special compilers or Python packages, or… because I assume people want to use their familiar Python code, which is what all machine learning practitioners like to use nowadays. So how does that translate to the Cerebras system?
Joel Hestness: Of course, yes. Our stack supports various frameworks, with PyTorch being the most commonly used one. Most people are using our PyTorch framework. The front end is standard PyTorch, so many of the models you see on Hugging Face can be directly ported to our architecture. We have set it up so that you can install a package for our backend via pip, so it’s somewhat like using PyTorch for another accelerator backend. It’s just a package installed via pip.
Host: So you will need to merge any custom code you need into PyTorch so that it is included by default?
Joel Hestness: We haven't upstreamed it into standard PyTorch yet. Some of it isn't really PyTorch-specific: PyTorch has its front end, which is the model, but it also has internal intermediate representations, such as the ATen tensor library. We basically take the intermediate representation at around the ATen level and convert it into something we can lower to our hardware. Today we fully support compiling models to the hardware; compilation gives the best performance, and we are building infrastructure that will also allow eager execution. Most PyTorch users are used to calling each operation eagerly and inspecting the results of each kernel; we have some preliminary support for this and plan to build it out more comprehensively.
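A hedged sketch of the workflow described above: install a pip package that provides an accelerator backend for PyTorch, move a standard model to that backend's device, and prefer compiled execution for best performance. The package and device names below are placeholders rather than Cerebras' actual identifiers, and torch.compile stands in for the vendor's compile path:

```python
# pip install some-accelerator-backend   # hypothetical backend package
import torch
import torch.nn as nn
# import some_accelerator_backend        # hypothetical: registers a custom device

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

device = "cpu"  # with a real backend installed, this would be its device string
model = model.to(device)
x = torch.randn(8, 1024, device=device)

# Graph/compiled execution is where such backends get their best performance;
# eager execution (calling ops one by one) is handy for debugging.
compiled = torch.compile(model)
print(compiled(x).shape)  # torch.Size([8, 1024])
```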
Joel Hestness: The stack comes from the framework. We go through several different intermediate representations, and for those familiar with compilers, you bind different aspects of the computation to the bits of hardware. So our stack handles binding different parts at different levels.
Host: So you pre-allocate, like this operation or that operation will run on this specific logical core?
Joel Hestness: Yes, exactly. It runs on a specific device, at a specific precision, and so on. All of that has to be resolved: how memory is laid out, and things like that.
I should also mention that we have recently been testing the direct porting of models from Hugging Face. So basically, what we’ve done is build a pipeline that allows us to extract different modules from models in Hugging Face and then independently test the compilability of those modules. We are building quite a comprehensive feature set to support Hugging Face models. Now we are focusing on about 1400 repositories. So I’m excited that we will be able to support almost everything you have there.
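The real pipeline is not public, so here is a purely hypothetical sketch of per-module testing for Hugging Face models, using torch.fx symbolic tracing as a stand-in for "does this module lower to the target":

```python
from torch.fx import symbolic_trace
from transformers import AutoModel

def check_module_traceability(model_name: str):
    """Extract leaf modules from a Hugging Face model and record which ones
    can be symbolically traced (a stand-in for target compilability)."""
    model = AutoModel.from_pretrained(model_name)
    report = {}
    for name, module in model.named_modules():
        if name and not list(module.children()):   # leaf modules only
            try:
                symbolic_trace(module)
                report[name] = "traceable"
            except Exception as exc:
                report[name] = f"not traceable: {type(exc).__name__}"
    return report

# Example (downloads weights on first run):
# print(check_module_traceability("gpt2"))
```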
Host: That's cool. Interacting with the ML ecosystem is difficult, right? Because you either have to implement CUDA compatibility or re-implement PyTorch, and so on. Right.
Joel Hestness: Yes, so whenever we say we support PyTorch, it’s a constrained subset of it, and we are actively building towards a complete PyTorch set. But the complete PyTorch has about 2000 or 2200 operations. Yes. Many of these operations are hardware-specific. They perform operation fusion and such, which may not be necessary on our hardware. So we have to carefully decide which ones to build over time.
Host: So inside the chip, it sounds like you have some kind of network between the different cores, like an interconnect, right? So they can send packets to each other and so on. I also think if you have multiple nodes, you must have some kind of network, Ethernet, InfiniBand, I don’t know, they must be able to communicate quickly enough between each other. What kind of special technology do you have there, or do you just use off-the-shelf technology?
Joel Hestness: We currently use off-the-shelf Ethernet. The complete system device architecture involves three different types of nodes. So these are our accelerator nodes, which contain the wafers, but they connect to a set of nodes called MemoryX nodes. These nodes are used to store weights. Then there is a set of nodes we call SwarmX, which are the nodes that perform the communication between MemoryX nodes within the box and between boxes. These are essentially network nodes.
Host: Okay, so you have quite a few different types of nodes. Clearly, some have Cerebras processors, some are just storage systems, which I guess have a lot of memory, and then there are the network-type nodes, which might be standard routers and switches and so on, to make sure everything can communicate.
Joel Hestness: Okay, yes. So when we train a model, the model resides in the MemoryX nodes. We use a weight-streaming architecture, somewhat like a parameter server, where the MemoryX nodes send the model to the various compute boxes via our SwarmX nodes. A software layer virtualized on SwarmX broadcasts weights to the individual compute units and also uses a reduction tree to reduce gradients for the final optimizer step. We are also expanding the communication primitives, the collective communication operations you can use between devices, so there will soon be some interesting opportunities to use this infrastructure in different ways.
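Here is a toy, single-process illustration of that flow under my own simplifications: the canonical weights live off the accelerators (the MemoryX role), each compute box works from a broadcast copy (the SwarmX role), and gradients are reduced back for a single optimizer step. It is conceptual only, not Cerebras' software:

```python
import torch

n_boxes, lr = 4, 1e-3
master_w = torch.randn(1024, 256)            # canonical copy held off-accelerator

# Broadcast: each compute box receives the same replica of this layer's weights.
replicas = [master_w.clone().requires_grad_(True) for _ in range(n_boxes)]

# Each box runs forward/backward on its own shard of the global batch.
for w in replicas:
    x = torch.randn(32, 1024)                # this box's data shard
    loss = (x @ w).pow(2).mean()             # stand-in loss
    loss.backward()

# Reduce: gradients are averaged across boxes (a reduction tree in practice),
# then one optimizer step is applied to the canonical copy.
avg_grad = torch.stack([w.grad for w in replicas]).mean(dim=0)
with torch.no_grad():
    master_w -= lr * avg_grad
print(master_w.shape)  # torch.Size([1024, 256])
```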
Host: So you not only have Cerebras chips but also an actual distributed system with various specialized nodes, etc. This reminds me of the machine learning hardware co-design that many people have been discussing at the conference. Just as you specialize the machine learning algorithms, for instance, using 2-bit quantization instead of 4 or 8-bit computations, you know, the hardware has to change. And what you have looks like a very good foundation for future evolution because it is a distributed system. You have already done specialized designs, like this is where the memory is, this is where the processor is, etc., which sounds very cool.
Joel Hestness: Of course, our goal is to be quite general-purpose. I think we need some time to figure out where the interesting workloads are, but for now deep learning is a great target application and a good example of the more general things we are doing.
There may be another related aspect. I mentioned we also have a team focused on high-performance computing work. The workloads they focus on are very interesting, such as molecular dynamics or partial differential equation solvers based on regular codes (like stencil operations) that need to communicate and update with their neighbors.
That was actually the original inspiration for our deep learning weight-streaming matrix multiplication architecture. And because the wafer is very large and has a large on-chip memory capacity, it's great for stencil codes, which are demanding and difficult to do well on distributed-memory architectures, where you have to go out to DRAM or HBM to fetch parts of the stencil data. The applications we announced a few weeks ago at the Supercomputing Conference run about 100 times faster than contemporary hardware.
Host: Well, thank you for telling us all about Cerebras. This is very interesting. Is there anything else you would like our audience to know?
Joel Hestness: Of course, yes. I think now is an excellent time to be working on deep learning, large-scale training, and compute. I believe inference will soon become very important; I mentioned our inference solution is already available. Perhaps I should add that if academics want to build on our inference solution, we can provide grants. It's extremely fast, so if you are working on workloads with real-time requirements, or doing inference-time computing, please let us know and we can try to find some compute for you. We believe inference-time computing will soon become very important. It's also exciting that there is a new scaling law: simply increase the amount of inference-time computation and you can achieve better quality. It will be very interesting to see how that relates to scaling training-time compute. That's where we stand today.
Host: I think there's a lot of parallel development here, because models like OpenAI's o1 essentially leverage inference-time computing to achieve something like 100,000 times the efficiency compared with putting all of that compute into training. You are doing the same thing, but not as a response to that model, since you were doing it before o1 existed; everyone is thinking about inference-time computing options. This is an exciting time. Thank you very much for coming, and thank you all for watching.