Cerebras-GPT is a family of seven GPT models ranging from 111 million to 13 billion parameters. The models are based on the GPT-3 architecture, a transformer-based language model that generates natural-language text from a given input. They are trained following the Chinchilla recipe, a scaling result that describes how to spend a fixed training compute budget on an LLM: the number of training tokens should grow in proportion to the number of model parameters, at roughly 20 tokens per parameter.
Cerebras-GPT models were trained on the Andromeda AI supercomputer, which is made up of 16 CS-2 wafer-scale systems. Each CS-2 is built around a single wafer-scale chip with 850,000 AI-optimized cores and 40 GB of on-chip memory. The CS-2 systems use Cerebras’ weight streaming technique, which simplifies LLM training by decoupling compute from model storage. This allows training to scale efficiently across systems using plain data parallelism.
Cerebras-GPT models are open source and distributed under the Apache 2.0 license on Hugging Face and GitHub. They can be used for text generation, text summarization, question answering, sentiment analysis, and other natural language processing tasks. The models can also be fine-tuned to improve performance and accuracy on specific domains or datasets. Cerebras’ pre-training and fine-tuning workflows are available in the cloud through the Cerebras Model Studio.
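As a rough illustration of using the released checkpoints, the sketch below loads one of them with the Hugging Face `transformers` library. It assumes the checkpoints are published under repository names of the form `cerebras/Cerebras-GPT-<size>` and that `transformers` and PyTorch are installed; the helper names `model_id` and `generate` are hypothetical and exist only for this example.

```python
# Sizes listed in this article; the repository naming scheme is an assumption.
MODEL_SIZES = ["111M", "256M", "590M", "1.3B", "2.7B", "6.7B", "13B"]


def model_id(size):
    """Build the assumed Hugging Face repository name for a model size."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown Cerebras-GPT size: {size}")
    return f"cerebras/Cerebras-GPT-{size}"


def generate(prompt, size="111M", max_new_tokens=40):
    """Greedy-decode a continuation of `prompt` with the chosen checkpoint."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = model_id(size)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Generative AI is ")` downloads the weights on first use, so the smallest model is the sensible starting point for a quick test.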
Cerebras-GPT models are intended for use and reproduction by anyone who wants to harness LLMs to build AI applications. Cerebras aims to foster a collaborative and inclusive AI community by offering free access to cutting-edge models trained on open datasets and architectures. The models also demonstrate the ease and scalability of training LLMs on the Cerebras software and hardware stack.
Cerebras-GPT: A New Model for Open LLM Development
Artificial intelligence has the potential to transform the global economy, but access to it is becoming increasingly restricted. OpenAI’s GPT-4, the most recent large language model, was released with no details about its model architecture, training data, training hardware, or hyperparameters. Companies are increasingly building large models on closed datasets and making model outputs available only through API access.
We think that access to cutting-edge models that are open, reproducible, and royalty-free for both research and commercial applications is critical for LLMs to be an open and accessible technology. To this end, we developed Cerebras-GPT, a family of transformer models trained using the latest techniques and open datasets. These are the first GPT models trained with the Chinchilla recipe and released under the Apache 2.0 license.
Large language models can be divided into two groups. The first includes models such as OpenAI’s GPT-4 and DeepMind’s Chinchilla, which are trained on private data to reach the highest possible accuracy. However, the trained weights and source code for these models are not publicly available. The second group includes open-source models such as Meta’s OPT and Eleuther’s Pythia, which were not trained in a compute-optimal way.
DeepMind found that large language models achieve the highest accuracy for a fixed compute budget when trained on 20 data tokens per model parameter. A one-billion-parameter model therefore needs to be trained on 20 billion data tokens to get the best result for a given compute budget. This is sometimes referred to as the “Chinchilla recipe.”
This finding implies that using the same amount of training data when training a family of model sizes is not optimal. For example, training a small model with too much data results in diminishing returns and lower accuracy gains per FLOP; instead, a larger model with less data would be preferable. A large model trained on insufficient data, on the other hand, does not reach its full potential; it is preferable to reduce the model size and feed it more data. In each case, using 20 tokens per parameter is optimal, per the Chinchilla recipe.
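The trade-off above can be made concrete with a little arithmetic. The sketch below assumes the common rule of thumb that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); that constant is an assumption of this example, not something stated in this article. Combined with D = 20·N, it gives the model size and token count that exactly spend a given budget, and it shows how far a smaller model drifts from 20 tokens per parameter on the same budget.

```python
import math

TOKENS_PER_PARAM = 20       # Chinchilla ratio quoted in the text
FLOPS_PER_PARAM_TOKEN = 6   # assumed rule of thumb: C ~ 6 * N * D


def training_flops(params, tokens):
    """Approximate training compute for `params` parameters and `tokens` tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(compute_flops):
    """Return (params, tokens) that spend the budget at 20 tokens per parameter."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


def tokens_per_param(compute_flops, params):
    """Tokens-per-parameter ratio if the whole budget goes to a model of `params`."""
    tokens = compute_flops / (FLOPS_PER_PARAM_TOKEN * params)
    return tokens / params


# Budget that a 1B-parameter model uses on 20B tokens under the assumed rule.
budget = training_flops(1e9, 20e9)
n_opt, d_opt = compute_optimal_split(budget)
print(f"optimal: {n_opt:.3g} params, {d_opt:.3g} tokens")

# Spending the same budget on a model half that size over-trains it.
print(f"0.5B model on same budget: {tokens_per_param(budget, 0.5e9):.0f} tokens/param")
```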
EleutherAI’s Pythia open-source model suite is particularly valuable to researchers because it provides a broad range of model sizes trained on the public Pile dataset with a controlled training methodology. Pythia, however, was trained with a fixed number of tokens across all model sizes in order to provide an apples-to-apples baseline across models.
Cerebras-GPT was designed to complement Pythia by covering a wide range of model sizes on the same public Pile dataset while establishing a training-efficient scaling law and family of models. Cerebras-GPT is made up of seven models with 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B parameters, each trained on 20 tokens per parameter. By using the optimal number of training tokens for each model size, Cerebras-GPT achieves the lowest loss per unit of compute across all model sizes.
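To see what 20 tokens per parameter means for each model in the family, the following sketch multiplies the nominal parameter counts listed above by 20. The nominal sizes are rounded labels, so the token counts actually used in training may differ slightly from these figures.

```python
# Nominal parameter counts taken from the model names in the text.
MODEL_PARAMS = {
    "111M": 111_000_000,
    "256M": 256_000_000,
    "590M": 590_000_000,
    "1.3B": 1_300_000_000,
    "2.7B": 2_700_000_000,
    "6.7B": 6_700_000_000,
    "13B": 13_000_000_000,
}


def chinchilla_tokens(params, tokens_per_param=20):
    """Training tokens implied by a fixed tokens-per-parameter ratio."""
    return tokens_per_param * params


for name, params in MODEL_PARAMS.items():
    tokens = chinchilla_tokens(params)
    print(f"{name:>5}: {tokens / 1e9:6.1f}B tokens")
```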
New Scaling Law
Training a large language model can be costly and time-consuming. Maximizing the model’s performance requires a large amount of compute and expertise. One way to address this is to train a family of models of varying sizes, which makes it possible to derive a scaling law describing the relationship between training compute and model performance.
Scaling laws are critical in LLM development because they let researchers estimate a model’s expected loss before training, reducing the need for expensive hyperparameter searches. OpenAI was the first to establish a scaling law showing a power-law relationship between compute and model loss. DeepMind then conducted the Chinchilla study, which identified an optimal ratio between compute and data. These studies, however, used closed datasets, making it difficult to apply their conclusions to other datasets.
Cerebras-GPT extends this work by establishing a scaling law on the open Pile dataset. The resulting scaling law provides a compute-efficient recipe for training LLMs of any size on the Pile. We believe that by releasing our findings, we can contribute a valuable resource to the community and support the development of large language models.
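A power-law relationship between compute and loss has the form loss = a · compute^b, which is a straight line in log-log space. The sketch below fits such a line with ordinary least squares and uses it to extrapolate. The data points are synthetic, generated from made-up constants purely to exercise the code; they are not Cerebras-GPT results, and the function names are hypothetical.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from log space
    return a, b


def predict_loss(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic data from invented constants (illustration only, not measurements).
TRUE_A, TRUE_B = 30.0, -0.06
compute = [1e18, 1e19, 1e20, 1e21]
loss = [TRUE_A * c ** TRUE_B for c in compute]

a, b = fit_power_law(compute, loss)
print(f"fitted a={a:.3f}, b={b:.4f}")
print(f"extrapolated loss at 1e22 FLOPs: {predict_loss(a, b, 1e22):.3f}")
```

In practice the measured losses of a family of small models would replace the synthetic list, and the fitted line would be used to estimate the loss of a larger model before committing compute to it.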
Model Performance on Downstream Tasks
Cerebras-GPT was evaluated on several task-specific language benchmarks, including sentence completion and question answering. This matters because, while the models may have strong natural language understanding, that ability may not transfer to specialized downstream tasks. As seen in Figure 4, Cerebras-GPT retains state-of-the-art training efficiency on the majority of common downstream tasks. Notably, while prior scaling laws have shown scaling for pre-training loss, this is the first time scaling results for downstream natural language tasks have been reported.
Cerebras CS-2: Simple, Data-Parallel Training
Training such large models on GPUs requires a high level of technical skill. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors for compute infrastructure and scaling. To see why, we’ll look at existing approaches to scaling LLMs on GPUs.
Data parallelism is the simplest way to scale. It replicates the model on each device, feeds each replica a different training batch, and averages the resulting gradients. Clearly, this does not address the issue of model size: if the complete model does not fit on a single GPU, it fails.
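A minimal sketch of the gradient-averaging step, using a one-variable linear model so it runs without any libraries. The "devices" here are just slices of a Python list, and the function names are hypothetical. With equally sized shards, the averaged gradient matches the gradient computed over the full batch.

```python
def grad_mse(w, b, batch):
    """Gradient of mean squared error for y = w * x + b over one batch."""
    n = len(batch)
    dw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2 * (w * x + b - y) for x, y in batch) / n
    return dw, db


def data_parallel_grad(w, b, batch, num_devices):
    """Split the batch across replicas, compute local gradients, then average."""
    if len(batch) % num_devices != 0:
        raise ValueError("batch size must be divisible by num_devices")
    shard_size = len(batch) // num_devices
    shards = [
        batch[i * shard_size:(i + 1) * shard_size] for i in range(num_devices)
    ]
    # Every replica holds the same weights (w, b) and sees a different shard.
    local_grads = [grad_mse(w, b, shard) for shard in shards]
    # Averaging step (an all-reduce on real hardware).
    dw = sum(g[0] for g in local_grads) / num_devices
    db = sum(g[1] for g in local_grads) / num_devices
    return dw, db


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
print("full batch   :", grad_mse(0.5, 0.0, batch))
print("data parallel:", data_parallel_grad(0.5, 0.0, batch, num_devices=4))
```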
A common alternative is pipelined model parallelism, which runs different layers as a pipeline on different GPUs. However, as the pipeline depth increases, activation memory grows quadratically, which can be prohibitive for large models. To avoid this, another common option is to split individual layers across GPUs, known as tensor model parallelism. This, however, requires extensive communication between the GPUs, which complicates the implementation and can slow it down.
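To illustrate the tensor model parallel idea, the sketch below splits a layer's weight matrix by columns across hypothetical devices. Each device computes its own slice of the output, and the slices are then gathered back together, which is the communication step mentioned above. This is a toy using Python lists, not how any particular framework implements it.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def split_columns(w, num_devices):
    """Give each device a contiguous block of the weight matrix's columns."""
    cols = len(w[0])
    if cols % num_devices != 0:
        raise ValueError("column count must be divisible by num_devices")
    per_device = cols // num_devices
    return [
        [row[d * per_device:(d + 1) * per_device] for row in w]
        for d in range(num_devices)
    ]


def tensor_parallel_matmul(x, w, num_devices):
    """Compute x @ w with w's columns sharded across devices."""
    shards = split_columns(w, num_devices)
    # Each device multiplies the full input by its own column shard.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate the output column slices row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]
print(tensor_parallel_matmul(x, w, num_devices=2))
print(matmul(x, w))
```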
Because of this complexity, there is currently no single way to scale on GPU clusters. Training large models on GPUs requires a hybrid strategy that combines all of these forms of parallelism; the implementations are complex and difficult to set up, and there are substantial performance pitfalls.
Two recent large language models (Figure 6) illustrate the complications involved in splitting large language models across many GPUs. Meta’s OPT, with model sizes ranging from 125M to 175B parameters, was trained on 992 GPUs using a combination of data parallelism, tensor parallelism, and memory optimization techniques. Eleuther’s 20B-parameter GPT-NeoX model was trained on 96 GPUs using a combination of data, tensor, and pipeline parallelism.
Cerebras-GPT was trained on 16 CS-2 systems using standard data parallelism. This is possible because the Cerebras CS-2 systems have enough memory to run even the largest models without splitting the model. We then built the Cerebras Wafer-Scale Cluster around the CS-2 to allow for easy scale-out. It uses weight streaming, a hardware/software co-designed execution mode that lets model size and cluster size scale independently, without model parallelism. With this design, scaling to larger clusters is as simple as changing the number of systems in a configuration file.