Corresponding Author E-mail: firstname.lastname@example.org
Written August 25, 2020
The introduction of masked language models like BERT and the ability to fine-tune them for downstream tasks have given the NLP domain new impetus during the past few years. Though smaller models like DistilBERT increase the efficiency of NLP tasks and allow their deployment on edge devices, their training and fine-tuning still consume a lot of time during development. In this work, we show how the distillation process can be scaled up using a distributed training scheme and other training techniques. We show, in particular, that the baseline training can be sped up by a factor of over 13 when moving from a single computer with 8 GPUs to a cluster with 64 GPUs, while retaining the language capabilities of the original model as measured by the loss function and perplexity.
Keywords: DistilBERT, distributed training, perplexity, benchmark
1. Introduction to NLP Models
The field of natural language processing (NLP) studies human languages with the aim of producing a computerized representation of a language. There are many such language models. One of these is the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018), which uses a transformer (Vaswani et al., 2017) to encode an entire sentence all at once without paying attention to the directionality of the words in the sentence. During training, tokenized sentences are passed into the model, 15% of the tokens are masked out, and the model fills in the gaps with its best prediction. The BERT model is therefore an example of a masked language model, as opposed to a directional language model, where the model predicts the next word given the start of a sentence.
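The masking step described above can be sketched in a few lines of Python. This is a simplified illustration, not the BERT implementation: the `[MASK]` symbol, the function name `mask_tokens`, and the choice to always substitute the mask symbol are assumptions made here (the original BERT recipe also sometimes keeps or randomly replaces a selected token).

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol for this sketch


def mask_tokens(tokens, mask_fraction=0.15, seed=None):
    """Mask a fraction of the tokens and return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None
    elsewhere, so a loss can be computed on the masked positions only.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog today".split()
masked_sentence, targets = mask_tokens(sentence, seed=1)
```

The `labels` list makes explicit which positions would contribute to a masked-language-model loss; all positions marked `None` would be ignored.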
The BERT loss function takes into account only the prediction of the masked tokens and ignores the prediction of the others. Thus, the model converges more slowly than directional models, but it makes up for this with increased context awareness.
In another training mode of the BERT model, we pass two sentences to it and learn if the second is indeed the subsequent sentence in the original document. The dataset is curated such that one-half of such second sentences are genuine subsequent sentences and the other half are not.
The original large BERT model has 340 million parameters, while the smaller base model has 110 million parameters. Such large models are difficult to deploy on the edge, as edge devices provide limited computational resources. If we wish to use NLP models on a smartphone or in a car, we need a smaller version of these models that is still functional.
The DistilBERT model is a distilled version of the BERT model (Sanh et al., 2019) which reduces the number of layers by a factor of 2, making it 40% smaller than the original BERT model. To train the smaller DistilBERT model, a student-teacher training is applied. The distillation method to compress the knowledge of the larger model into the smaller one begins with the observation that the large model is typically trained and evaluated against one-hot targets, so its output probability distribution has a large value for the correct word and a near-zero probability for every other element. Some of these near-zero elements are larger than others, and thus this part of the distribution has structure. Indeed, it is this structure that represents the generalization ability of the model; it is known as dark knowledge.
The full model is trained on the one-hot encoding, i.e. the hard targets. The distilled model is trained on the probability distribution of the teacher model, and specifically that part of the distribution that is above a certain minimum cut-off, i.e. the soft targets. The training signal is thus a much richer information source than the original one-hot encoding (Hinton et al., 2014; Bucila et al., 2006). In order to leverage the generalization ability of the BERT teacher model, the original loss function is extended to capture the uncertainty of the teacher. In this way, DistilBERT retains about 97% of the base BERT model’s accuracy while having 40% fewer parameters.
Besides the usual cross-entropy term on the supervised one-hot encodings of the target token, the loss function for DistilBERT comprises two additional penalty terms. First, a cross-entropy loss $L_{ce}$ is applied between the output probabilities of the teacher and those of the student, here denoted as $t_i$ and $s_i$, respectively:

$$L_{ce} = -\sum_i t_i \log(s_i)$$

A softmax temperature is applied to both output distributions to improve convergence. Second, a cosine embedding loss is applied between the teacher’s and the student’s hidden states. The final loss is computed by linearly combining the three loss functions.
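As a rough illustration of how the three terms fit together, the following pure-Python sketch computes a temperature-scaled soft-target cross-entropy, a hard-target cross-entropy, and a cosine loss, and combines them linearly. The function names, the temperature value, and the weights `alpha_ce`, `alpha_hard` and `alpha_cos` are hypothetical choices for this example; they are not taken from the DistilBERT training code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_cross_entropy(teacher_logits, student_logits, temperature):
    """Cross-entropy between teacher (t_i) and student (s_i) distributions."""
    t = softmax_with_temperature(teacher_logits, temperature)
    s = softmax_with_temperature(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))


def hard_target_cross_entropy(student_logits, target_index):
    """Usual cross-entropy against the one-hot target."""
    s = softmax_with_temperature(student_logits, 1.0)
    return -math.log(s[target_index])


def cosine_embedding_loss(teacher_hidden, student_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(teacher_hidden, student_hidden))
    norm_t = math.sqrt(sum(a * a for a in teacher_hidden))
    norm_s = math.sqrt(sum(b * b for b in student_hidden))
    return 1.0 - dot / (norm_t * norm_s)


def distillation_loss(teacher_logits, student_logits, target_index,
                      teacher_hidden, student_hidden, temperature=2.0,
                      alpha_ce=1.0, alpha_hard=1.0, alpha_cos=1.0):
    """Linear combination of the three terms (weights are assumptions)."""
    return (alpha_ce * soft_target_cross_entropy(
                teacher_logits, student_logits, temperature)
            + alpha_hard * hard_target_cross_entropy(
                student_logits, target_index)
            + alpha_cos * cosine_embedding_loss(
                teacher_hidden, student_hidden))
```

A higher temperature flattens both distributions, which gives the small "dark knowledge" probabilities more weight in the soft-target term.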
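To measure the accuracy of causal language models, the model perplexity is often used in the literature. Given a tokenized sequence $W = (w_0, w_1, \dots, w_t)$, the perplexity of $W$ is

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{t+1}\sum_{i=0}^{t}\log P(w_i \mid w_0, \dots, w_{i-1})\right).$$

For masked language models, the pseudo-perplexity (Salazar et al., 2019) is used instead, in which each token is masked in turn and predicted from all remaining tokens $W_{\setminus i}$ of the sequence:

$$\mathrm{PPPL} = \exp\left(-\frac{1}{N}\sum_{W}\sum_{w_i \in W}\log P(w_i \mid W_{\setminus i})\right),$$

where the outer sum runs over all sequences $W$ in the corpus and $N$ is the number of tokens in the corpus. Note that in order to compute the pseudo-perplexity for a sequence $W$, the model has to be executed $|W|$ times, making its computation impractical for larger datasets.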
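The cost difference between the two metrics can be made concrete with a small sketch. Here `masked_log_prob` is a hypothetical stand-in for one forward pass of a masked language model that returns the log-probability of the token at position `i` given the rest of the sequence; it is an assumption of this example and not an API of any library.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities of one sequence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pseudo_perplexity(sentences, masked_log_prob):
    """Pseudo-perplexity over a corpus of tokenized sentences.

    masked_log_prob(tokens, i) is called once per token, so a sentence
    with |W| tokens costs |W| model executions.
    """
    total_log_prob = 0.0
    n_tokens = 0
    for tokens in sentences:
        for i in range(len(tokens)):
            total_log_prob += masked_log_prob(tokens, i)
            n_tokens += 1
    return math.exp(-total_log_prob / n_tokens)
```

For a causal model, one pass over the sequence yields all the log-probabilities that `perplexity` needs, whereas `pseudo_perplexity` needs one call per token. This is why the pseudo-perplexity becomes expensive on large corpora.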
2. Distributed Training
In training any model, we wish to reduce the total training time by splitting the training across a network of multiple computers, each having multiple GPUs. As long as the model fits into the memory of the processing unit, the data-parallel technique can be applied. Here, the dataset is distributed among several workers (usually one worker per GPU), and in each iteration every worker performs the forward and backward pass on its own batch of distinct samples from the dataset. After computing the gradient on each worker individually, the workers share their local gradients with the other workers in an allreduce operation. Subsequently, the shared gradients are averaged and applied to the model weights.
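A toy version of one data-parallel step is sketched below for a one-parameter linear model with a squared-error loss. The model, the helper names and the learning rate are assumptions chosen to keep the example short; the point is only that averaging the per-worker gradients over equally sized shards gives the same update as computing the gradient on the combined batch.

```python
def gradient(w, batch):
    """Gradient of mean((w * x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def allreduce_mean(values):
    """Stand-in for an allreduce that averages one value per worker."""
    return sum(values) / len(values)


def data_parallel_step(w, shards, learning_rate):
    """One synchronous update: local gradients, allreduce, weight update."""
    local_gradients = [gradient(w, shard) for shard in shards]  # per worker
    averaged = allreduce_mean(local_gradients)
    return w - learning_rate * averaged


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # two workers, two samples each
w_next = data_parallel_step(0.0, shards, learning_rate=0.01)
```

Because every worker applies the same averaged gradient, all model replicas stay identical after each step.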
In order to minimize the network overhead during the synchronization, a ring topology can be utilized. This enables each worker to share results only with its neighboring node in the ring. By repeatedly sharing only the partial sums of the gradients, the communication between the nodes is minimized. In our experiments, the Horovod framework (Sergeev and Del Balso, 2018) was used to perform the synchronized update during the training. All our experiments were performed on the Brightics AI Accelerator platform deployed on an AWS cluster. The utilization of the AI Accelerator platform facilitates implementing and executing distributed training as it automatically manages the provisioning of worker nodes and initializes training execution.
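The ring idea can be illustrated with a single-process simulation. This is a sketch of the textbook ring allreduce (a reduce-scatter phase followed by an allgather phase), not Horovod's implementation, which relies on optimized communication back ends. The data layout, with one list of chunks per worker, is an assumption of this example.

```python
def ring_allreduce(worker_chunks):
    """Simulate a ring allreduce (sum) over n workers.

    worker_chunks[r][k] is chunk k (a list of numbers) held by worker r;
    every worker holds n chunks. Returns the state after the operation,
    in which every worker holds the element-wise sum of all workers.
    """
    n = len(worker_chunks)
    data = [[list(chunk) for chunk in chunks] for chunks in worker_chunks]

    # Phase 1, reduce-scatter: in each step every worker sends one chunk
    # to its right-hand neighbor, which adds it to its own copy.
    for step in range(n - 1):
        messages = [(r, (r - step) % n, list(data[r][(r - step) % n]))
                    for r in range(n)]
        for sender, k, payload in messages:
            receiver = (sender + 1) % n
            data[receiver][k] = [a + b
                                 for a, b in zip(data[receiver][k], payload)]

    # Phase 2, allgather: the fully reduced chunks travel around the ring
    # and overwrite the stale copies.
    for step in range(n - 1):
        messages = [(r, (r + 1 - step) % n, list(data[r][(r + 1 - step) % n]))
                    for r in range(n)]
        for sender, k, payload in messages:
            receiver = (sender + 1) % n
            data[receiver][k] = payload

    return data
```

Each worker sends 2(n - 1) chunks in total and only ever talks to one neighbor, so the amount of data sent per worker stays roughly constant as the ring grows.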
3. Methodology and Results
For the purposes of this study, a DistilBERT model is trained on the Gutenberg dataset for three epochs and subsequently tested on the Wikitext-2 (Merity et al., 2016) corpus as well as the IMDB (Maas et al., 2011) sentiment analysis dataset. The former dataset has been extracted from a selection of Wikipedia articles and comprises 2M tokens, while the latter comprises close to 8M tokens.
As a baseline, we use the PyTorch-based distillation example from the transformers library provided by the HuggingFace group (Wolf et al., 2019). In all cases, the baseline is executed on a single computer with eight V100 GPUs. To fill the GPU memory, a batch size of 9 is selected. The effective batch size is increased by accumulating the gradients over 50 iterations. In total, the baseline implementation takes 438 minutes to execute three training epochs. In attempting to improve the training speed, the following approaches were pursued.
- Increase the video memory of each GPU from 16GB to 32GB.
- Modify the gradient accumulation technique to delay gradient synchronization between workers.
- Use the Horovod framework to scale beyond a single computer to several computers with eight GPUs each.
- Use more than one Horovod ring during the allreduce operation.
For the baseline training implementation, doubling the GPU memory and using a batch size of 16 rather than 9 decreased the training duration to 402 minutes. Delaying the gradient synchronization between the workers so that it is performed only once after accumulating the gradients of 50 batches further reduced the training time to 204 minutes. By using Horovod (Hvd) to scale the training from one to four compute nodes with eight GPUs each, the training time is reduced to 70 minutes. Note that the delayed synchronization is the default behavior for the Horovod library. Utilizing multiple rings for the allreduce operation further decreased the training duration by four minutes. Finally, scaling the training to eight nodes of eight GPUs each results in finishing the training within 32 minutes, where the multi-ring allreduce gives a performance benefit of five minutes. The timing results are summarized in Table 1. We note that the original training run is #1 in Table 1.
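The effect of delaying the synchronization can be sketched with a small simulation. Scalar "gradients" and the function names below are assumptions for illustration only; in a real training script this behavior would be provided by the distributed training library rather than written by hand. Because averaging is linear, synchronizing once after accumulation yields the same accumulated gradient as synchronizing after every batch, with far fewer communication rounds.

```python
def allreduce_mean(values):
    """Stand-in for an allreduce that averages one value per worker."""
    return sum(values) / len(values)


def accumulate_sync_every_batch(per_worker_grads):
    """Synchronize after each micro-batch, then accumulate."""
    n_batches = len(per_worker_grads[0])
    accumulated, sync_calls = 0.0, 0
    for b in range(n_batches):
        accumulated += allreduce_mean([g[b] for g in per_worker_grads])
        sync_calls += 1
    return accumulated, sync_calls


def accumulate_sync_once(per_worker_grads):
    """Accumulate locally on every worker, then synchronize a single time."""
    local_sums = [sum(g) for g in per_worker_grads]
    return allreduce_mean(local_sums), 1


# 8 workers, 50 accumulation steps, synthetic gradient values
grads = [[(w + 1) * 0.01 * b for b in range(50)] for w in range(8)]
eager_result, eager_syncs = accumulate_sync_every_batch(grads)
delayed_result, delayed_syncs = accumulate_sync_once(grads)
```

With 50 accumulation steps, the delayed variant performs one allreduce per weight update instead of 50, which is consistent with the large drop in training time reported above.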
These results are displayed graphically in Figure 1. The bottom curve is the actual performance from Table 1. The top curve is the performance one would expect to see if the performance were linear with respect to the baseline of line #2 in Table 1. By choosing #2 as our baseline, the further speed improvements are purely software based.
Due to the changes made to increase the training speed, one may ask how the accuracy of the model training is affected. Table 2 displays the effect on training loss, while Table 3 and Table 4 summarize the development of the perplexity and the pseudo-perplexity during training. In order to make the computation of the pseudo-perplexity feasible on the larger IMDB dataset, a randomized subset of 10% was utilized.
Figure 1: Performance scales super-linearly.
From Tables 2 to 4, we see that the changed approach does not negatively affect the quality of the model. As expected, we can also see in Table 3 and Table 4 that the perplexity of the masked language model is much lower than the pseudo-perplexity due to the reasons discussed in Section 1. The improved loss and (pseudo-)perplexity of the multi-node training with respect to the baseline are presumably due to the larger effective batch size. This means that the training speed increase shown in Table 1 is genuine.
The timing results of Table 1 further show that supplying the model with double the video memory provided a speed-up factor of 1.09 over the baseline, while changing the gradient accumulation technique is worth a speed-up factor of 2.0, all other factors being the same. Using distributed training and moving from 8 to 32 GPUs (with each compute node having eight GPUs) provides a speed-up factor of about 2.9 with a single ring (204 to 70 minutes), or about 3.1 when the multi-ring allreduce is included (204 to 66 minutes). Going from one to multiple rings during the allreduce operation provides a speed-up of 1.06x for 32 GPUs and 1.16x for 64 GPUs. So, all of these factors provide a positive training result. Combining all these techniques speeds up training by a factor of 13.69x in moving from 8 to 64 GPUs.
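The quoted speed-up factors can be recomputed from the training durations stated in the text. In the sketch below, the dictionary keys are labels invented for this example, and two of the durations (66 and 37 minutes) are derived from the stated multi-ring gains of four and five minutes rather than quoted directly.

```python
# Training durations in minutes for three epochs, as stated in the text.
durations_min = {
    "8gpu_16gb_baseline": 438,
    "8gpu_32gb": 402,
    "8gpu_32gb_delayed_sync": 204,
    "32gpu_single_ring": 70,
    "32gpu_multi_ring": 70 - 4,   # derived: multi-ring saves four minutes
    "64gpu_single_ring": 32 + 5,  # derived: multi-ring saves five minutes
    "64gpu_multi_ring": 32,
}


def speedup(before, after):
    """Ratio of training durations between two configurations."""
    return durations_min[before] / durations_min[after]


def linear_expectation(reference, reference_gpus, target_gpus):
    """Duration expected if time scaled inversely with the GPU count."""
    return durations_min[reference] * reference_gpus / target_gpus


memory_gain = speedup("8gpu_16gb_baseline", "8gpu_32gb")
delayed_sync_gain = speedup("8gpu_32gb", "8gpu_32gb_delayed_sync")
ring_gain_32 = speedup("32gpu_single_ring", "32gpu_multi_ring")
ring_gain_64 = speedup("64gpu_single_ring", "64gpu_multi_ring")
overall_gain = speedup("8gpu_16gb_baseline", "64gpu_multi_ring")
linear_64gpu_min = linear_expectation("8gpu_32gb", 8, 64)
```

Relative to the 402-minute run on 8 GPUs, linear scaling to 64 GPUs would give about 50 minutes, so the measured 32 minutes is faster than linear.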
We have demonstrated that using 64 instead of 8 GPUs to train a DistilBERT model (a factor of 8 in resource usage) leads to a reduction in total training time by a factor of 13.69. This super-linear behavior is explained by the combination of a slightly different training technique (delayed synchronization during gradient accumulation), different hardware technology (doubling of video memory), and different software technology (distributed training). This performance improvement does not affect the accuracy of the model as measured by loss and pseudo-perplexity and thus is a genuine speed improvement.
Finally, we conclude that distributed training demonstrably reduces the training time for a DistilBERT model.
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2.
Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017): Attention Is All You Need. arXiv:1706.03762.
Sanh, Victor; Debut, Lysandre; Chaumond, Julien; Wolf, Thomas (2019): DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2014): Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning Workshop. arXiv:1503.02531.
Bucila, Cristian; Caruana, Rich; Niculescu-Mizil, Alexandru (2006): Model Compression. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pages 535-541.
Huyen, Chip (2019): Evaluation Metrics for Language Modeling. The Gradient.
Wang, Chenguang; Li, Mu; Smola, Alexander J. (2019): Language Models with Transformers. arXiv:1904.09408v2.
Salazar, Julian; Liang, Davis; Nguyen, Toan Q.; Kirchhoff, Katrin (2019): Masked Language Model Scoring. arXiv:1910.14659.
Sergeev, Alexander; Del Balso, Mike (2018): Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799.
Brightics AI Accelerator is a commercial product of Samsung SDS America. Further details are available at: http://xcelerator.ai
Amazon Web Services (AWS) is a commercial cloud computing platform. Further details are available at: https://aws.amazon.com
Merity, Stephen; Xiong, Caiming; Bradbury, James; Socher, Richard (2016): Pointer Sentinel Mixture Models. arXiv:1609.07843.
Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher (2011): Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Rémi; Funtowicz, Morgan; Brew, Jamie (2019): HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771.
V100 is a particular model of a GPU produced by NVIDIA. Further details are available at: https://www.nvidia.com/en-us/data-center/v100/
In order to set up a distributed training effort similar to the one used in the DistilBERT benchmark above, one would have had to do a significant amount of work on various IT issues with many software tools to make it possible. This would detract from the artificial intelligence work that is at the heart of the task.
We used the Brightics AI Accelerator to do the work. This distributed training environment performs an automated setup of the GPU cluster, avoiding manual provisioning of virtual machines, network and network file storage (NFS), as well as operating system and other software and package installations. This shortens a manual task that would otherwise take a few weeks to a few minutes.
The integrated Jupyter notebook and terminal interfaces with access to NFS were very useful in setting up code and data for the task. Programming can be done in TensorFlow, Keras, or PyTorch interchangeably.
By using the AutoDL feature, the user’s responsibility is reduced to providing the code for data loading as well as the model to be trained. All additional burden, usually involved in multi-machine training, is lifted from the user by this feature.
We provided the input function (the data set), the model function (the neural network), and several calls to launch the job. That is all.
Notably, we did not have to write any cluster programming code (Horovod) or code to pause and resume the cluster for automatic evaluation on the validation set. Neither did we have to write any hooks or callback routines. We benefitted from the worker-local cache, with pre-training bash jobs that move data to local NVM storage. Monitoring job progress was easy using the web-based UI; ssh was not needed to tail logs. The built-in TensorBoard was instrumental for monitoring the progress of the task.
About Brightics AI Accelerator
Recent research from Salesforce indicates that 83% of IT leaders say that AI and ML are transforming customer engagement, and 69% say that they are transforming their business. Still, a fundamental barrier to AI progress is the time it takes to train an AI model. Most AI teams simply accept the weeks and months it takes to train a model on their desktops and migrate it to large, distributed GPU clusters.
To eliminate these barriers, the Samsung SDSA AI Team developed Brightics AI Accelerator. With only a few lines of code activated through a single Jupyter Notebook UI or PyCharm IDE, Brightics AI Accelerator enables professional machine learning scientists and engineers to save time and deliver value up to hundreds of times faster by automating distributed AI network training. Companies that adopt Brightics AI Accelerator are also able to reduce cost by making more efficient use of their existing IT, DevOps, machine learning, and deep learning resources.
In addition to producing unmatched benchmark results, the Brightics AI Accelerator delivers an unrivaled AutoPilot user experience in terms of push-button ease-of-use, speed, and efficiency.
Easy-to-Use: Through automated cluster setup of VMs, network and NFS in addition to OS/CUDA and package installations, the AI accelerator makes it easy to install and run with the push of a button. Integrated Jupyter Notebook and Terminal interfaces with access to NFS are extremely useful for setting up code and data, and users can orchestrate large job setup, data preparation, training, inference and tear-down with just one Python API call. They no longer waste time configuring ssh tunnels to tail logs per worker. It is all done automatically, and the web-based UI simplifies monitoring job progress through intuitive, built-in TensorBoard data visualization.
Fast: Companies can shrink the time needed for data preparation, training, and optimization by up to hundreds of times while allowing Machine Learning and Deep Learning Scientists to add value without DevOps, IT or distributed clustering expertise. With the AI Accelerator, deep learning model training can be lifted and shifted from the desktop to automated, distributed clusters of up to 512 GPUs to produce a model in 1 hour versus 3 weeks, roughly dividing the training time by the number of GPUs.
Efficient: Under the hood, data-parallel, synchronous Horovod Ring-All-Reduce distributed training scales data point throughput near linearly with an increasing number of GPUs connected via high-speed interconnect, so that models converge quickly to state-of-the-art accuracy. The Automated Deep Learning (AutoDL) feature eliminates the need to write any code to program the cluster; run, pause or resume jobs; or evaluate validation data. The only code required is that of an input function, a model estimator function and only a few common API calls to launch a training job … that’s it! The worker-local cache, with pre-training bash jobs that move data to local NVM storage, accelerates data throughput into model training over and above that provided by data-parallel, distributed Horovod stochastic gradient descent (SGD).