
Training Large NLP Models Efficiently with DeepSpeed and Hugging Face

Tanul Singh

With the recent advancements in NLP, we are moving towards solving increasingly sophisticated problems such as open-domain question answering, empathy in dialogue systems, and multi-modal problems. Along with this, the number of parameters in the models has also been rising, into the billions and, for the largest Megatron models, up to a trillion.

Large models usually outperform smaller ones, but training and deploying them requires high-end hardware, which can be very costly. So is there a way to train large NLP models on limited hardware and on a budget? This is where DeepSpeed comes in. It is a deep learning optimization library that makes distributed training easy, efficient, and effective, and it aims to make extreme-scale model training available to everyone. According to the DeepSpeed project, its ZeRO-Offload feature can train models with over 10B parameters on a single GPU, 10x bigger than the previous state of the art.

In this article, we will learn how to use the DeepSpeed library effectively with a single GPU and how to integrate it with the Hugging Face Trainer API.

When running on a single GPU, DeepSpeed helps in the following ways:

  1. Its ZeRO-Offload feature can delegate some computation and memory to the host's CPU and RAM, leaving more GPU resources for the model's needs, for example a larger batch size or a very big model that normally would not fit.
  2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit bigger models and data batches.

For demonstration, we will use data from the ongoing Feedback Prize competition on Kaggle. We will train Longformer Base for token classification with a batch size of 10 on a 16GB card, which would not have been possible otherwise. DeepSpeed can also fit Longformer Large with a batch size of 4, but since Kaggle provides only 16GB of CPU memory, I was not able to demonstrate that there. I was able to train Longformer Large with a batch size of 6 on a 24GB RTX 6000 card using DeepSpeed and the HF Trainer. The DeepSpeed configuration is defined as a Python dictionary:

ds_config_dict = {
    # Mixed precision; "auto" defers to the matching TrainingArguments value
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    # ZeRO stage 2 with the optimizer states offloaded to CPU memory
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False
}

Note: Since both the DeepSpeed configuration and TrainingArguments are passed to the Trainer, shared settings such as the optimizer betas, learning rate, and scheduler parameters must agree in both. To avoid conflicts, we can set them to "auto", in which case the Trainer fills in the value from the corresponding TrainingArguments setting. For more information, please read here.
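To make the role of "auto" concrete, here is a minimal sketch of the idea. It is not the Trainer's actual implementation: `resolve_auto` and the `training_values` mapping are hypothetical names used only for illustration. The sketch walks a nested config and replaces each "auto" with a value supplied from the training-arguments side, so that every shared setting is defined in exactly one place.

```python
# Illustrative only: mimics the idea of filling "auto" placeholders.
# resolve_auto and training_values are hypothetical, not part of any library.

def resolve_auto(config, training_values, prefix=""):
    """Return a copy of config with every "auto" replaced from training_values.

    training_values maps dotted key paths (e.g. "optimizer.params.lr")
    to the value that should be used for that entry.
    """
    resolved = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path
            resolved[key] = resolve_auto(value, training_values, prefix=f"{path}.")
        elif value == "auto":
            if path not in training_values:
                raise KeyError(f"no value supplied for 'auto' entry: {path}")
            resolved[key] = training_values[path]
        else:
            # Explicit values are kept untouched
            resolved[key] = value
    return resolved


config = {
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
}
training_values = {
    "optimizer.params.lr": 3e-5,
    "optimizer.params.weight_decay": 0.01,
    "gradient_clipping": 1.0,
}

resolved = resolve_auto(config, training_values)
print(resolved)
```

Explicit values such as `steps_per_print` pass through unchanged, and the original `config` is left unmodified.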

Every feature in DeepSpeed is mapped to a key in the config dict. For example, the fp16 key enables or disables mixed precision. We can use optimizer and scheduler implementations from either HF or DeepSpeed (DeepSpeed has optimizers such as 1-bit Adam and 1-bit LAMB, which can run on GPUs/CPUs with minimal communication between different partitions).

The zero_optimization section of the configuration is the most important part, since that is where you define which ZeRO stages to enable and how to configure them. There are three stages to choose from: Stage 1 partitions only the optimizer states, Stage 2 partitions the optimizer states and gradients, and Stage 3 partitions the optimizer states, gradients, and parameters.
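The effect of the three stages can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes mixed-precision training with Adam and uses the per-parameter byte counts from the ZeRO paper's accounting (2 bytes for fp16 weights, 2 for fp16 gradients, 12 for the optimizer states). The function name `model_state_bytes_per_gpu` is hypothetical, and activations, temporary buffers, and fragmentation are ignored, so treat the numbers as rough estimates only.

```python
# Rough per-GPU memory for model states under mixed-precision Adam.
# Byte counts follow the ZeRO paper's accounting; activations, buffers,
# and fragmentation are deliberately ignored.

FP16_PARAM_BYTES = 2        # fp16 copy of each weight
FP16_GRAD_BYTES = 2         # fp16 gradient of each weight
OPTIMIZER_STATE_BYTES = 12  # fp32 weight copy + Adam momentum + Adam variance


def model_state_bytes_per_gpu(num_params, stage, num_gpus):
    """Estimate model-state bytes held by one GPU for ZeRO stage 0 to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = FP16_PARAM_BYTES * num_params
    grads = FP16_GRAD_BYTES * num_params
    opt_states = OPTIMIZER_STATE_BYTES * num_params
    if stage >= 1:
        opt_states /= num_gpus  # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus       # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus      # Stage 3: also partition parameters
    return params + grads + opt_states


for stage in range(4):
    gb = model_state_bytes_per_gpu(7_500_000_000, stage, num_gpus=64) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For 7.5B parameters on 64 GPUs this gives roughly 120, 31.4, 16.6, and 1.9 GB for stages 0 through 3. With `num_gpus=1` every stage returns the same number, which suggests that on a single GPU the savings come from offloading to the CPU rather than from partitioning itself.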

If you are getting OOMs and have unused CPU memory (RAM), you can use Stage 2 to offload the optimizer states to the CPU; it also lets you configure the communication bucket sizes. If you are still getting OOMs and still have spare CPU memory, you can use Stage 3 to offload the parameters as well as the optimizer states to the CPU, thereby freeing more GPU memory. Please note that DeepSpeed offloads only the parameters that are not needed at the current step and gathers them back when they are required. For more information on the zero_optimization parameters, please refer to the docs.
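As a rough illustration of the Stage 3 option, the sketch below swaps a Stage 3 section with both optimizer and parameter offload into an existing config. The `offload_optimizer` and `offload_param` keys are DeepSpeed configuration keys; the helper `with_zero_section` and the particular values chosen are assumptions for illustration, not tuned recommendations.

```python
# Sketch of a ZeRO Stage 3 section that offloads both the optimizer
# states and the parameters to CPU memory. Values are illustrative.
zero3_section = {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "offload_param": {"device": "cpu", "pin_memory": True},
    "overlap_comm": True,
    "contiguous_gradients": True,
}


def with_zero_section(base_config, zero_section):
    """Return a copy of base_config with its zero_optimization replaced.

    Hypothetical helper: the original config dict is left untouched.
    """
    new_config = dict(base_config)
    new_config["zero_optimization"] = dict(zero_section)
    return new_config


# A trimmed-down Stage 2 config standing in for the full one
stage2_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": "auto",
}

stage3_config = with_zero_section(stage2_config, zero3_section)
print(stage3_config["zero_optimization"]["stage"])
```

Keeping the Stage 2 and Stage 3 variants as separate dictionaries makes it easy to fall back to Stage 2 if the extra CPU traffic slows training too much.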

Once the configuration is finalized, all you need to do is pass this dictionary to TrainingArguments, as follows:

TrainingArguments(..., deepspeed=ds_config_dict)
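The `deepspeed` argument also accepts a path to a JSON config file instead of a dictionary. The sketch below writes a small config to disk and reads it back. The file name `ds_config.json`, the temporary directory, and the trimmed-down config are assumptions for illustration, and the `TrainingArguments` call is left as a comment because it needs transformers and deepspeed installed.

```python
import json
import os
import tempfile

# Trimmed-down config for illustration; the full dictionary works the same way
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
}

# Write the config as JSON; Python's True is serialized as JSON true
config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)

# Assumed usage (requires transformers and deepspeed; not executed here):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", deepspeed=config_path)

# Reading the file back gives an equivalent dictionary
with open(config_path) as f:
    reloaded = json.load(f)
print(reloaded == ds_config)
```

A JSON file is convenient when the same configuration is shared between notebooks and scripts, while the dictionary form is easier to tweak programmatically.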

To fully understand the end-to-end workflow, see the accompanying Kaggle notebook, which shows an example of running Longformer Large for token classification on the Feedback Prize competition. Please note that this example uses only the HF Trainer; we will be back with another example explaining how to integrate DeepSpeed with a native PyTorch pipeline.