
Using Hugging Face Accelerate to train on multiple GPUs

· 3 min read
Vishnu Subramanian

In the last few years, I have used plain PyTorch as well as various frameworks that abstract away the boilerplate required to write PyTorch code. In 2021, I came across a tweet announcing the launch of Accelerate, a library that removes that boilerplate while keeping all the flexibility PyTorch offers. In this post, I show a simple solution written in pure PyTorch for predicting how popular a particular pet is. The focus is on reducing boilerplate such as pushing tensors to the GPU, taking advantage of mixed precision, and training on multiple GPUs. Since the objective is to demonstrate how to use Accelerate, we will not walk through the full code, which would otherwise be important. The code is available on GitHub.

Key steps in a PyTorch pipeline

Let’s take a look at the training_step function below, a simplified version of what we would actually use. This code gets more complex as we add features such as:

  1. Mixed precision
  2. Multiple GPUs
from fastprogress.fastprogress import progress_bar

# device is defined elsewhere, e.g. torch.device("cuda")
def training_step(epoch, train_dl, model, optimizer, loss_fn, mb):
    model.train()
    model = model.to(device)
    for batch_idx, (xb, yb) in enumerate(progress_bar(train_dl, parent=mb)):
        # Manually move every batch to the GPU
        xb = xb.to(device)
        yb = yb.to(device)
        output = model(xb).view(-1)
        loss = loss_fn(output, yb)
        loss.backward()
        optimizer.step()
        # Zero the gradients; setting them to None is slightly cheaper
        # than optimizer.zero_grad()
        for param in model.parameters():
            param.grad = None
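
To make the pain concrete, here is a sketch (not from the original post; training_step_amp is a hypothetical name) of the same step once native mixed precision is added with torch.cuda.amp. Supporting multiple GPUs with DistributedDataParallel would require yet more setup on top of this:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

def training_step_amp(epoch, train_dl, model, optimizer, loss_fn, mb):
    model.train()
    model = model.to(device)
    for batch_idx, (xb, yb) in enumerate(progress_bar(train_dl, parent=mb)):
        xb = xb.to(device)
        yb = yb.to(device)
        with autocast():  # run the forward pass in mixed precision
            output = model(xb).view(-1)
            loss = loss_fn(output, yb)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales gradients, then steps
        scaler.update()                # adjusts the scale factor
        for param in model.parameters():
            param.grad = None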

Basic PyTorch pipeline with Accelerate

With a few changes to the code, we can use Accelerate to run our PyTorch code with mixed precision and on multiple GPUs. The key changes are:

  1. Initialise the Accelerator:
   accelerate = Accelerator()
  2. Prepare the model, optimizer, and dataloader so Accelerate can place them on the right device(s):
   model, optimizer, train_dl = accelerate.prepare(model, optimizer, train_dl)
  3. Modify the training step:

     a. Remove any code that manually places tensors on a particular device (GPU).

     b. Instead of calling backward on the loss, call accelerate.backward(loss).

     The updated function is shown below, followed by a full end-to-end sketch.

def training_step(epoch, train_dl, model, optimizer, loss_fn, mb):
    model.train()
    # No .to(device) calls: accelerate.prepare() already placed
    # the model and the batches on the right device(s)
    for batch_idx, (xb, yb) in enumerate(progress_bar(train_dl, parent=mb)):
        output = model(xb).view(-1)
        loss = loss_fn(output, yb)
        # accelerate.backward() also handles loss scaling under fp16
        accelerate.backward(loss)
        optimizer.step()
        for param in model.parameters():
            param.grad = None
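
Putting steps 1 to 3 together, a minimal end-to-end sketch could look like the following. The model, data, loss function, and epoch count are placeholders, and training_step is the function defined above:

from accelerate import Accelerator
from fastprogress.fastprogress import master_bar

accelerate = Accelerator()  # reads the settings saved by accelerate config

# model, optimizer, train_dl, and loss_fn are created exactly as in the
# plain PyTorch version; prepare() wraps them for the current process
model, optimizer, train_dl = accelerate.prepare(model, optimizer, train_dl)

epochs = 5  # placeholder
mb = master_bar(range(epochs))
for epoch in mb:
    training_step(epoch, train_dl, model, optimizer, loss_fn, mb)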
  4. Configure Accelerate and run the program. This typically only needs to be done once per machine. From the terminal, run accelerate config and answer the questions below:

    • In which compute environment are you running? ((0) This machine, (1) AWS (Amazon SageMaker)): 0
    • Which type of machine are you using? ((0) No distributed training, (1) multi-CPU, (2) multi-GPU, (3) TPU): 2
    • How many different machines will you use (use more than 1 for multi-node training)? (1): 1
    • Do you want to use DeepSpeed? (yes/NO): NO
    • How many processes in total will you use? (1): 2
    • Do you wish to use FP16 (mixed precision)? (yes/NO): yes
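
As a reference point, accelerate config saves these answers to a YAML file in your Hugging Face cache directory (typically ~/.cache/huggingface/accelerate/default_config.yaml). The exact fields vary across Accelerate versions, but for the answers above the saved config looks roughly like this:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2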

I ran the code on a Jarvislabs.ai instance with 2 GPUs. Once you have answered these questions, you can run the program with:

accelerate launch accelerate_classifier.py
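
If you want to sanity-check the setup before launching a real job, the CLI also ships accelerate test, which runs a short dummy training script against your saved configuration (check accelerate --help for what your installed version supports).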

Conclusion

To my knowledge, this is one of the easiest ways to convert your PyTorch code to take advantage of advanced features like mixed precision and multiple GPUs. I hope you enjoyed reading it.