February 7, 2024

Using Hugging Face Accelerate to train on multiple GPUs.

PyTorch is a simple and stable framework for building deep learning solutions. Its simplicity offers freedom and control over the complete code.

While using PyTorch is fun 😁, sometimes we have to do a lot of things manually, such as:

  • Placing the model on a device
  • Placing tensors on a device
  • Using fp16/fp32/bf16 precision
  • Using different accelerators like Nvidia GPUs, Google TPUs, Graphcore IPUs, and AMD GPUs
  • Using optimization libraries like DeepSpeed from Microsoft
  • Using FullyShardedDataParallel
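To make the first two items concrete, here is a minimal plain-PyTorch sketch of the device bookkeeping we are responsible for (the tiny model and tensors are made up purely for illustration):

```python
import torch
import torch.nn as nn

# In plain PyTorch, picking the device and moving things onto it is on us.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 1).to(device)    # place the model on the device
xb = torch.randn(4, 8).to(device)     # place every input tensor on the device
yb = torch.randn(4).to(device)        # ...and every target tensor too

loss = nn.functional.mse_loss(model(xb).view(-1), yb)
```

Forgetting a single `.to(device)` call raises a runtime error about tensors living on different devices, which is exactly the kind of bookkeeping a higher-level library can take off our hands.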

Over the last few years, various libraries have tried to address these issues by offering different layers of abstraction. Each comes with its own pros and cons, its own learning curve, and a different level of abstraction.

What is Hugging Face Accelerate

Hugging Face Accelerate lets us keep writing plain PyTorch while it handles:

  • Training on single and multiple GPUs
  • Using different precision techniques like fp16 and bf16
  • Using optimization libraries like DeepSpeed and FullyShardedDataParallel

To take full advantage of it, we need to:

  • Set up your machine
  • Create a configuration
  • Adapt the PyTorch code to Accelerate
  • Launch using accelerate



The code used in this blog is available here on GitHub.


Set up your machine

Before we use accelerate, we need to have access to an instance with

  • GPUs (a minimum of 1)
  • All the required Nvidia libraries
  • A working installation of PyTorch

You can also easily create an instance with Jarvislabs, which comes with:

  • All the required environment set up
  • A super simple way to switch between single and multi-GPU instances
  • Access to the instance through JupyterLab, VS Code, and SSH

Check out our quick start guide here.

Once you have an instance, you can simply install 🤗 Accelerate.

pip install accelerate

Create a configuration

🤗 Accelerate uses a config file to manage how the PyTorch program runs. To create the config file, we will use

accelerate config

which will ask us a couple of questions like

  • In which compute environment are you running?
    • 0 This machine
    • 1 AWS (Amazon SageMaker)
  • Which type of machine are you using?
    • 0 No distributed training
    • 1 multi-CPU
    • 2 multi-GPU
    • 3 TPU
  • How many different machines will you use (use more than 1 for multi-node training)? [1]
  • Do you want to use DeepSpeed? [yes/NO]
  • Do you want to use FullyShardedDataParallel? [yes/NO]
  • How many GPU(s) should be used for distributed training? [1]
  • Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]

Based on the answers we provide, a configuration file is created, which the 🤗 Accelerate library will use to run the PyTorch program.

In my case, the configuration file looks like this:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
use_cpu: false

:::info The file is located at ~/.cache/huggingface/accelerate when using Jarvislabs instances. :::

Adapting PyTorch code to Accelerate

Let’s take a look at the training_step function below, which shows a simplified version of what we would use for training. This code gets complicated as we try to add more features like

  1. Mixed precision
  2. Multiple GPUs
  3. Accelerators like DeepSpeed

def training_step(epoch, train_dl, model, optimizer, loss_fn, mb):
    model = model.to(device)
    for batch_idx, (xb, yb) in enumerate(progress_bar(train_dl, parent=mb)):
        # Manually place each batch on the device
        xb = xb.to(device)
        yb = yb.to(device)
        output = model(xb).view(-1)
        loss = loss_fn(output, yb)
        loss.backward()
        optimizer.step()
        # Reset gradients for the next step
        for param in model.parameters():
            param.grad = None
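As an illustration of the first point, here is a hedged sketch of what adding mixed precision by hand looks like with torch.autocast — the model, data, and hyperparameters are made up, and on CUDA with fp16 we would additionally need a GradScaler, so each feature adds more manual bookkeeping:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

xb, yb = torch.randn(4, 8).to(device), torch.randn(4).to(device)

# bf16 autocast works on both CPU and CUDA; fp16 on CUDA would also
# require a torch.cuda.amp.GradScaler around backward/step.
with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
    output = model(xb).view(-1)
# Compute the loss in fp32 for numerical stability
loss = nn.functional.mse_loss(output.float(), yb)
loss.backward()
optimizer.step()
for param in model.parameters():
    param.grad = None
```

Multiply this by multi-GPU setup code and DeepSpeed integration, and the training loop quickly stops looking like the simple version above.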

With a few changes to the code, we can quickly use Accelerate and run our PyTorch code with mixed precision and multiple GPUs. The key changes are:

  1. Initialise the Accelerator

accelerate = Accelerator()

  2. Let Accelerate move the model, optimizer, and dataloaders to the corresponding devices

model, optimizer, train_dl = accelerate.prepare(model, optimizer, train_dl)

  3. Modify the training step:

a. Remove any code which manually places tensors on a particular device (GPU)

b. Instead of calling backward on the loss, call accelerate.backward(loss)

def training_step(epoch, train_dl, model, optimizer, loss_fn, mb):
    for batch_idx, (xb, yb) in enumerate(progress_bar(train_dl, parent=mb)):
        # No .to(device) calls needed: accelerate.prepare handled placement
        output = model(xb).view(-1)
        loss = loss_fn(output, yb)
        # accelerate handles mixed precision and gradient scaling
        accelerate.backward(loss)
        optimizer.step()
        for param in model.parameters():
            param.grad = None

Launch using accelerate

Once we have configured using accelerate config and modified the PyTorch code, we can launch our training by running the command below in a terminal.

accelerate launch accelerate_classifier.py

Next steps

We have just scratched the surface of what we can do with the 🤗 Accelerate library. There are a lot of examples available in the official GitHub repo; take a look at them for inspiration.


We have learned how Hugging Face Accelerate helps in quickly running the same PyTorch code with:

  • Single or multiple GPUs
  • Different accelerators like GPUs and TPUs
  • Different precisions like fp16 and bf16

If you are looking for affordable GPU instances to train your deep learning models, check out Jarvislabs.