nbdistributed

An introduction to nbdistributed, the plugin that powers this course

Why nbdistributed?

This course is enabled using a powerful plugin I created called nbdistributed.

Typically, when we think about distributed programming with PyTorch, we use torch.distributed (the distributed package). This works great as a script. For educational content, however, I find that notebooks are still superior: you can sit and play with them, try out various combinations of code, and explore to your heart's content.

What problems existed before?

Before the nbdistributed plugin, you needed something like accelerate’s notebook_launcher to run distributed code in Jupyter, which came with a few annoyances:

  1. You could not interact with CUDA in the notebook before calling it. If you wanted to test anything on the GPU, it had to happen inside the function that would be “executed” later.
  2. Everything that ran in the distributed program had to be “launched” via a function, e.g.:
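from accelerate import PartialState, notebook_launcher

def my_training_func(model, dataloader):
    # CUDA can only be touched inside the launched function
    device = PartialState().device
    model.to(device)
    for batch in dataloader:
        # Assumes each batch is a dict of tensors
        batch = {k: v.to(device) for k, v in batch.items()}
        out = model(**batch)

notebook_launcher(my_training_func, args=(model, dataloader), num_processes=2)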
  3. This was non-interactive. Once you hit the CUDA part, you were essentially just running a script, which defeats the purpose of being in a notebook!

How does nbdistributed deal with this?

When we initialize the plugin, it spawns the torch.distributed processes, and we tell it how many processes to start and which GPUs to use:

%load_ext nbdistributed
%dist_init --num-processes 2 --gpu-ids 3,4

With the plugin, we can choose which subset of GPUs a given cell runs on, and even (as we’ll see later) handle asynchronous distributed operations in a cell-by-cell fashion.

Errors

Errors only affect the process_group “thread”, so you can fix the problem and rerun cells without needing to restart from the beginning.

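For example, what happens if we try to print a variable that is undefined on one of the ranks? First, define t on rank 0 only:

%%rank [0]
t = torch.tensor([1,2,3]).to(device); t

Then print it from every rank. This fails on rank 1 with a NameError, since t only exists on rank 0:

print(t)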

Now we can define the variable on rank 1 only:

%%rank [1]
t = torch.tensor([1,2,3]).to(device)

And now it works!

t
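To make the recovery flow above concrete, here is a toy, pure-Python model of the idea. It is only a sketch under stated assumptions, not how nbdistributed is actually implemented: the ToyCluster class and its run method are hypothetical names invented for illustration, and each "rank" is just a dictionary in the same process rather than a real GPU worker. It shows the two properties that matter here: every rank keeps its own persistent namespace between cells, and an exception on one rank is reported without tearing down the others.

```python
class ToyCluster:
    """Hypothetical stand-in for a set of ranks, each with its own namespace."""

    def __init__(self, num_ranks):
        # One persistent namespace per rank, kept alive between "cells"
        self.namespaces = {rank: {"rank": rank} for rank in range(num_ranks)}

    def run(self, code, ranks=None):
        """Run `code` on the chosen ranks; return {rank: error string or None}."""
        ranks = list(self.namespaces) if ranks is None else ranks
        results = {}
        for rank in ranks:
            try:
                exec(code, self.namespaces[rank])
                results[rank] = None
            except Exception as exc:
                # The failure is recorded for this rank only; other ranks
                # and all previously defined variables are left untouched.
                results[rank] = f"{type(exc).__name__}: {exc}"
        return results


cluster = ToyCluster(num_ranks=2)
cluster.run("t = [1, 2, 3]", ranks=[0])   # like `%%rank [0]`: define t on rank 0
first = cluster.run("total = sum(t)")     # rank 1 fails: t is undefined there
cluster.run("t = [1, 2, 3]", ranks=[1])   # fix it on rank 1 only
second = cluster.run("total = sum(t)")    # now every rank succeeds
```

In this sketch, first records a NameError for rank 1 while rank 0 completes, and second is clean on both ranks without anything being restarted, which mirrors the rerun-to-fix workflow described in this section.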

We can then shut everything down once we’re finished:

%dist_shutdown
print("hello")
%dist_status

Check your knowledge:

  1. What is nbdistributed?

  2. How does its design compare to existing attempts at doing distributed training in notebooks?

  3. What variables are auto-populated for you when you initialize?