nbdistributed

An introduction to nbdistributed, the plugin that powers this course

Why `nbdistributed`?

This course is enabled using a powerful plugin I created called nbdistributed.

Typically, when we think about distributed programming with PyTorch, we use torch.distributed (the distributed package). This works great as a script. For educational content, however, I find that notebooks are still superior: you can sit and play with them, test various combinations of code, and experiment to your heart’s content.

What problems existed before?

Before the nbdistributed plugin, you would need to use something like accelerate’s notebook_launcher to run distributed code in Jupyter, which came with a few annoyances:

You could not interact with CUDA in the notebook before calling it. This meant that anything you wanted to test had to live inside the function that would be launched later.
Everything in the distributed program had to be “launched” via a function, e.g.:

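from accelerate import notebook_launcher

def my_training_func(model, dataloader):
    # `current_device` is a placeholder for the GPU assigned to this process
    model.to(current_device)
    for batch in dataloader:
        batch = batch.to(current_device)  # keep the moved batch
        out = model(**batch)

# `model` and `dataloader` are assumed to be defined earlier in the notebook
notebook_launcher(my_training_func, args=(model, dataloader))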

This was non-interactive. Once you hit the CUDA part, you were essentially just running a script, which throws away most of the benefit of working in a notebook!

How does `nbdistributed` deal with this?

We spawn the torch.distributed worker processes when we initialize the plugin, telling it how many processes to start and which GPUs to use:

%load_ext nbdistributed

%dist_init --num-processes 2 --gpu-ids 3,4

Using GPU IDs: [3, 4]
Starting 2 distributed workers...
✓ Successfully started 2 workers
  Rank 0 -> GPU 3
  Rank 1 -> GPU 4
Available commands:
  %%distributed - Execute code on all ranks (explicit)
  %%rank [0,n] - Execute code on specific ranks
  %sync - Synchronize all ranks
  %dist_status - Show worker status
  %dist_mode - Toggle automatic distributed mode
  %dist_shutdown - Shutdown workers

🚀 Distributed mode active: All cells will now execute on workers automatically!
   Magic commands (%, %%) will still execute locally as normal.

🐍 Below are auto-imported and special variables auto-generated into the namespace to use
  `torch`
  `dist`: `torch.distributed` import alias
  `rank` (`int`): The local rank
  `world_size` (`int`): The global world size
  `gpu_id` (`int`): The specific GPU ID assigned to this worker
  `device` (`torch.device`): The current PyTorch device object (e.g. `cuda:1`)

With the plugin, we can choose which subset of GPUs a cell runs on, and even (as we’ll see later) handle asynchronous distributed operations in a cell-by-cell fashion.
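To make the rank-to-GPU mapping in the startup output concrete, here is a minimal sketch of how the per-worker variables could be derived from the `--gpu-ids` list. This is not nbdistributed’s actual implementation: `WorkerVars` and `assign_workers` are hypothetical names, and `device` is a plain string standing in for a `torch.device` so the sketch runs without PyTorch.

```python
from dataclasses import dataclass


@dataclass
class WorkerVars:
    """Hypothetical container for the variables each worker receives."""
    rank: int
    world_size: int
    gpu_id: int
    device: str  # stands in for torch.device(f"cuda:{gpu_id}")


def assign_workers(gpu_ids):
    """Map each rank to one GPU ID, in the order the IDs were given."""
    world_size = len(gpu_ids)
    return [
        WorkerVars(rank=rank, world_size=world_size,
                   gpu_id=gpu_id, device=f"cuda:{gpu_id}")
        for rank, gpu_id in enumerate(gpu_ids)
    ]


# Mirrors `%dist_init --num-processes 2 --gpu-ids 3,4`
workers = assign_workers([3, 4])
for w in workers:
    print(f"Rank {w.rank} -> GPU {w.gpu_id} ({w.device})")
```

The point to take away is that `rank` counts workers from zero, while `gpu_id` and `device` refer to the physical GPU, so rank 0 can live on `cuda:3`.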

Errors

Errors only surface in the process_group “thread” where they occur, meaning that you can fix the problem and rerun the cell without needing to restart from the beginning.

For example, what happens if we try to print a variable that is defined on only one rank?

%%rank [0]
t = torch.tensor([1,2,3]).to(device); t


🔹 Rank 0:
  tensor([1, 2, 3], device='cuda:3')

print(t)


🔹 Rank 0:
  tensor([1, 2, 3], device='cuda:3')

❌ Error on Rank 1: name 't' is not defined
Traceback (most recent call last):
  File "/home/zach/nbdistributed/src/nbdistributed/worker.py", line 284, in _execute_code_streaming
    result = eval(compile(tree, '<string>', 'eval'), self.namespace)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
NameError: name 't' is not defined

Now we can re-define the variable on only rank 1:

%%rank [1]
t = torch.tensor([1,2,3]).to(device)

And now rerunning print(t) works on both ranks!


🔹 Rank 0:
  tensor([1, 2, 3], device='cuda:3')

🔹 Rank 1:
  tensor([1, 2, 3], device='cuda:4')
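The recovery above works because each worker keeps its own namespace alive between cells, and a failing cell is reported rather than killing the worker. The traceback shows the real worker evaluating code against `self.namespace`. The sketch below imitates that idea in plain Python, under the assumption that this is the essential mechanism: `SketchWorker` and `run_cell` are hypothetical names, not nbdistributed’s API, and a list stands in for the tensor.

```python
class SketchWorker:
    """Hypothetical stand-in for one rank's worker process."""

    def __init__(self, rank):
        self.rank = rank
        self.namespace = {"rank": rank}  # persists across cells

    def run_cell(self, code):
        """Run a cell and return (ok, message) instead of raising."""
        try:
            exec(compile(code, "<string>", "exec"), self.namespace)
            return True, "ok"
        except Exception as exc:
            # Report the error; the namespace is left intact for a retry.
            return False, f"Error on Rank {self.rank}: {exc}"


workers = [SketchWorker(0), SketchWorker(1)]

workers[0].run_cell("t = [1, 2, 3]")                  # like `%%rank [0]`
results = [w.run_cell("print(t)") for w in workers]   # runs on every rank
workers[1].run_cell("t = [1, 2, 3]")                  # like `%%rank [1]`
retry = [w.run_cell("print(t)") for w in workers]     # now succeeds everywhere
```

In this sketch the first `print(t)` fails only on rank 1, rank 0 keeps its state, and defining `t` on rank 1 is enough for the retry to pass on both ranks.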

We can then shut everything down once we’re finished:

%dist_shutdown

Shutting down distributed workers (nuclear option)...
Starting force shutdown...
Force shutdown completed
Distributed workers shutdown
📱 Normal cell execution restored

print("hello")

hello

%dist_status

No distributed workers running

Check your knowledge:

What is nbdistributed?
How does its design compare to existing attempts at doing distributed training in notebooks?
What variables are auto-populated for you when you initialize?
