Why does the initialization matter?

A notebook by Sylvain Gugger (paraphrased by me)
import torch

If we take a vector x with 512 elements and a matrix a of shape (512, 512), initialize both randomly, and then multiply x by a 100 times in a row, as if passing it through a pseudo 100-layer network, what happens?

x = torch.randn(512)
a = torch.randn(512,512)

for i in range(100):
    x = a @ x

x.mean(), x.std()
(tensor(nan), tensor(nan))

This phenomenon is called activation explosion: the activations grow with every multiplication until they overflow and become NaN. We can figure out exactly when that happens:

x = torch.randn(512)
a = torch.randn(512,512)

for i in range(100):
    x = a @ x
    if x.std() != x.std():
        break

The check

if x.std() != x.std():

works because NaN never compares equal to anything, not even itself, so this condition is True only once the standard deviation has become NaN.
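A quick way to convince yourself of that property (my own sketch, not from the original notebook):

import math

nan = float('nan')
print(nan == nan)       # False: NaN is not equal to itself
print(nan != nan)       # True: this is what the check in the loop relies on
print(math.isnan(nan))  # True: the more explicit way to write the same test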

i
27

The loop broke at i = 27: after just 28 multiplications the activations had overflowed to NaN and could no longer be tracked.
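If you want to watch the explosion as it happens, a small variation of the loop (my sketch, not part of the original notebook) prints the standard deviation at every step; with unit-variance weights it grows by roughly sqrt(512) ≈ 22.6× per multiplication until it overflows:

import torch

x = torch.randn(512)
a = torch.randn(512, 512)

for i in range(100):
    x = a @ x
    print(i, x.std().item())    # grows by roughly sqrt(512) ≈ 22.6x each step
    if torch.isnan(x.std()):    # same idea as the x.std() != x.std() trick
        break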

What happens instead if we make a extremely small, so that the activations shrink a little at each step?

x = torch.randn(512)
a = torch.randn(512,512) * 0.01
for i in range(100):
    x = a @ x
    
x.mean(), x.std()
(tensor(0.), tensor(0.))

This time all of our activations vanished to zero. Either way, the initialization matters immensely: it is what gives training a decent starting point.
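One way to see just how much it matters: if we scale the random weights by 1/sqrt(512) (the input dimension), each multiplication roughly preserves the standard deviation instead of blowing it up or shrinking it. This is my own sketch of that idea, not part of the text above:

import math
import torch

x = torch.randn(512)
a = torch.randn(512, 512) / math.sqrt(512)   # entry variance ~1/512 keeps the output std near the input std

for i in range(100):
    x = a @ x

print(x.mean().item(), x.std().item())       # typically stays at a moderate scale, neither nan nor 0

This scale-by-1/sqrt(fan_in) idea is the basic ingredient behind standard initialization schemes such as Xavier and Kaiming init.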

This is also why, for decades, deep neural networks could not be trained effectively.

Now go back to "To Twitter We Go" in the other notebook.