Why does the initialization matter?

A notebook by Sylvain Gugger (paraphrased by me)
import torch

If we take a vector x with 512 elements and a matrix a of shape (512, 512), initialize both randomly, and then multiply x by a 100 times in a row, as if passing it through a pseudo 100-layer network, what happens?

x = torch.randn(512)
a = torch.randn(512,512)

for i in range(100):
    x = a @ x

x.mean(), x.std()
(tensor(nan), tensor(nan))

This phenomenon is called activation explosion: the activations grow with every multiplication until they overflow and become NaN. We can figure out exactly when that happens:

x = torch.randn(512)
a = torch.randn(512,512)

for i in range(100):
    x = a @ x
    if x.std() != x.std():
        break

The check

if x.std() != x.std():

works because NaN never compares equal to anything, not even itself, so this condition is True only once the standard deviation has become NaN.
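A quick way to convince yourself of that property (my own sketch, not from the original notebook):

import math

nan = float('nan')
print(nan == nan)       # False: NaN is not equal to itself
print(nan != nan)       # True: this is what the check in the loop relies on
print(math.isnan(nan))  # True: the more explicit way to write the same test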

i
27

The loop broke at i = 27: after just 28 multiplications the activations had overflowed to NaN and could no longer be tracked.
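If you want to watch the explosion as it happens, a small variation of the loop (my sketch, not part of the original notebook) prints the standard deviation at every step; with unit-variance weights it grows by roughly sqrt(512) ≈ 22.6× per multiplication until it overflows:

import torch

x = torch.randn(512)
a = torch.randn(512, 512)

for i in range(100):
    x = a @ x
    print(i, x.std().item())    # grows by roughly sqrt(512) ≈ 22.6x each step
    if torch.isnan(x.std()):    # same idea as the x.std() != x.std() trick
        break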

What happens instead if we make a extremely small, so that the activations shrink a little at each step?

x = torch.randn(512)
a = torch.randn(512,512) * 0.01
for i in range(100):
    x = a @ x
    
x.mean(), x.std()
(tensor(0.), tensor(0.))

This time all of our activations vanished to zero. Either way, the initialization matters immensely: it is what gives training a decent starting point.
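One way to see just how much it matters: if we scale the random weights by 1/sqrt(512) (the input dimension), each multiplication roughly preserves the standard deviation instead of blowing it up or shrinking it. This is my own sketch of that idea, not part of the text above:

import math
import torch

x = torch.randn(512)
a = torch.randn(512, 512) / math.sqrt(512)   # entry variance ~1/512 keeps the output std near the input std

for i in range(100):
    x = a @ x

print(x.mean().item(), x.std().item())       # typically stays at a moderate scale, neither nan nor 0

This scale-by-1/sqrt(fan_in) idea is the basic ingredient behind standard initialization schemes such as Xavier and Kaiming init.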

This is also why, for decades, deep neural networks could not be trained effectively.

Now go back to "To Twitter We Go" in the other notebook.