import torch
Why does the initialization matter?
If we take a vector x of shape (1x512) and a matrix a of shape (512x512), initialize them both randomly, and push x through a pseudo 100-layer network (i.e. multiply it by a 100 times), what happens?
x = torch.randn(512)
a = torch.randn(512, 512)

for i in range(100):
    x = a @ x  # one matrix multiplication per pseudo layer

x.mean(), x.std()
(tensor(nan), tensor(nan))
This phenomenon is called activation explosion: the activations grow with every layer until they overflow and become NaN. We can figure out exactly when that happens:
x = torch.randn(512)
a = torch.randn(512, 512)

for i in range(100):
    x = a @ x
    if x.std() != x.std():  # stop as soon as the std becomes NaN
        break
The check if x.std() != x.std() works because a NaN value never compares equal to itself, so the inequality is only True once the standard deviation has become NaN.
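A quick sketch (not from the original notebook) to confirm that property:

nan = float('nan')
nan == nan, nan != nan  # (False, True): NaN never compares equal to itself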
i
27
The loop broke at i = 27, so it took only 28 multiplications (iterations 0 through 27) before the activations exploded to NaN and could no longer be tracked.
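As a rough sanity check (an added sketch, not part of the original experiment): each multiplication by a 512x512 standard-normal matrix scales the standard deviation of the activations by roughly sqrt(512) ≈ 22.6, so after about 28 multiplications the largest values pass the float32 maximum of roughly 3.4e38 and turn into inf/NaN, which matches the break at i = 27.

x = torch.randn(512)  # re-initialize, since x is NaN after the loop above
a = torch.randn(512, 512)

for i in range(5):
    x = a @ x
    # the std grows by roughly sqrt(512) ~ 22.6 at every step
    print(i, x.std().item())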
What happens instead if we make a extremely small and let the activations scale down slowly?
x = torch.randn(512)
a = torch.randn(512, 512) * 0.01  # scale the weights way down

for i in range(100):
    x = a @ x

x.mean(), x.std()
(tensor(0.), tensor(0.))
All of our activations vanished to zero this time, the opposite problem: vanishing activations. The initialization matters immensely for getting a decent starting point.
This is also why, for decades, you couldn't train deep neural networks.
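A minimal sketch of the fix this is building towards (an added example using the standard Xavier/Glorot-style scaling, not code from the original notebook): dividing the weights by sqrt(512) keeps the standard deviation of the activations around 1 through all 100 multiplications, instead of exploding to NaN or vanishing to zero.

import math

x = torch.randn(512)
a = torch.randn(512, 512) / math.sqrt(512)  # each output keeps roughly unit variance

for i in range(100):
    x = a @ x

x.mean(), x.std()  # mean near 0, std of order 1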
Now go back to the other notebook to “To Twitter We Go”