import torch
Why does the initialization matter?
If we take a vector x of shape (1x512) and a matrix a of shape (512x512), initialize them both randomly, and push x through a pseudo 100-layer network (that is, multiply it by a 100 times), what happens?
x = torch.randn(512)
a = torch.randn(512,512)
for i in range(100):
    x = a @ x
x.mean(), x.std()
(tensor(nan), tensor(nan))
This phenomenon is called activation explosion: the activations grow with every layer until they overflow to NaN. We can figure out exactly when that happens:
x = torch.randn(512)
a = torch.randn(512,512)
for i in range(100):
    x = a @ x
    if x.std() != x.std():
        break
The check x.std() != x.std() works because NaN never compares equal to anything, not even itself, so the condition only becomes True once the standard deviation has turned into NaN.
i
27
It took only about 27 multiplications before the activations exploded to NaN and could no longer be tracked.
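As a quick sanity check on that NaN behaviour, here is a small standalone sketch (an illustration added here, not a cell from the notebook) showing why the break condition works:
import torch

nan = torch.tensor(float("nan"))
nan == nan        # tensor(False): NaN is not equal to anything, not even itself
nan != nan        # tensor(True): so this comparison flags a NaN standard deviation
torch.isnan(nan)  # tensor(True): torch.isnan is the more explicit way to test for it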
What happens instead if we scale a down to be extremely small, so the activations shrink slowly at every layer?
x = torch.randn(512)
a = torch.randn(512,512) * 0.01
for i in range(100):
    x = a @ x
x.mean(), x.std()
(tensor(0.), tensor(0.))
This time all of our activations vanished to zero, the opposite failure mode. The initialization matters immensely for giving the network a decent starting point.
This is also a big part of why, for decades, you couldn't train deep neural networks.
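One standard remedy, shown here only as an illustrative sketch rather than as this notebook's own code, is to scale the random weights by 1/sqrt(fan_in) (here 1/sqrt(512)), so that each matrix multiplication roughly preserves the standard deviation of the activations instead of exploding or shrinking it:
import math
import torch

x = torch.randn(512)
a = torch.randn(512,512) / math.sqrt(512)  # each output element then has variance close to 1
for i in range(100):
    x = a @ x
x.mean(), x.std()  # stays finite: mean near 0, std on the order of 1 instead of nan or 0
This 1/sqrt(n) scaling is the core idea behind standard initialization schemes such as Xavier/Glorot and Kaiming init.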
Now go back to the other notebook, to "To Twitter We Go".