normalize(x, mean, std)
get_data()
x_train,y_train,x_valid,y_valid = get_data()
train_mean,train_std = x_train.mean(), x_train.std()
train_mean,train_std
(tensor(0.1304), tensor(0.3073))
We need to normalize our data (mean ~= 0, std ~= 1) using the training set's statistics, so the training and validation sets are on the same scale. If we normalized each with its own statistics, they would effectively be two completely different datasets rather than parts of the same one.
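A minimal sketch of that step, normalizing both sets with the training statistics (matching the normalize(x, mean, std) signature above; the exact cell isn't reproduced here):

def normalize(x, mean, std): return (x - mean) / std

x_train = normalize(x_train, train_mean, train_std)
# note: the validation set is normalized with the *training* mean and std
x_valid = normalize(x_valid, train_mean, train_std)
x_train.mean(), x_train.std()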
(tensor(2.1425e-08), tensor(1.))
test_near_zero(a, tol=0.001)
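A plausible implementation of that helper (an assumption; the body isn't shown in these notes), plus how it would be used on the normalized stats:

def test_near_zero(a, tol=1e-3):
    # raise if the value is not within tol of zero
    assert a.abs() < tol, f"Near zero: {a}"

test_near_zero(x_train.mean())
test_near_zero(1 - x_train.std())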
We initialize the weights with a simplified version of Kaiming init / He init
The size of our fully-connected hidden layer (number of nodes)
The first weight matrix of our model: the first layer, shape (784, 50)
The bias for that layer
The second weight matrix of our model: the second layer, shape (50, 1)
The bias for that layer
Simplified Kaiming init / He init:
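The init cell itself isn't reproduced here; a sketch of what it likely looked like, assuming 784-pixel inputs and a small lin(x, w, b) helper for the linear layer:

import math, torch

n_in, nh = 784, 50                      # input size and hidden layer size

def lin(x, w, b): return x @ w + b      # a minimal linear layer

# simplified Kaiming/He init: scale the random weights by 1/sqrt(fan_in)
w1 = torch.randn(n_in, nh) / math.sqrt(n_in)
b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) / math.sqrt(nh)
b2 = torch.zeros(1)

w1.shape, b1.shape, w2.shape, b2.shape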
(torch.Size([784, 50]), torch.Size([50]), torch.Size([50, 1]), torch.Size([1]))
# So should this, because we used Kaiming init, which is designed to have this effect
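Here t is presumably the output of the first linear layer applied to the (normalized) validation set:

t = lin(x_valid, w1, b1)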
t.mean(), t.std()
(tensor(-0.0417), tensor(1.0341))
"While there are other ways of writing that, if you can find a function attached to a tensor for the thing you want to do, it will almost always be faster because it will be written in C." - Jeremy Howard
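The tensor method in question here is clamp_min. A sketch of the ReLU used at this point and of applying it after the first layer (variable names follow the sketch above):

def relu(x): return x.clamp_min(0.)     # zero out everything below 0

t = relu(lin(x_valid, w1, b1))
t.mean(), t.std()                        # the stats drift away from (0, 1)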
Uh oh! What went wrong?
Basically, we took everything below zero and just got rid of it. As a result we lost a ton of information, and our mean and standard deviation swung drastically.
\[\operatorname{std}=\sqrt{\frac{2}{\left(1+a^2\right)\times\text{fan\_in}}}\]
The solution is to stick a two on top:
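A sketch of that fix, scaling the weights by sqrt(2/fan_in) instead of 1/sqrt(fan_in):

w1 = torch.randn(n_in, nh) * math.sqrt(2 / n_in)
t = relu(lin(x_valid, w1, b1))
t.mean(), t.std()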
(tensor(0.5535), tensor(0.8032))
While this fixed the standard deviation, our mean is now around half (~0.5), because we still deleted everything below zero
(tensor(0.0372), tensor(0.8032))
Jeremy tried seeing what would happen if, during the ReLU, we subtracted 0.5 from the result, and it seems to have helped some in returning us to the correct mean:
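A sketch of that tweak:

def relu(x): return x.clamp_min(0.) - 0.5   # shift the output back down toward a zero mean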
How well does this work in practice? To test it, I should try building a very basic CNN, training it on ImageWoof, with the only variation being which ReLU layer is used.
(tensor(-0.0279), tensor(0.7500))
2.56 ms ± 578 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note: it's better to pass an explicit dimension, such as squeeze(-1) or squeeze(1), than to call squeeze() with no arguments, which can silently drop a batch dimension of size 1.
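For example, in an MSE-style loss (a sketch; the exact cell isn't shown here):

def mse(output, targ):
    # squeeze the trailing unit dimension explicitly; a bare squeeze() would also
    # drop the batch dimension if the batch size happened to be 1
    return (output.squeeze(-1) - targ).pow(2).mean()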
Chain rule, chain rule, chain rule!
Start with our last function and go backwards:
Gradients need to be attached to the inputs so they can be passed back through all of the functions: the output of the previous layer is the input to the current one.
This is the derivative of (inp - targ)^2 / len(inp)
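A sketch of that gradient function, storing the result on the input's .g attribute:

def mse_grad(inp, targ):
    # derivative of mean((inp - targ)^2) with respect to inp: 2*(inp - targ)/n
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]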
The inp>0 part is familiar, but by the chain rule we also need to multiply it by the gradient flowing back from the following layer
Since anything negative going into a ReLU is set to 0, it has no slope and thus a derivative of 0; only the values above 0 contribute a gradient
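A sketch of that ReLU gradient:

def relu_grad(inp, out):
    # slope is 1 where the input was positive, 0 elsewhere; multiply by the
    # gradient flowing back from the next layer (chain rule)
    inp.g = (inp > 0).float() * out.g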
The gradient of a matrix product with respect to its input is the upstream gradient multiplied by the transpose of the weight matrix
We also need the gradient of the outputs with respect to the weights
And the gradient of the outputs with respect to the biases
The inputs to these gradient functions are the original input, the output, and the rest of the arguments that were passed originally
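A sketch of the full linear-layer gradient:

def lin_grad(inp, out, w, b):
    # gradient of the output with respect to the input, the weights, and the bias
    inp.g = out.g @ w.t()
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)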
This pattern continues until we arrive back at the first linear layer; in total we travel through the model and the loss function twice, once forward and once backward
Backprop is just the chain rule, while making sure all of the intermediate calculations are saved somewhere
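Putting the pieces together, the full pass might look something like this (a sketch using the helpers sketched above):

def forward_and_backward(inp, targ):
    # forward pass
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    out = lin(l2, w2, b2)
    loss = mse(out, targ)
    # backward pass: the same layers, in reverse order
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)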
And now we cheat with PyTorch's autograd to check our results:
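A sketch of that check, cloning the tensors with requires_grad and comparing PyTorch's gradients against our hand-computed .g attributes (variable names here are assumptions):

# assuming the targets have already been converted to floats
xt2 = x_train.clone().requires_grad_(True)
w12 = w1.clone().requires_grad_(True); b12 = b1.clone().requires_grad_(True)
w22 = w2.clone().requires_grad_(True); b22 = b2.clone().requires_grad_(True)

def forward(inp, targ):
    l2 = relu(lin(inp, w12, b12))
    return mse(lin(l2, w22, b22), targ)

loss = forward(xt2, y_train)
loss.backward()
# w22.grad, b22.grad, w12.grad, ... can now be compared against w2.g, b2.g, w1.g, ...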
This lets the class be called like ReLU()(x) and perform the operation
This is our backward pass from earlier, but we store the result inside self.inp.g
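The ReLU class those two notes describe would look roughly like this (a sketch; the -0.5 shift from earlier could also be added to the forward):

class ReLU():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out

    def backward(self):
        # same ReLU gradient as before, stored on the input's .g
        self.inp.g = (self.inp > 0).float() * self.out.g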
class Linear():
    def __init__(self, w, b):
        self.w, self.b = w, b

    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out

    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        # Creating a giant outer product just to sum it together like this is very inefficient; we'll do it all at once with einsum below
        self.w.g = (self.inp.unsqueeze(-1) * self.out.g.unsqueeze(1)).sum(0)
        self.b.g = self.out.g.sum(0)
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Linear(w1,b1), ReLU(), Linear(w2,b2)]
        self.loss = MSE()

    def __call__(self, x, targ):
        for layer in self.layers:
            x = layer(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for layer in reversed(self.layers):
            layer.backward()
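Model relies on an MSE class that isn't shown above; a sketch consistent with the earlier mse and mse_grad:

class MSE():
    def __call__(self, inp, targ):
        self.inp, self.targ = inp, targ
        self.out = (inp.squeeze(-1) - targ).pow(2).mean()
        return self.out

    def backward(self):
        self.inp.g = 2. * (self.inp.squeeze(-1) - self.targ).unsqueeze(-1) / self.inp.shape[0]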
CPU times: user 77.4 ms, sys: 26.8 ms, total: 104 ms
Wall time: 13.2 ms
CPU times: user 3.62 s, sys: 2.41 s, total: 6.03 s
Wall time: 893 ms
class Module():
    "Basic class that will implement .backward() and store the args and outputs from the forward function"
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out

    def forward(self):
        raise NotImplementedError("You need to define the forward function still!")

    def backward(self):
        self.bwd(self.out, *self.args)
class Linear(Module):
    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, inp):
        return inp @ self.w + self.b

    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        # Creating a giant outer product just to sum it together is very inefficient. Do it all at once!
        self.w.g = torch.einsum("bi,bj->ij", inp, out.g)
        self.b.g = out.g.sum(0)
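The matching Module subclasses for the ReLU and the loss aren't reproduced here; they would look roughly like this:

class ReLU(Module):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp > 0).float() * out.g

class MSE(Module):
    def forward(self, inp, targ): return (inp.squeeze(-1) - targ).pow(2).mean()
    def bwd(self, out, inp, targ):
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]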
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Linear(w1,b1), ReLU(), Linear(w2,b2)]
        self.loss = MSE()

    def __call__(self, x, targ):
        for layer in self.layers:
            x = layer(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for layer in reversed(self.layers):
            layer.backward()
CPU times: user 142 ms, sys: 0 ns, total: 142 ms
Wall time: 19.9 ms
CPU times: user 234 ms, sys: 157 ms, total: 391 ms
Wall time: 49.7 ms
We have now implemented both of these ourselves, and thus we're allowed to use PyTorch's versions
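A sketch of what that PyTorch-based model might look like (using nn.Linear and nn.Module; the exact cell isn't shown here, and mse is the function sketched earlier):

from torch import nn

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]

    def __call__(self, x, targ):
        for layer in self.layers:
            x = layer(x)
        return mse(x, targ)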
CPU times: user 129 ms, sys: 2.81 ms, total: 131 ms
Wall time: 19.7 ms
CPU times: user 105 ms, sys: 0 ns, total: 105 ms
Wall time: 16 ms