For greyscale, each pixel is one number indicating how light or dark it is (from white to black). For colour, each pixel is three numbers indicating the intensity of red, green, and blue.
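A minimal sketch of the two representations, using tiny made-up tensors (the shapes and values are illustrative, not from the source):

    import torch
    # A tiny 2x2 greyscale "image": one number per pixel (0-255)
    grey = torch.tensor([[0, 128], [200, 255]])
    # A 2x2 colour image: three numbers (red, green, blue) per pixel
    colour = torch.randint(0, 256, (2, 2, 3))
    print(grey.shape)    # torch.Size([2, 2])
    print(colour.shape)  # torch.Size([2, 2, 3])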
The MNIST_SAMPLE dataset is structured similarly to ImageNet: the main folder contains train and valid folders, and train/valid each contain a separate folder per category. Keeping a dedicated validation folder like this is the standard layout for deep learning datasets.
Pixel similarity: find the "ideal" version of each digit by averaging all the training images of that digit, then classify an input by taking the mean absolute difference (or root mean squared difference) between it and each ideal image and picking the closest.
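A minimal sketch of pixel similarity, using random tensors as stand-ins for the real stacks of MNIST_SAMPLE images (variable names and shapes are assumptions for illustration):

    import torch
    # Stand-ins for the stacked training images, shape (n_images, 28, 28), values in [0, 1]
    stacked_threes = torch.rand(100, 28, 28)
    stacked_sevens = torch.rand(100, 28, 28)
    mean3 = stacked_threes.mean(0)   # the "ideal" 3: pixel-wise average
    mean7 = stacked_sevens.mean(0)   # the "ideal" 7

    def mnist_distance(a, b):
        # mean absolute difference over the two pixel axes
        return (a - b).abs().mean((-1, -2))

    def is_3(x):
        # closer to the ideal 3 than to the ideal 7?
        return mnist_distance(x, mean3) < mnist_distance(x, mean7)

    print(is_3(torch.rand(28, 28)))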
A list comprehension is a concise way to create a list from an iterable, e.g. a = [x for x in range(10)]; [2*x for x in a if x%2 > 0] selects the odd numbers and doubles them.
A rank-3 tensor is a tensor with 3 dimensions (axes).
Shape tells you how long each axis is; rank is the number of axes, i.e. len(shape).
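A small example covering both points above:

    import torch
    t = torch.zeros(2, 3, 4)        # a rank-3 tensor
    print(t.shape)                  # torch.Size([2, 3, 4]) -> how long each axis is
    print(len(t.shape), t.ndim)     # 3 3 -> rank is the number of axes, i.e. len(shape)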
RMSE and L1 norm are two ways to measure the difference between two tensors (e.g. predictions and targets). RMSE stands for root mean squared error; the L1 norm is the mean absolute difference.
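A quick sketch of computing both on two small tensors:

    import torch
    a = torch.tensor([1., 2., 3.])
    b = torch.tensor([1., 4., 7.])
    l1 = (a - b).abs().mean()             # L1 norm / mean absolute difference -> 2.0
    rmse = ((a - b) ** 2).mean().sqrt()   # root mean squared error -> about 2.58
    print(l1, rmse)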
Vectorize the calculations (apply them to whole tensors at once), and then run them on a GPU.
a = (torch.tensor(range(1, 10)) * 2).view(3, 3); a[1:3, 1:3]
Broadcasting is when a math operation happens between two tensors with different shapes: the tensor with the smaller shape is automatically expanded (without copying data) to match the shape of the larger one.
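For example, adding a rank-1 tensor to a rank-2 tensor broadcasts the smaller one across each row:

    import torch
    m = torch.ones(3, 3)               # shape (3, 3)
    v = torch.tensor([1., 2., 3.])     # shape (3,)
    print(m + v)                       # v is broadcast across every row of m:
    # tensor([[2., 3., 4.],
    #         [2., 3., 4.],
    #         [2., 3., 4.]])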
Metrics are calculated on the validation set, to check that the model is generalizing rather than overfitting to the training data.
SGD stands for stochastic gradient descent, and is a method to optimize neural network weights.
Mini-batches speed up training compared with going one example at a time (better use of the GPU), and give a more stable estimate of the gradient than a single example.
Initialize parameters, predict (forward prop), calculate loss, calculate gradients (backprop), step (update weights), repeat, stop.
Randomly. Although with certain activations there are particular statistics used for the random initialization, e.g. ReLU -> He initialization.
Loss is the measure the training algorithm uses to optimize the model's parameters.
a high learning rate could cause the loss to increase after each update
gradient is the slope of a function at a point
If you use a framework, no. PyTorch calculates them for you automatically (autograd).
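A tiny example of PyTorch's autograd computing a gradient for you (illustrative values):

    import torch
    x = torch.tensor(3., requires_grad=True)
    y = x ** 2          # the function we want the slope of
    y.backward()        # autograd computes dy/dx
    print(x.grad)       # tensor(6.) -> slope of x**2 at x = 3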
Accuracy is not fine-grained enough: its gradient is zero almost everywhere (it only changes when a prediction flips between classes), so the weights would barely update.
The sigmoid goes between 0 and 1: sigmoid(x) = 1/(1+exp(-x)). It is smooth, always increasing, and crosses 0.5 at x = 0.
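A small check of the sigmoid's shape (sample points rather than an actual plot):

    import torch
    def sigmoid(x): return 1 / (1 + torch.exp(-x))
    xs = torch.linspace(-4, 4, 9)
    print(sigmoid(xs))          # rises smoothly from near 0 to near 1, hitting 0.5 at x = 0
    print(torch.sigmoid(xs))    # PyTorch's built-in gives the same values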
Loss is the measure the algorithm uses to drive optimization. A metric is the measure humans use to judge performance.
w = w - lr*w.grad, or in PyTorch: params.data -= lr * params.grad
The DataLoader class takes a collection (e.g. a list of (x, y) tuples) and batches and shuffles it for your model.
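A minimal sketch using PyTorch's torch.utils.data.DataLoader (fastai's DataLoader behaves similarly for this purpose); the data here is made up:

    import torch
    from torch.utils.data import DataLoader
    # a dataset is just a collection of (x, y) tuples
    ds = list(zip(torch.arange(10).float(), torch.arange(10) % 2))
    dl = DataLoader(ds, batch_size=4, shuffle=True)
    xb, yb = next(iter(dl))       # one shuffled mini-batch of inputs and labels
    print(xb.shape, yb.shape)     # torch.Size([4]) torch.Size([4])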
Forward prop, calculate the loss, backprop, update the weights:
    pred = model(x)
    loss = loss_func(pred, y)
    loss.backward()
    for p in params:
        p.data -= p.grad * lr
        p.grad.zero_()
To turn [1,2,3,4] and 'abcd' into [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]: [x for x in zip([1,2,3,4], 'abcd')], or simply list(zip([1,2,3,4], 'abcd')). What is special about that output is that it is a list of (x, y) tuples, which is exactly the Dataset structure a DataLoader expects.
In PyTorch, view reshapes a tensor without changing its contents or copying its data.
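For example (shapes chosen just for illustration):

    import torch
    x = torch.arange(6)
    print(x.view(2, 3))    # same six elements, seen as 2 rows x 3 columns
    print(x.view(-1, 2))   # -1 lets PyTorch infer that axis (here 3 rows x 2 columns)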
Bias parameters are randomly initialized parameters added in the forward-pass equation, y = w*x + b. The bias b lets the output shift away from zero; without it, the prediction would be forced to 0 whenever the input is 0.
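A small sketch of a linear step with a bias (random data; the names are my own):

    import torch
    x = torch.randn(5, 3)                       # 5 examples, 3 features
    w = torch.randn(3, 1, requires_grad=True)   # weights
    b = torch.randn(1, requires_grad=True)      # bias
    y = x @ w + b                               # forward pass; without b, x = 0 would force y = 0
    print(y.shape)                              # torch.Size([5, 1])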
In Python, the @ operator performs matrix multiplication (equivalent to np.matmul / torch.matmul); it is not element-wise multiplication (that is *).
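A quick comparison of @ (matrix multiplication) with * (element-wise multiplication):

    import torch
    a = torch.tensor([[1., 2.], [3., 4.]])
    b = torch.tensor([[5., 6.], [7., 8.]])
    print(a @ b)   # matrix product:       [[19., 22.], [43., 50.]]
    print(a * b)   # element-wise product: [[ 5., 12.], [21., 32.]]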
The backward method calculates gradients by backpropagating from the tensor it is called on, e.g. loss.backward() fills in the .grad attribute of every parameter involved in computing loss.
We zero the gradients so that each update reflects only the current step, not previous steps: loss.backward() adds (accumulates) new gradients into .grad rather than replacing them.
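A tiny demonstration of backward and of why the gradients must be zeroed (illustrative values):

    import torch
    w = torch.tensor(3., requires_grad=True)
    (w * 2).backward()
    print(w.grad)      # tensor(2.)
    (w * 2).backward()
    print(w.grad)      # tensor(4.) -> the new gradient was added to the old one
    w.grad.zero_()
    print(w.grad)      # tensor(0.) -> ready for the next step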
Learner needs the DataLoaders, the model, an optimization function, a loss function, and any metrics: Learner(dls, net, opt_func, loss_func, metrics).
for i in range(epochs):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad.data * lr
            p.grad.zero_()
relu(x) = max(x, 0): ReLU is 0 for x <= 0 and equal to x for x > 0. Plotted from -2 to +2, it is flat at zero up to x = 0, then a straight line of slope 1.
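Sample values of ReLU over the range -2 to +2 (a quick check rather than an actual plot):

    import torch
    xs = torch.linspace(-2, 2, 9)
    print(torch.relu(xs))
    # tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 1.0000, 1.5000, 2.0000])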
An activation function is a nonlinearity applied after the linear y = w*x + b step; without it, a stack of linear layers would collapse into a single linear layer.
F.relu is the function form; nn.ReLU is a PyTorch module (layer) that does the same thing and can be used inside nn.Sequential.
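For example:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    x = torch.tensor([-1.0, 0.0, 2.0])
    print(F.relu(x))       # function form: tensor([0., 0., 2.])
    print(nn.ReLU()(x))    # module form, same result; handy inside nn.Sequential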
We use more nonlinearities, which means more layers, because it performs better in practice: for the same performance, deeper networks tend to need fewer parameters and less computation.
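An illustrative comparison (the layer sizes are my own, not from the source): a deeper, narrower network can have fewer parameters than a shallow, wide one.

    import torch.nn as nn
    shallow = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
    deeper = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                           nn.Linear(256, 128), nn.ReLU(),
                           nn.Linear(128, 10))
    print(sum(p.numel() for p in shallow.parameters()))   # 407050
    print(sum(p.numel() for p in deeper.parameters()))    # 235146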