Neural Networks Learning Notes

All Screenshots below are from StatQuest.

Part1: Inside the Black Box

Background of the Example

CleanShot_2021-12-10_at_22.22.37

A neural network consists of nodes and connections between the nodes.

CleanShot_2021-12-10_at_22.31.34

The network starts out with unknown parameter values that are estimated using Backpropagation.

CleanShot_2021-12-10_at_22.38.04

CleanShot_2021-12-10_at_22.38.54

There are many common bent or curved lines that we can choose from for a Neural Network.

CleanShot_2021-12-10_at_22.42.11
  1. CleanShot_2021-12-10_at_22.42.35
  2. CleanShot_2021-12-10_at_22.42.54

The curves are called Activation Functions.

  • Hidden Layers: the nodes between the input and output nodes. You have to decide how many hidden layers you want and how many nodes go into each hidden layer.

On the connections between nodes, the parameters we add are called biases, and the parameters we multiply by are called weights.
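A minimal sketch of a single connection and node, assuming the SoftPlus activation from the example (the weight and bias values are illustrative stand-ins, not the fitted parameters):

```python
import numpy as np

def softplus(x):
    """SoftPlus activation: a common smooth 'bent line'."""
    return np.log(1.0 + np.exp(x))

def node_output(x, weight, bias):
    """On a connection we multiply by a weight, add a bias,
    then the node bends the result with the activation function."""
    return softplus(x * weight + bias)

# Illustrative values only -- in a real network these are
# estimated with Backpropagation.
print(node_output(x=0.5, weight=-34.4, bias=2.14))
```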

Part2: Backpropagation

  1. Using the Chain Rule to calculate derivatives.
  2. Plugging the derivatives into Gradient Descent to optimize parameters.
CleanShot_2021-12-11_at_00.04.34

Bias terms are frequently initialized to 0.

CleanShot_2021-12-11_at_00.05.20
  • Quantify how well the green squiggle fits the data by calculating the Sum of the Squared Residuals (SSR).
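A quick sketch of the SSR calculation (the observed and predicted values are hypothetical):

```python
import numpy as np

observed = np.array([0.0, 1.0, 0.0])   # hypothetical training data
predicted = np.array([0.2, 0.8, 0.1])  # values from the green squiggle

# Sum of the Squared Residuals: residual = observed - predicted
ssr = np.sum((observed - predicted) ** 2)
print(ssr)  # 0.09
```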

The process of Gradient Descent:

  • Then calculate the optimal value where we get the smallest SSR (use Gradient Descent to find this value relatively quickly). CleanShot_2021-12-11_at_00.09.07
CleanShot_2021-12-11_at_00.10.29 CleanShot_2021-12-11_at_00.13.37

CleanShot_2021-12-11_at_00.15.10

Using the step size, we can get the new value: new_value = old_value - step_size.
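For example, with hypothetical numbers:

```python
learning_rate = 0.1
derivative = -5.7          # slope of the SSR curve at the current value
old_value = 0.0

step_size = derivative * learning_rate  # -0.57
new_value = old_value - step_size       # 0.57: we move against the slope
print(new_value)
```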

Details p1

How the Chain Rule and Gradient Descent are used to optimize multiple parameters.

Assume we still have w_3, w_4, and b_3 to optimize.

  1. Randomly select 2 values for w_3 and w_4 from a Standard Normal Distribution.

  2. CleanShot_2021-12-11_at_10.02.11@2x
  3. Optimize b_3 even if w_3 and w_4 are not optimal.

    The reason this works is that dSSR/db_3 does not depend on w_3 or w_4.

  4. CleanShot_2021-12-11_at_10.14.16@2x
  5. Calculate the derivatives (see the Chain Rule sketch after this list).

CleanShot_2021-12-11_at_10.15.38@2x CleanShot_2021-12-11_at_10.17.30@2x CleanShot_2021-12-11_at_10.19.17@2x
  • One symbol represents the value we get by plugging the input into the Activation Function.
  • The other represents the value we get on the connection by plugging in that activation output.
  1. Calculate the Step Size: step_size = derivative × learning_rate
  2. new (weight or bias) = old_one - step_size
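A sketch of the Chain Rule step for the SSR (a standard derivation; it assumes b_3 is added directly to predicted_i, so d predicted_i/db_3 = 1):

```latex
\frac{dSSR}{db_3}
  = \sum_{i=1}^{n} -2\,(\text{observed}_i - \text{predicted}_i)
      \times \frac{d\,\text{predicted}_i}{db_3}
  = \sum_{i=1}^{n} -2\,(\text{observed}_i - \text{predicted}_i)
```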

Repeat the steps until:

  • the performance is no longer improving very much
  • we reach the step limit
  • …or meet other criteria
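Putting the steps together, a minimal single-parameter sketch; the stand-in model predicted_i = activation_output_i + b_3 and all numbers are hypothetical, not the real network:

```python
def dssr_db3(b3, data):
    """Derivative of the SSR with respect to b_3 (from the Chain Rule).
    Here predicted_i = activation_output_i + b3 for illustration."""
    return sum(-2 * (obs - (act + b3)) for act, obs in data)

data = [(0.2, 0.0), (0.8, 1.0), (0.1, 0.0)]  # (activation output, observed)
b3 = 0.0                 # bias terms are frequently initialized to 0
learning_rate = 0.1

for step in range(1000):                 # step limit
    step_size = dssr_db3(b3, data) * learning_rate
    b3 -= step_size                      # new value = old value - step size
    if abs(step_size) < 1e-6:            # no longer improving very much
        break

print(b3, step)
```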

Details p2

CleanShot_2021-12-11_at_10.39.21@2x CleanShot_2021-12-11_at_10.41.22@2x CleanShot_2021-12-11_at_10.42.57@2x CleanShot_2021-12-11_at_10.45.17@2x

CleanShot_2021-12-11_at_10.53.54@2x

Part3 ReLU

CleanShot_2021-12-11_at_11.08.15@2x

Part4

CleanShot_2021-12-11_at_11.27.47@2x

Part5 ArgMax and SoftMax

We can’t use ArgMax for Backpropagation, because its output has a slope of 0 everywhere, so the derivatives are useless for Gradient Descent.

When people want to use ArgMax for output, they often use SoftMax for training.

CleanShot_2021-12-12_at_23.29.08

The output values of SoftMax range from 0 to 1.

The SoftMax output values can be interpreted as predicted “probabilities”, to some extent. We cannot put much trust in them, because these so-called probabilities depend on the weights and biases, which may just be initialized to random values.
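A minimal sketch of SoftMax with hypothetical raw output values (subtracting the max before exp() is a standard numerical-stability trick):

```python
import numpy as np

def softmax(raw_outputs):
    """Squash raw output values into 'probabilities' in (0, 1)
    that sum to 1."""
    e = np.exp(raw_outputs - np.max(raw_outputs))
    return e / e.sum()

raw = np.array([1.43, -0.4, 0.23])  # hypothetical raw output values
probs = softmax(raw)
print(probs, probs.sum())           # ~[0.68 0.11 0.21], sums to 1.0

# ArgMax just picks the index of the largest value; its slope is 0
# everywhere, which is why it is not used during Backpropagation.
print(np.argmax(probs))             # 0
```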

SoftMax used in Backpropagation

and….

CleanShot_2021-12-13_at_00.03.30

When we use the SoftMax function, the outputs become “probabilities” between 0 and 1.

Cross Entropy, thus, is often used to determine how well the Neural Network fits the data.

Part6 Cross Entropy

CleanShot_2021-12-13_at_00.24.47 CleanShot_2021-12-13_at_00.22.25

CleanShot_2021-12-13_at_00.23.02

When calculating CrossEntropy_i, only the observed class has Observed_i = 1; all the others are 0.

  • The total error is the sum of the Cross Entropy values.
  • Residual = 1 - predicted_probability.
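A sketch of the Cross Entropy calculation with hypothetical one-hot observed values:

```python
import numpy as np

# One row per data point: predicted SoftMax probabilities and the
# one-hot observed values (only the observed class is 1).
predicted = np.array([[0.68, 0.11, 0.21],
                      [0.05, 0.90, 0.05]])
observed = np.array([[1, 0, 0],
                     [0, 1, 0]])

# CrossEntropy_i = -sum_j Observed_ij * log(Predicted_ij); because the
# observed values are one-hot, only the observed class's term survives.
cross_entropy = -np.sum(observed * np.log(predicted), axis=1)
total_error = cross_entropy.sum()  # total error = sum of Cross Entropy
print(cross_entropy, total_error)
```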

Part7: Cross Entropy and Backpropagation

CleanShot_2021-12-13_at_15.26.09@2x CleanShot_2021-12-13_at_15.32.38@2x

BAM!!

CleanShot_2021-12-13_at_15.43.48@2x CleanShot_2021-12-13_at_15.44.30@2x

WHY?

The total cross entropy is calculated by Total Cross Entropy = CrossEntropy_1 + CrossEntropy_2 + … + CrossEntropy_n (where n is the number of data points).

When we want to take the derivative, in this case with respect to b_3:

For every predicted value, only one component (in this case, Setosa) is determined by b_3, so…

CleanShot_2021-12-13_at_15.50.19@2x
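Written out, this is the standard SoftMax-plus-Cross-Entropy result (assuming b_3 is the bias added to the Setosa raw output):

```latex
\frac{d\,\text{Total Cross Entropy}}{db_3}
  = \sum_{i=1}^{n}\left(p_{\text{setosa},i} - \text{observed}_{\text{setosa},i}\right)
```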

Part 8: Convolutional Neural Networks

CleanShot_2021-12-17_at_16.44.28 CleanShot_2021-12-17_at_16.44.53

3 Things:

  1. Reduce the number of input nodes
  2. Tolerate small shifts in where the pixels are in the image.
  3. Take advantage of the correlations that we observe in complex images

Steps:

  1. Take a Filter (aka Kernel). The intensity of each pixel in the filter is determined by Backpropagation.

  2. CleanShot_2021-12-17_at_16.48.33

    And we can say the Filter is convolved with the input.

  3. Move the filter over one pixel at a time, and we get a Feature Map.

    Each cell in the Feature Map corresponds to a group of neighboring pixels.

    CleanShot_2021-12-17_at_16.52.21
  4. Run the feature map through the ReLU function.

    CleanShot_2021-12-17_at_16.56.16

  5. We then select the maximum value in each region, and this pooling filter usually moves in such a way that it does not overlap itself (see the sketch after this list).

    • When we select the maximum value in each region, we are applying Max Pooling.
    • When we calculate the average value for each region, it is called Average (or Mean) Pooling.
  6. CleanShot_2021-12-17_at_17.01.17
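A minimal end-to-end sketch of these steps (random values stand in for real pixel intensities and a learned filter):

```python
import numpy as np

def convolve(image, kernel):
    """Slide the filter (kernel) over the image one pixel at a time;
    each output cell summarizes a group of neighboring pixels."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping pooling: keep the max of each size x size region."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = feature_map[r*size:(r+1)*size,
                                    c*size:(c+1)*size].max()
    return out

image = np.random.rand(6, 6)   # stand-in for pixel intensities
kernel = np.random.rand(3, 3)  # in practice, learned by Backpropagation

feature_map = convolve(image, kernel)        # filter convolved with input
feature_map = np.maximum(feature_map, 0)     # ReLU
pooled = max_pool(feature_map)               # Max Pooling
print(pooled.shape)                          # (2, 2)
```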